ICSE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil
Fri 17 Apr 2026 11:45 - 12:00 at Asia I - AI for Software Engineering 20 Chair(s): Ipek Ozkaya

As software systems increasingly rely on natural language interfaces, ensuring the reliability of these systems is crucial. One critical component is the ability to accurately translate natural language queries into corresponding SQL queries, a task known as Text-to-SQL. However, the scarcity of high-quality, large-scale, domain-specific Text-to-SQL datasets hinders the development of reliable and robust models. To address this challenge, we propose SelectCraft, a novel automatic generation approach designed to create realistic Text-to-SQL datasets tailored to specific domains. Our method leverages existing databases and their structures to generate complex text-SQL pairs that mirror real-world usage scenarios. As a proof of concept, we used the approach to generate a financial Text-to-SQL dataset, named BanQies, containing over 1 million samples. Moreover, we introduce BanQL, a new large language model (LLM) based on StarCoder2, a state-of-the-art code LLM, and fine-tuned on our newly created dataset. We evaluate BanQL's performance against several state-of-the-art models, demonstrating significant improvements in accuracy and generalizability and highlighting the advantages of incorporating domain-specific data in Text-to-SQL tasks. We believe that our contributions have the potential to improve the overall reliability of Text-to-SQL software systems.
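
To make the idea of deriving text-SQL pairs from an existing database's structure more concrete, the following is a minimal sketch and not the SelectCraft pipeline itself, whose details are not given in this abstract. It assumes a simple template-based strategy over a SQLite schema; the function names `generate_pairs` and `keep_executable`, the question templates, and the `accounts` table are all hypothetical and chosen only for illustration.

```python
import sqlite3


def generate_pairs(conn):
    """Build (question, sql) pairs from fixed templates over each table and column.

    Assumption: a toy template strategy, used here only to illustrate
    schema-driven generation. It is not the method described in the paper.
    """
    pairs = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        pairs.append((f"How many rows are in {table}?",
                      f'SELECT COUNT(*) FROM "{table}"'))
        for _cid, name, col_type, *_rest in columns:
            pairs.append((f"List the distinct {name} values in {table}.",
                          f'SELECT DISTINCT "{name}" FROM "{table}"'))
            # Only numeric columns get an aggregate question.
            if col_type.upper() in ("INTEGER", "REAL", "NUMERIC"):
                pairs.append((f"What is the average {name} in {table}?",
                              f'SELECT AVG("{name}") FROM "{table}"'))
    return pairs


def keep_executable(conn, pairs):
    """Drop any pair whose SQL fails to run against the source database."""
    kept = []
    for question, sql in pairs:
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue
        kept.append((question, sql))
    return kept


# Hypothetical financial-style table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("alice", 120.0), ("bob", 80.0)])

pairs = keep_executable(conn, generate_pairs(conn))
for question, sql in pairs:
    print(question, "->", sql)
```

Executing every generated query against the source database is one inexpensive way to discard malformed samples. A realistic generator would also need joins, filters, nested queries, and more natural question phrasing, none of which this sketch attempts.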

Fri 17 Apr

Displayed time zone: Brasilia, Distrito Federal, Brazil

11:00 - 12:30
AI for Software Engineering 20
New Ideas and Emerging Results (NIER) / Research Track / Journal-first Papers at Asia I
Chair(s): Ipek Ozkaya Carnegie Mellon University
11:00
15m
Talk
Is Hyper-Parameter Optimization Different for Software Analytics?
Journal-first Papers
Rahul Yedida LexisNexis, Tim Menzies North Carolina State University
11:15
15m
Talk
On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization
Journal-first Papers
Giuseppe Crupi Università della Svizzera italiana, Rosalia Tufano Università della Svizzera Italiana, Alejandro Velasco William & Mary, Antonio Mastropaolo William and Mary, USA, Denys Poshyvanyk William & Mary, Gabriele Bavota Software Institute @ Università della Svizzera Italiana
11:30
15m
Talk
A Catalog of Data Smells for Coding Tasks
Journal-first Papers
Antonio Vitale Politecnico di Torino, University of Molise, Rocco Oliveto University of Molise, Simone Scalabrino University of Molise
11:45
15m
Talk
Towards Automating Domain-Specific Data Generation for Text-to-SQL: A Comprehensive Approach
Journal-first Papers
Salmane Chafik UM6P College of Computing, Saad Ezzini King Fahd University of Petroleum and Minerals, Ismail Berrada UM6P College of Computing
12:00
15m
Talk
Empirical and Sustainability Aspects of Software Engineering Research in the Era of Large Language Models: A Reflection
New Ideas and Emerging Results (NIER)
David Williams University College London, Maria Kechagia National and Kapodistrian University of Athens, Max Hort Simula Research Laboratory, Aldeida Aleti Monash University, Justyna Petke University College London, Federica Sarro University College London
12:15
15m
Talk
FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction (Virtual Attendance)
Research Track
Jiachi Chen Sun Yat-sen University, Yiming Shen Sun Yat-sen University, Jiashuo Zhang Peking University, China, Zihao Li Hong Kong Polytechnic University, John Grundy Monash University, Zhenzhe Shao Sun Yat-sen University, Yanlin Wang Sun Yat-sen University, Jiashui Wang Zhejiang University, Ting Chen University of Electronic Science and Technology of China, Zibin Zheng Sun Yat-sen University