ICSE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil
Fri 17 Apr 2026 17:15 - 17:30 at Oceania VII - Software Engineering for AI 8 Chair(s): Sheila Reinehr

Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, leading to their adoption in high-stakes domains such as healthcare, law, and scientific research. However, their reasoning often contains subtle logical errors masked by fluent language, posing significant risks for critical applications. While existing approaches like fact-checking, self-consistency methods, and rule-based validation provide partial solutions, they fail to detect complex logical flaws in multi-step reasoning.

To overcome these challenges, we present MATP, an evaluation framework for systematically verifying LLM reasoning via Multi-step Automatic Theorem Proving. MATP translates natural language reasoning into First-Order Logic (FOL) and applies automated theorem provers to assess step-by-step logical validity. This approach identifies hidden logical errors and provides fine-grained classifications of reasoning correctness. Evaluations on a benchmark comprising 10,830 reasoning instances generated by 10 LLMs across tasks from PrOntoQA-OOD, ProofWriter, and FOLIO show that MATP surpasses prompting-based baselines by over 42 percentage points in reasoning step verification. It further reveals model-level disparities: reasoning models produce more logically coherent outputs than general-purpose models. These results demonstrate MATP’s potential to enhance the trustworthiness of LLM-generated reasoning.
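The core check behind this kind of step-wise verification is logical entailment: a reasoning step is valid only if its conclusion follows from its stated premises. MATP performs this in first-order logic with automated theorem provers; the snippet below is only a simplified, hypothetical propositional sketch of the same idea (premises entail a conclusion iff no truth assignment satisfies the premises while falsifying the conclusion), not the authors' implementation. The `entails` helper and the example formulas are illustrative assumptions.

```python
from itertools import product

def entails(premises, conclusion, symbols):
    """Return True iff the premises logically entail the conclusion.

    Exhaustively enumerates truth assignments over the given symbols;
    a counterexample (premises true, conclusion false) means the
    reasoning step is logically invalid.
    """
    for values in product([False, True], repeat=len(symbols)):
        env = dict(zip(symbols, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False  # counterexample found: invalid step
    return True

syms = ["rain", "wet"]
implies = lambda e: (not e["rain"]) or e["wet"]  # "if it rains, the ground is wet"

# Valid step (modus ponens): from "rain -> wet" and "rain", conclude "wet".
ok = entails([implies, lambda e: e["rain"]], lambda e: e["wet"], syms)

# Invalid step (affirming the consequent): from "rain -> wet" and "wet",
# conclude "rain" -- fluent-sounding but logically flawed.
bad = entails([implies, lambda e: e["wet"]], lambda e: e["rain"], syms)

print(ok, bad)  # → True False
```

The second check illustrates the class of error the abstract targets: a step that reads plausibly in natural language but has no valid logical derivation, which a prover-backed check flags and a fluency-based reading misses.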

Fri 17 Apr

Displayed time zone: Brasilia, Distrito Federal, Brazil

16:00 - 17:30
Software Engineering for AI 8 — Research Track / New Ideas and Emerging Results (NIER) at Oceania VII
Chair(s): Sheila Reinehr Pontifícia Universidade Católica do Paraná (PUCPR)
16:00
15m
Talk
TaskEval: Synthesised Evaluation for Foundation-Model Tasks
New Ideas and Emerging Results (NIER)
Dilani Widanapathiranage Applied Artificial Intelligence Initiative, Deakin University, Scott Barnett Applied Artificial Intelligence Initiative, Deakin University, Stefanus Kurniawan Deakin University, Wannita Takerngsaksiri Applied Artificial Intelligence Initiative, Deakin University
16:15
15m
Talk
SpecOps: A Fully Automated AI Agent Testing Framework in Real-World GUI Environments
Research Track
Syed Yusuf Ahmed Purdue University, Shiwei Feng Purdue University, Chanwoo Bae Purdue University, Calix Barrus University of Texas at San Antonio, Xiangyu Zhang Purdue University
16:30
15m
Talk
Revisiting "Revisiting Neuron Coverage for DNN Testing: A Layer-Wise and Distribution-Aware Criterion": A Critical Review and Implications on DNN Coverage Testing
Research Track
Jinhan Kim Università della Svizzera italiana, Nargiz Humbatova Università della Svizzera italiana, Gunel Jahangirova King's College London, Shin Yoo KAIST, Paolo Tonella USI Lugano
Pre-print
16:45
15m
Talk
VADA: A Multicultural Benchmark for Value-Aware Data Generation and Alignment Evaluation in LLMs
Research Track
Zhenlun Zhang Nanjing University, Yang Feng Nanjing University, Shihao Weng Nanjing University, Yining Yin Nanjing University, Jincheng Li Nanjing University, Jia Liu Nanjing University
17:00
15m
Talk
Evaluating the effectiveness of LLM-based interoperability
Research Track
Rodrigo Falcão Fraunhofer IESE, Stefan Schweitzer Fraunhofer IESE, Julien Siebert Fraunhofer IESE, Emily Calvet Fraunhofer IESE, Frank Elberzhager Fraunhofer IESE
17:15
15m
Talk
Beyond Correctness: Exposing LLM-generated Logical Flaws in Reasoning via Multi-step Automated Theorem Proving
Research Track
Xinyi Zheng Huazhong University of Science and Technology, Ningke Li National University of Singapore, Xiaokun Luan Peking University, Kailong Wang Huazhong University of Science and Technology, Ling Shi Nanyang Technological University, Meng Sun Peking University, Haoyu Wang Huazhong University of Science and Technology