AI-SQE: The 1st International Workshop on AI for Software Quality Evaluation: Judgment, Metrics, Benchmarks, and Beyond
Overview
Recent advancements in large language models (LLMs) and autonomous agents have introduced a major shift: the use of AI not only for generating software artifacts but also for evaluating them. This shift - from generation to judgment - has the potential to significantly reshape how software quality is assessed, how evaluation pipelines are designed, and how rigorous benchmarks are established.
AI-SQE brings together leading researchers and practitioners to examine the latest developments, fundamental challenges, and future directions in AI-driven software evaluation. The workshop aspires to become the leading forum for exploring how LLMs and agentic systems can serve as dependable evaluators of software quality, performance, and correctness.
AI-SQE invites contributions addressing the theoretical foundations, empirical research, engineering methodologies, and tool development related to AI-based software quality evaluation. As the concept of “LLM-as-a-Judge” (LaaJ) gains traction within the AI community, this workshop offers a vital platform for comprehensive investigation into judgment models, benchmarking strategies, trust calibration, and the automation of evaluation processes.
Motivation and Objectives
AI-SQE addresses a timely shift in software engineering: the rise of AI, especially LLMs and agents, as evaluators of software quality. As traditional human-centric and tool-based methods are being challenged, this workshop explores AI’s emerging role in tasks like code review, testing, and quality assurance. Sitting at the intersection of AI, software engineering, and HCI, the workshop tackles critical issues such as trust, interpretability, and reproducibility, fostering progress in both academic research and industry practices.
Goals and Outcomes
The workshop aims to advance AI-driven software evaluation by developing reliable metrics, benchmarks, and tools for automated assessment. It will explore challenges like AI hallucination and evaluator alignment with human judgment through empirical studies. AI-SQE also seeks to build a collaborative research community focused on reproducibility and practical impact, ultimately shaping how AI is integrated into modern software quality workflows.
Target Audience
The AI-SQE workshop is intended for a diverse group of professionals and researchers who operate at the crossroads of software engineering and artificial intelligence. This includes those involved in software quality assurance, program analysis, automated testing, and the development of AI-based tools. It also welcomes participants exploring human-AI collaboration, empirical methods in software engineering, and tool benchmarking. The workshop aims to create a space where interdisciplinary experts can engage with one another to advance the integration of AI in ensuring software quality.
Mix of Industry and Research Participation
To bridge academic innovation and practical application, AI-SQE aims for a balanced mix of participants from both industry and academia. Key strategies include inviting speakers from major tech companies and AI tool providers, encouraging real-world case study submissions from industry, and disseminating calls for participation through both academic networks and professional communities. This mix will ensure that the workshop remains both theoretically rich and grounded in current practice, supporting dynamic, real-world-relevant discussions.
Call for Papers
Recent advancements in large language models (LLMs) and autonomous agents have introduced a major shift: the use of AI not only for generating software artifacts but also for evaluating them. This shift - from generation to judgment - has the potential to significantly reshape how software quality is assessed, how evaluation pipelines are designed, and how rigorous benchmarks are established. This workshop brings together researchers and practitioners to examine cutting-edge developments, core challenges, and future directions in AI-based software evaluation. It aims to be the leading forum for understanding how LLMs and agentic systems can serve as reliable judges of software quality, correctness, and performance. We invite contributions on theory, empirical studies, engineering practices, and tools. As the “LLM-as-a-Judge” (LaaJ) paradigm gains momentum, AI-SQE offers a timely venue for advancing models of judgment, evaluation metrics, benchmarking frameworks, and scalable automation.
Topics of Interest
We welcome submissions across a broad spectrum of topics related to AI-driven software evaluation, including but not limited to:
- LLMs as Evaluators in Software Engineering
  Explorations of how LLMs can assess code quality, correctness, security, maintainability, and performance across software artifacts.
- LLM-as-Judge (LaaJ): Foundations & Latest Techniques
  Theoretical frameworks and practical implementations for treating LLMs as evaluators, including prompt engineering, voting schemes, and confidence calibration.
- Agent as a Judge: Agentic Approaches for Evaluation
  Use of autonomous or multi-agent systems to conduct evaluation tasks collaboratively, iteratively, or in multi-turn contexts.
- Evaluating Code Agents and Code Agentic Systems
  Benchmarks and evaluation methods for assessing the performance and reliability of AI agents that write, test, refactor, or debug software in autonomous or semi-autonomous settings.
- Scalable Evaluation Pipelines for Software Systems
  Design and implementation of automated pipelines for large-scale software quality evaluation using LLMs and hybrid human-AI approaches, including CI/CD integration and workflow automation.
- Metrics and Benchmarks for Software Evaluation
  Development of new metrics and standardized benchmark tasks tailored to AI-based evaluation of software quality, correctness, readability, and maintainability.
- Trust, Reliability & Explainability in LLM-based Judgment
  Techniques to assess and improve the trustworthiness, reproducibility, explainability, interpretability, and robustness of LLMs when used for software quality judgments, including methods for generating clear rationales and transparent decision processes.
- Task-Specific Fine-Tuning for LLM-as-a-Judge
  Techniques for fine-tuning or adapting LLMs for specialized evaluation tasks, including human-in-the-loop and RLHF approaches.
- Generative AI for Software Quality Improvement
  Approaches where generative models not only detect quality issues but also propose fixes, refactoring, test cases, or architectural improvements, closing the loop from evaluation to enhancement.
- Real-World Applications and Case Studies
  Practical deployments of AI-based evaluators in industry, open-source projects, or education.
Paper Submission and Review
All submitted papers must describe original work that is neither published nor under review elsewhere, and must be written in English. All submissions will undergo single-blind peer review by at least three members of the program committee and will be evaluated on workshop relevance, novelty and technical quality, clarity of presentation, and potential to stimulate discussion or lead to future research. Extended abstracts will be reviewed for relevance and soundness, but will not be held to the same technical depth standards as full research papers.
Types of Contributions and Page Limits:
AI-SQE will accept the following types of submissions:
- Research Papers (up to 8 pages, excluding references)
  Full-length papers presenting novel research results, comprehensive empirical studies, or significant engineering contributions related to AI-based software evaluation.
- Extended Abstracts (up to 5 pages, including references)
  Concise submissions presenting new ideas, summaries of recent work, early insights, or position statements. Extended abstracts will be clearly marked as such in the proceedings.
Page limits include the abstract and all figures and tables; references are counted or excluded as specified for each submission type above. Extended abstracts will be published free of article processing charges (APCs), as per ACM policy, provided they are explicitly labeled as “extended abstracts” in both the submission and the proceedings.
All authors must use the official ACM Primary Article Template and submit their papers in PDF format through the HotCRP system. LaTeX authors must use \documentclass[sigconf,review]{acmart} in the preamble of the main file, which typesets the paper in a double-column format with line numbers for easy reference by the reviewers.
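For LaTeX users, a minimal skeleton along these lines should satisfy the formatting requirement; the title, author, affiliation, and email values below are placeholders, and authors should consult the official ACM Primary Article Template for the full set of required metadata commands (e.g., CCS concepts and keywords):

\documentclass[sigconf,review]{acmart} % double-column review layout with line numbers

\title{Your AI-SQE Submission Title}
\author{First Author}
\affiliation{%
  \institution{Example University}
  \city{Example City}
  \country{Example Country}}
\email{first.author@example.org}

\begin{document}

% In acmart, the abstract environment must appear before \maketitle.
\begin{abstract}
A short summary of the submission.
\end{abstract}

\maketitle

\section{Introduction}
Body text of the submission goes here.

\end{document}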
Accepted papers will be published in the ACM Digital Library. Publication requires that at least one author registers for AI-SQE 2026 and presents the paper orally during the workshop.
All questions about the ACM template should be emailed to Onn Shehory (Proceedings Chair) at Onn.Shehory@biu.ac.il.