FORGE 2026
Sun 12 - Sat 18 April 2026 Rio de Janeiro, Brazil
co-located with ICSE 2026

Current code generation benchmarks focus primarily on functional correctness while overlooking two critical aspects of real-world programming: algorithmic efficiency and code quality. We introduce COMPASS (COdility’s Multi-dimensional Programming ASSessment), a comprehensive evaluation framework that assesses code generation across three dimensions: correctness, efficiency, and quality. COMPASS consists of 50 competitive programming problems from real Codility competitions, providing authentic human baselines from 393,150 submissions. Unlike existing benchmarks that treat algorithmically inefficient solutions identically to optimal ones provided they pass test cases, COMPASS systematically evaluates runtime efficiency and code quality using industry-standard analysis tools. Our evaluation of three leading reasoning-enhanced models, Anthropic Claude Opus 4, Google Gemini 2.5 Pro, and OpenAI O4-Mini-High, reveals that models achieving high correctness scores do not necessarily produce efficient algorithms or maintainable code. These findings highlight the importance of evaluating more than just correctness to truly understand the real-world capabilities of code generation models. COMPASS serves as a guiding framework, charting a path for future research toward AI systems that are robust, reliable, and ready for production use.
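The gap the abstract describes, where a pass/fail benchmark cannot tell an inefficient solution from an optimal one, can be illustrated with a minimal sketch. This is not COMPASS's actual scoring method. The toy problem, the `Report` type, the `evaluate` helper, and the use of counted loop iterations as an efficiency proxy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Report:
    correct: bool  # did every test case pass?
    steps: int     # counted loop iterations, a crude stand-in for runtime


def max_pair_sum_quadratic(nums, counter):
    """Largest sum of two distinct elements, checking every pair: O(n^2)."""
    best = None
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            counter[0] += 1
            total = nums[i] + nums[j]
            if best is None or total > best:
                best = total
    return best


def max_pair_sum_linear(nums, counter):
    """Same result, tracking the two largest values in one pass: O(n)."""
    first = second = None
    for x in nums:
        counter[0] += 1
        if first is None or x > first:
            first, second = x, first
        elif second is None or x > second:
            second = x
    return first + second


def evaluate(solution, cases):
    """Hypothetical harness: report correctness and the step count together."""
    counter = [0]
    results = [solution(list(nums), counter) == expected for nums, expected in cases]
    return Report(correct=all(results), steps=counter[0])


# Illustrative test cases (input list, expected answer); not from the benchmark.
CASES = [
    ([1, 2, 3], 5),
    ([5, 5], 10),
    ([-1, -2, -3], -3),
    (list(range(100)), 197),
]

slow = evaluate(max_pair_sum_quadratic, CASES)
fast = evaluate(max_pair_sum_linear, CASES)
print(slow, fast)
```

Both solutions pass every case, so a correctness-only score ranks them equally, while the step counts separate them clearly. The third dimension, code quality, would require an external static analysis tool and is not modeled in this sketch.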

Mon 13 Apr

Displayed time zone: Brasilia, Distrito Federal, Brazil

11:00 - 12:30
Session III - Code Generation & Migration
Data and Benchmarking / Research Papers at Oceania I
Chair(s): Daniel Rodriguez-Cardenas William & Mary
11:00
6m
Talk
Deep Graph-Language Fusion for Structure-Aware Code Generation
Research Papers
Mert Tiftikci TU Darmstadt; hessian.AI, Amir Molzam Sharifloo TU Darmstadt, Mira Mezini TU Darmstadt; hessian.AI; National Research Center for Applied Cybersecurity ATHENE
11:06
12m
Talk
Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation
Research Papers
Laboni Sarker University of California at Santa Barbara, Mara Downing, Achintya Desai University of California, Santa Barbara, Tevfik Bultan University of California at Santa Barbara
11:18
6m
Talk
Detecting and Correcting Hallucinations in LLM-Generated Code via Deterministic AST Analysis
Research Papers
Dipin Khati William & Mary, Daniel Rodriguez-Cardenas William & Mary, Paul Pantzer William & Mary, Denys Poshyvanyk William & Mary
11:24
12m
Talk
Code for Machines, Not Just Humans: Quantifying AI-Friendliness with Code Health Metrics
Research Papers
Markus Borg CodeScene, Nadim Hagatulah Lund University, Adam Tornhill Codescene AB, Emma Söderberg Lund University
Pre-print
11:36
6m
Talk
VHDL-Instruct: Training Open Dataset for LLMs Benchmarking and HDL Code Generation
Virtual Attendance
Data and Benchmarking
Patrik Drazic University of Southern Denmark, Benaoumeur Senouci University of Southern Denmark, Boualem Benatallah Dublin City University
Media Attached
11:48
6m
Talk
COMPASS: A Psychometrics-Guided Multi-Dimensional Benchmark for Code Generation Evaluation
Data and Benchmarking
James Meaden Codility, Markus Borg CodeScene
Pre-print
11:54
6m
Talk
A Hybrid LLM-Guided Approach to Code Migration Using API-Derived Rules
Virtual Attendance
Research Papers
Gabriel Vitor Klaumann Gubert Technische Hochschule Ingolstadt (THI), Stefan Kugele Technische Hochschule Ingolstadt, Munir Georges Technische Hochschule Ingolstadt (THI)
Media Attached
12:00
12m
Talk
An Experience Report on LLM-Based Agentic Translation from Android to iOS: Pitfalls and Insights
Research Papers
Zhili Zeng York University, Kimya Khakzad Shahandashti York University, Alvine Boaye Belle York University, Song Wang York University, Zhen Ming (Jack) Jiang York University
12:12
6m
Talk
MiG.4: A Curated Dataset of Library Migrations in Java and Python
Data and Benchmarking
Matheus Barbosa UFMG, Pedro Baptista UFMG, João Eduardo Montandon Universidade Federal de Minas Gerais (UFMG), Matheus Lima UFMG, Pedro Henrique Fernandes Baptista UFMG
12:18
6m
Talk
PromiseAwait: A Dataset of JavaScript Migrations from Promises to Async/Await
Data and Benchmarking
Rafael Araujo Magesty UFMG, João Eduardo Montandon Universidade Federal de Minas Gerais (UFMG)