Large Language Models for code (LLMc) have shown significant potential in automating software engineering (SE) tasks, particularly code generation. This research bridges the gap between rigorous causal analysis and human interpretability in evaluating LLMc in SE contexts. The study introduces \textit{InterpretSE}, a novel approach that combines a taxonomy of code syntax concepts with a token clustering method to enhance the interpretability of LLMc outputs, facilitate debugging, and advance interpretable AI for SE. The taxonomy maps low-level, non-interpretable tokens to higher-level, human-understandable concepts, enabling practitioners to debug LLMc outputs effectively. The study also proposes a dynamic benchmark comprising a testbed, a metric, and a protocol for applying the metric. The research addresses two key challenges in LLMc evaluation: the lack of formal, transparent, and interpretable benchmarking methods, and the difficulty of interpreting LLMc outputs given the complexity of token-level representations. By providing a dynamic, causal analysis-driven benchmark and a human-interpretable taxonomy, this work offers actionable insights for improving LLMc behavior, supporting prompt engineering, and advancing the reliability and effectiveness of LLMc in SE tasks. The expected contribution is a significant advancement in interpretable AI for software engineering, enabling more rigorous and reliable evaluation of LLMc capabilities and limitations.
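To make the taxonomy idea concrete, the Python sketch below shows one plausible shape for such a token-to-concept mapping: token-level scores (e.g., attributions or error rates) are rolled up into concept-level scores a practitioner can inspect. This is a minimal illustration under assumed structure, not the paper's actual implementation; the names `CONCEPT_TAXONOMY` and `aggregate_by_concept` and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical taxonomy: maps low-level code tokens to higher-level,
# human-understandable syntax concepts (assumed structure; the paper's
# actual taxonomy may be richer or organized differently).
CONCEPT_TAXONOMY = {
    "def": "function_declaration",
    "return": "function_declaration",
    "if": "conditional",
    "else": "conditional",
    "for": "loop",
    "while": "loop",
    "=": "assignment",
}

def aggregate_by_concept(tokens, scores):
    """Roll token-level scores up to concept-level scores."""
    concept_scores = defaultdict(list)
    for token, score in zip(tokens, scores):
        concept = CONCEPT_TAXONOMY.get(token, "other")
        concept_scores[concept].append(score)
    # Average per concept so concepts with many tokens don't dominate.
    return {c: sum(v) / len(v) for c, v in concept_scores.items()}

# Illustrative usage: token-level model scores become concept-level
# diagnostics (sample tokens and scores are invented for this sketch).
tokens = ["def", "f", "(", "x", ")", ":", "return", "x"]
scores = [0.9, 0.2, 0.1, 0.3, 0.1, 0.1, 0.8, 0.4]
print(aggregate_by_concept(tokens, scores))
# Prints concept-level averages, e.g. function_declaration ~0.85, other ~0.2
```

The design point this illustrates is the abstraction step itself: a practitioner debugging an LLMc need not reason about individual tokens, only about which syntax concepts the model handles well or poorly.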