A Causal Perspective on Measuring, Explaining and Mitigating Smells in LLM-Generated Code
Recent advances in large language models (LLMs) have enabled their widespread adoption in software engineering tasks, including code completion, test generation, and program repair. However, despite their impressive fluency, concerns remain about the structural quality of the code they produce. In particular, LLMs often replicate poor coding practices, introducing code smells (i.e., patterns that hinder readability, maintainability, or design integrity). Although prior research has explored the detection or repair of smells, there is limited understanding of how and when these issues emerge in generated code. In this paper, we address this gap by systematically measuring, explaining, and mitigating the propensity of LLMs to generate code smells. To this end, we introduce the Propensity Smelly Score (PSC), a probabilistic metric that estimates the likelihood that a model generates specific types of code smells. We evaluate the robustness of PSC and demonstrate that it captures code smell types and explains model outputs better than canonical metrics such as BLEU and CodeBLEU. We then apply causal inference techniques to identify the factors that influence PSC, such as generation strategy, model architecture, and prompt design. Our findings show that prompt formulation and model design play a pivotal role in the emergence of smells. Finally, we conduct a mitigation study and a user evaluation. The results show that prompt-based interventions significantly reduce the presence of smells during inference, and that practitioners find smell estimates useful and actionable for reasoning about model outputs. Our work provides a foundation for integrating quality-aware assessments into the evaluation and deployment of LLMs for code.
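The abstract does not specify how PSC is computed, so the following is only an illustrative sketch of a propensity-style smell score under stated assumptions: sample several completions per prompt from the model and estimate, for each smell type, the fraction of completions that a smell detector flags. The function name `propensity_smelly_score` and the `detect_smells` hook are hypothetical placeholders, not the paper's implementation.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Iterable


def propensity_smelly_score(
    completions: Iterable[str],
    detect_smells: Callable[[str], set[str]],
) -> dict[str, float]:
    """Estimate, per smell type, the fraction of sampled completions
    that a detector flags as exhibiting that smell (a Monte Carlo
    proxy for the likelihood of generating that smell)."""
    completions = list(completions)
    counts: Counter[str] = Counter()
    for code in completions:
        # detect_smells is assumed to return the set of smell labels
        # present in one completion, e.g. {"Long Method", "God Class"}.
        for smell in detect_smells(code):
            counts[smell] += 1
    n = max(len(completions), 1)
    return {smell: hits / n for smell, hits in counts.items()}
```

In such a setup, `completions` would be N samples drawn from the LLM for a single prompt, and `detect_smells` would wrap an off-the-shelf smell detector (for example, a PMD- or SonarQube-based analysis); varying the prompt, decoding strategy, or model would then provide the factors examined in the causal analysis described above.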