How Natural Language Proficiency Shapes GenAI Code for Software Engineering Tasks
Foundation Model (FM)-powered coding assistants, such as GitHub Copilot, ChatGPT, Claude, and Gemini, have disrupted the software development landscape, becoming essential interfaces between developers and generated code. While the efficacy of these Large Language Models (LLMs) depends heavily on the quality of natural language prompts, existing research has predominantly focused on prompt engineering techniques rather than on the linguistic proficiency of the user.
This paper addresses natural language proficiency, a critical yet underexplored factor. A mismatch between the complexity of a prompt and the resulting code can create significant friction in the software engineering lifecycle. For instance, a developer with high linguistic proficiency (CEFR C1) but novice programming skills (A1) may trigger an LLM to generate highly idiomatic, complex code that they cannot maintain or debug. Conversely, overly simple prompts might yield inefficient solutions for expert users. Furthermore, as diverse teams adopt these tools, ensuring that AI-generated code aligns with the team's specific proficiency is crucial for preventing subtle bugs and maintaining development velocity.
To investigate this, we conducted an empirical study using the HumanEval dataset (164 hand-written Python problem-solving tasks) and the HumanEvalPlus test suite, with three state-of-the-art models: GPT-4o, Gemini 2.5 Pro, and Claude Sonnet 4. Our methodology involved two alignment standards. First, we rated the English proficiency of problem descriptions against the Common European Framework of Reference for Languages (CEFR), which ranges from A1 (Beginner) to C2 (Proficient). Second, we rated code proficiency by adapting the pycefr standard, which grades Python elements by conceptual difficulty (e.g., print() is A1, while zip() and map() are C2). We then structured our investigation around two research questions. RQ1: What is the baseline natural language proficiency of software engineering problem descriptions generated by LLMs? Here we investigated the default linguistic level LLMs employ when describing technical tasks, establishing a baseline for comprehension requirements. RQ2: Does the natural language proficiency of the prompt influence the proficiency and correctness of the generated code? Here we systematically varied prompt proficiency across CEFR levels and observed how the user's language input shapes the model's code output.
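To make the code-proficiency rating concrete, below is a minimal sketch of how Python elements can be mapped to CEFR-style levels via the AST, assuming a small hand-picked subset of the pycefr element-to-level mapping; the names ELEMENT_LEVELS and estimate_code_level are illustrative, not part of the pycefr tool or the paper's artifact.

    import ast

    # Illustrative subset only; the actual pycefr standard covers far more
    # Python elements than the built-in calls listed here.
    ELEMENT_LEVELS = {"print": "A1", "len": "A2", "zip": "C2", "map": "C2"}
    CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

    def estimate_code_level(source: str) -> str:
        """Return the highest CEFR level among recognized calls in source."""
        highest = "A1"
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                level = ELEMENT_LEVELS.get(node.func.id)
                if level and CEFR_ORDER.index(level) > CEFR_ORDER.index(highest):
                    highest = level
        return highest

    print(estimate_code_level("pairs = list(zip(xs, ys))\nprint(pairs)"))  # C2

Under this scheme, a snippet's rating is driven by its most advanced construct, so a single zip() call lifts otherwise beginner-level code to C2.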
Our analysis yields several key insights into the relationship between natural language and code generation. LLMs default to an Intermediate (B2) or higher natural language level when describing software engineering problems, suggesting that a baseline level of English proficiency is effectively a prerequisite for developers to fully comprehend standard AI-generated explanations. The impact of prompt language on code proficiency varied between models, but the impact on correctness was consistent: higher-proficiency prompts (C1/C2) yielded code with higher correctness rates across all models, while simplifying the natural language of the prompt to lower CEFR levels resulted in a measurable decrease in code correctness. This study demonstrates that natural language proficiency is not only a user characteristic but also a functional control lever for code generation. The results highlight a practical tension: simplifying language to accommodate non-native speakers or novices may degrade the reliability of AI-generated solutions. These findings suggest that future FM-powered tools must explicitly account for user language proficiency to deliver solutions that are both correct and comprehensible.
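To illustrate how the per-level correctness comparison behind RQ2 can be tabulated, here is a minimal sketch; the (level, passed) pairs below are placeholder values for demonstration only, not measurements from the study, and in practice each flag would come from running a generated solution against the HumanEvalPlus tests.

    from collections import defaultdict

    # Placeholder outcomes; real entries would record whether the code
    # generated for a prompt at each CEFR level passed the test suite.
    results = [("A2", False), ("A2", True), ("B2", True),
               ("B2", True), ("C1", True), ("C2", True)]

    tally = defaultdict(lambda: [0, 0])  # level -> [passed, total]
    for level, passed in results:
        tally[level][1] += 1
        tally[level][0] += passed

    for level in sorted(tally):
        passed, total = tally[level]
        print(f"{level}: {passed}/{total} passed ({passed / total:.0%})")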
Fri 20 Mar (displayed time zone: Athens)
11:00 - 12:30 | Session 6A - Tools and Techniques for Effective Software Development (Industrial Track / Journal First Track / Tool Demo Track / Research Track) at Panorama. Chair(s): NIKIEMA Beninwende Serge Lionel (University of Luxembourg)
11:00 (15m) Talk | Journal First Track | How Natural Language Proficiency Shapes GenAI Code for Software Engineering Tasks. Ruksit Rojpaisarnkit, Youmei Fan, and Kenichi Matsumoto (Nara Institute of Science and Technology); Raula Gaikovina Kula (The University of Osaka)
11:15 (15m) Talk | Journal First Track | Data Catalog Tools: A Systematic Multivocal Literature Review. Marco Tonnarelli (JADS - TU/e), Indika Kumara (Tilburg University), Stefan Driessen (JADS, Tilburg University), Damian Andrew Tamburri (University of Sannio - JADS/NXP Semiconductors), Willem-Jan van den Heuvel (JADS, Tilburg University), Patrick Oor (NXP Semiconductors)
11:30 (15m) Talk | Industrial Track | On the Practical Adoption of a Static Performance Anti-Pattern Detector: An Industrial Case Study. Lizhi Liao (University of Guelph), Weiyi Shang (University of Waterloo), Catalin Sporea, Andrei Toma, and Sarah Sajedi (ERA Environmental Management Solutions)
11:45 (15m) Talk | Industrial Track | Multi-CoLoR: Context-Aware Localization and Reasoning across Multi-Language Codebases. Indira Vats (University of Toronto; Advanced Micro Devices (AMD)), Sanjukta De (Advanced Micro Devices), Subhayan Roy, Saurabh Bodhe, Lejin Varghese, Max Kiehn, Yonas Bedasso (Advanced Micro Devices), Marsha Chechik (University of Toronto). Pre-print
12:00 (15m) Talk | Industrial Track | Diagram-Aware Automatic Review of Software Design Documents Using Multimodal Large Language Models
12:15 (7m) Talk | Tool Demo Track | Source Code-Driven GDPR Documentation: Supporting RoPA with Assessor View. Mugdha Khedkar (Heinz Nixdorf Institute, Paderborn University), Michael Schlichtig (Heinz Nixdorf Institut, Paderborn University), Eric Bodden (Heinz Nixdorf Institute at Paderborn University & Fraunhofer IEM). Pre-print, Media Attached
12:22 (7m) Talk | Tool Demo Track | RefineID: A Developer-Centric IDE Assistant for Better Identifiers