ASE 2024
Sun 27 October - Fri 1 November 2024 Sacramento, California, United States

Source code modeling is a promising avenue for automating software development tasks such as code generation, bug repair, and program analysis. This research direction aims to train deep neural networks to learn the statistical predictability inherent in human-written programs, with the goal of improving developer productivity, code quality, and the overall software development life cycle.

Existing code modeling approaches, particularly those built on Transformer-based language models, have proven effective across various software engineering tasks. However, most of them directly apply learning schemes from natural language processing (e.g., data collection and processing, training objectives) to source code, and so focus primarily on learning code text and syntax. Such a direct transplant limits the models’ capability to capture deep program semantics, such as code functionality, dependencies, and program states during execution.

In this research proposal, we highlight the critical role of program semantics in source code modeling. We propose a range of methodologies to bridge the gap between text-based language models trained on large-scale code and the deep semantic understanding required to assist with software engineering tasks effectively. We also demonstrate the efficacy of the proposed semantic-aware code modeling through several published papers and preliminary results, which motivate further exploration of this direction during doctoral research.