The Landscape of Source Code Representation Learning in AI-Driven Software Engineering Tasks
Appropriate representation of source code and its relevant properties forms the backbone of Artificial Intelligence (AI)/Machine Learning (ML) pipelines for software engineering (SE) tasks such as code classification, bug prediction, code clone detection, and code summarization. In the literature, researchers have experimented extensively with different kinds of source code representations (syntactic, semantic, integrated, customized) and properties, ranging from tree/graph representations such as Abstract Syntax Trees (ASTs) to pre-trained transformer models like CodeBERT. It is also common for researchers to hand-craft customized source code representations for a specific SE task. In a 2018 survey, Allamanis et al. listed ~35 different source code representations used for different SE tasks, including ASTs, customized ASTs, Control Flow Graphs (CFGs), and Data Flow Graphs (DFGs). The goal of this tutorial is two-fold: (i) to present an overview of the state of the art in source code representations and the corresponding ML pipelines, with an explicit focus on the pros and cons of each representation; and (ii) to discuss practical challenges in infusing different code views into state-of-the-art ML models.
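To make the idea of a syntactic representation concrete, here is a minimal sketch of how a snippet can be turned into AST-derived features. It is not taken from the tutorial. It assumes Python's standard `ast` module, and the helper names `ast_edges` and `node_type_counts` are hypothetical, chosen only for illustration.

```python
import ast

# A tiny example program; any valid Python source would do.
SNIPPET = "def add(a, b):\n    return a + b\n"


def ast_edges(source):
    """Return (parent_type, child_type) pairs for every edge in the AST."""
    tree = ast.parse(source)
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


def node_type_counts(source):
    """Return a bag-of-node-types feature: node type name -> count."""
    counts = {}
    for node in ast.walk(ast.parse(source)):
        name = type(node).__name__
        counts[name] = counts.get(name, 0) + 1
    return counts


if __name__ == "__main__":
    for parent, child in ast_edges(SNIPPET):
        print(f"{parent} -> {child}")
    print(node_type_counts(SNIPPET))
```

The edge list is the kind of input a tree- or graph-based model might consume, while the node-type counts are a much cruder feature vector. Both capture syntax only. Semantic views such as CFGs or DFGs would need additional analysis that this sketch does not attempt.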
Fri 19 May (displayed time zone: Hobart)
11:00 - 12:30
11:00 | 90m | Talk | The Landscape of Source Code Representation Learning in AI-Driven Software Engineering Tasks | Technical Briefings | Sridhar Chimalakonda (IIT Tirupati), Debeshee Das (Indian Institute of Technology Tirupati), Alex Mathai (IBM India Research Labs), Srikanth Tamilselvam (IBM Research), Atul Kumar (IBM India Research Labs)