CodeMapper: A Language-Agnostic Approach to Mapping Code Regions Across Commits
Software evolves through numerous code changes. To follow changes made by others, developers commonly face the problem of mapping a specific code region from one commit to another. For example, they may want to determine how the condition of an if-statement, a specific line in a configuration file, or the definition of a function changes. We call this problem the \emph{code mapping problem}. Existing techniques, such as git diff, address this problem insufficiently because they show all changes made to a file instead of focusing on a code region of the developer’s choice. Other techniques focus on specific code elements and programming languages, e.g., methods in Java, which limits their applicability. This paper introduces CodeMapper, an approach that addresses the code mapping problem independently of specific program elements and programming languages. Given a code region in one commit, CodeMapper finds the corresponding region in another commit. The approach consists of two phases: (i) computing candidate regions by analyzing diffs, detecting code movements, and searching for specific code fragments, and (ii) selecting the most likely target region by calculating similarities. Our evaluation applies CodeMapper to four datasets: two new, hand-annotated datasets, each containing 100 code region pairs of different sizes, written in ten popular programming languages; a dataset of 187 evolving comments used to suppress static analysis warnings in Python; and a dataset of 2,005 pairs of code elements from prior work. CodeMapper identifies exactly the expected target region (i.e., achieves an exact match) in 71.0%–94.5% of all cases, improving over the best available baselines by 1.5–58.8 percentage points.
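To make the two-phase idea concrete, the following is a minimal, line-level sketch in Python and not CodeMapper's actual implementation. It assumes regions are half-open line ranges, uses the standard-library difflib both for the diff and for the similarity score, and approximates movement detection by an exact search for the source fragment. All function names (diff_candidate, search_candidates, map_region) are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher


def diff_candidate(old, new, start, end):
    """Phase (i), diff-based: map the line range [start, end) through a line diff."""
    new_start = new_end = None
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Skip blocks that do not overlap the region or that have no
        # counterpart on one side (pure insertions and deletions).
        if i2 <= start or i1 >= end or tag in ("insert", "delete"):
            continue
        if tag == "equal":
            # Unchanged lines shift by a constant offset.
            lo = j1 + (max(start, i1) - i1)
            hi = j1 + (min(end, i2) - i1)
        else:  # "replace": take the whole replacement block as the counterpart
            lo, hi = j1, j2
        new_start = lo if new_start is None else min(new_start, lo)
        new_end = hi if new_end is None else max(new_end, hi)
    return None if new_start is None else (new_start, new_end)


def search_candidates(old, new, start, end):
    """Phase (i), search-based: find exact copies of the fragment (e.g. moved code)."""
    fragment = old[start:end]
    n = len(fragment)
    return [(j, j + n) for j in range(len(new) - n + 1) if new[j:j + n] == fragment]


def map_region(old, new, start, end):
    """Phase (ii): pick the candidate most similar to the source region."""
    source_text = "\n".join(old[start:end])
    candidates = set(search_candidates(old, new, start, end))
    mapped = diff_candidate(old, new, start, end)
    if mapped is not None:
        candidates.add(mapped)
    if not candidates:
        return None  # the region appears to have been deleted

    def score(candidate):
        target_text = "\n".join(new[candidate[0]:candidate[1]])
        return SequenceMatcher(None, source_text, target_text).ratio()

    # Sorting makes the choice deterministic when scores tie.
    return max(sorted(candidates), key=score)


old = ["import os", "", "def f(x):", "    if x > 0:", "        return x", "    return 0"]
new = ["import os", "import sys", "", "def f(x):", "    if x >= 0:",
       "        return x", "    return 0"]
region = map_region(old, new, 3, 4)  # track the if-condition on old line index 3
```

In this example the tracked condition is edited and shifted down by one line, so only the diff-based candidate exists and it is selected. For a region that was moved without modification, the search-based candidate would instead supply the target, which is why the sketch keeps both candidate sources before scoring.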