Manual Abstraction in the Wild: A Multiple-Case Study on OSS Systems
Models are a useful tool for software design and analysis, and for supporting the onboarding of new maintainers. However, these benefits are often lost over time, as the system implementation evolves and the original models are not updated. Reverse engineering methods and tools could help keep models and implementation code in sync; however, automatically reverse-engineered models are typically not abstract and contain excessive detail that hinders rather than supports understanding. Recent advances in AI-based content generation make it likely that we will soon see reverse engineering tools that support human-grade abstraction. To inform the design and validation of such tools, we need a principled understanding of what manual abstraction is, a question that has received little attention in the literature so far. Towards this goal, we present a multiple-case study of model-to-code differences, investigating five substantial open-source software projects retrieved via repository mining. To explore the characteristics of model-to-code differences, we manually matched a total of 466 classes, 1352 attributes, and 2634 operations from source code to 338 model elements (classes, attributes, operations, and relationships). These mappings precisely capture the differences between each provided class diagram design and its implementation codebase. Studying all differences in detail allowed us to derive a taxonomy of difference types and to provide a sorted list of cases for each identified type. As we discuss, our contributions pave the way for improved reverse engineering methods and tools, new mapping rules for model-to-code consistency checks, and guidelines for avoiding over-abstraction and over-specification during design.