Software development activities produce numerous types of artifacts as byproducts: source code, version control system metadata, bug reports, mailing list conversations, test data, etc. Empirical software engineering (ESE) has thrived by mining all those artifacts to uncover the inner workings of software development and improve its practices. Which artifacts the field studies is a moving target, however, and we investigate it empirically in this paper.
We perform a meta-analysis of software artifact mining studies published in top conferences in (empirical) software engineering, for a total of 9622 papers, which we analyze using natural language processing (NLP) techniques. We quantitatively characterize the types of software artifacts most often mined in those studies and how they evolved over a 16-year period (2004-2020). We also analyze which combinations of artifact types are most often mined together, as well as the relationship between study purposes and mined artifacts.
We discuss how our findings can inform research policy decisions on study repeatability and on the production of open datasets that enable future studies in the field.