Thu 26 Oct 2023, 10:50 - 11:10 at Rhythms 3 - 1B - Machine learning in SE
Chair(s): Davide Taibi

CONTEXT: Large language models trained on source code can support a variety of software development tasks, such as code recommendation and program repair. Training such models on large amounts of data benefits their performance. However, the size of the data and models results in long training times and high energy consumption. While publishing source code enables replicability, users need to repeat the expensive training process if trained models are not shared.

GOALS: The main goal of this study is to investigate whether publications that trained language models for software engineering (SE) tasks share their source code and trained artifacts. The second goal is to analyze transparency regarding the energy used for training.

METHODS: We perform a snowballing-based literature search to find publications on language models for source code, and analyze their reusability from a sustainability standpoint.

RESULTS: From a total of 494 unique publications, we identified 293 relevant publications that use language models to address code-related tasks. Among them, 27% (79 out of 293) make artifacts available for reuse, either as tools or IDE plugins designed for specific tasks, or as task-agnostic models that can be fine-tuned for a variety of downstream tasks. Moreover, we collect insights on the hardware used for model training, as well as training time, which together determine the energy consumption of the development process.

CONCLUSION: We find deficiencies in the sharing of information and artifacts in current studies on source code models for software engineering tasks, with 40% of the surveyed papers sharing neither source code nor trained artifacts. We recommend sharing source code as well as trained artifacts to enable sustainable reproducibility. Moreover, comprehensive information on training times and hardware configurations should be shared for transparency on a model’s carbon footprint.

KEYWORDS: sustainability, reuse, replication, energy, DL4SE.
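
For context on why hardware and training time together determine energy use and carbon footprint: a minimal sketch of the common back-of-the-envelope estimate, which multiplies device power draw, device count, and training time, scaled by data-center overhead (PUE) and grid carbon intensity. This sketch is not from the paper; all function names, constants, and default values below are illustrative assumptions.

```python
# Minimal sketch (illustrative assumptions, not reported results):
# estimating training energy and CO2e from hardware and training-time details.

def training_energy_kwh(gpu_power_watts: float, num_gpus: int,
                        training_hours: float, pue: float = 1.5) -> float:
    """Energy (kWh) = per-device power x device count x time, scaled by PUE."""
    return (gpu_power_watts / 1000) * num_gpus * training_hours * pue

def co2e_kg(energy_kwh: float, grid_intensity_kg_per_kwh: float = 0.4) -> float:
    """CO2-equivalent emissions for an assumed grid carbon intensity."""
    return energy_kwh * grid_intensity_kg_per_kwh

# Hypothetical configuration: 4 GPUs at 300 W each, trained for 120 hours.
energy = training_energy_kwh(gpu_power_watts=300, num_gpus=4, training_hours=120)
print(f"{energy:.0f} kWh, ~{co2e_kg(energy):.0f} kg CO2e")
```

For this hypothetical configuration, the estimate comes to roughly 216 kWh and about 86 kg CO2e, which is why the abstract argues that reporting training time and hardware suffices for footprint transparency.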

Thu 26 Oct

Displayed time zone: Central Time (US & Canada)

10:30 - 12:15
1B - Machine learning in SE (ESEM Technical Papers / ESEM Journal-First Papers / ESEM IGC) at Rhythms 3
Chair(s): Davide Taibi University of Oulu
10:30
20m
Full-paper
What is the Carbon Footprint of ML Models on Hugging Face? A Repository Mining Study
ESEM Technical Papers
Joel Castaño Fernández Universitat Politècnica de Catalunya (UPC), Silverio Martínez-Fernández UPC-BarcelonaTech, Xavier Franch Universitat Politècnica de Catalunya, Justus Bogner Vrije Universiteit Amsterdam
10:50
20m
Full-paper
An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code
ESEM Technical Papers
Max Hort Simula Research Laboratory, Anastasiia Grishina Simula Research Laboratory, Leon Moonen Simula Research Laboratory and BI Norwegian Business School
11:10
20m
Full-paper
An Empirical Study on Low- and High-Level Explanations of Deep Learning Misbehaviours
ESEM Technical Papers
Tahereh Zohdinasab USI Lugano, Vincenzo Riccio University of Udine, Paolo Tonella USI Lugano
11:30
20m
Full-paper
Assessing the Use of AutoML for Data-Driven Software Engineering
ESEM Technical Papers
Fabio Calefato University of Bari, Luigi Quaranta University of Bari, Filippo Lanubile University of Bari, Marcos Kalinowski Pontifical Catholic University of Rio de Janeiro (PUC-Rio)
11:50
10m
Journal Early-Feedback
An Empirical Study on ML DevOps Adoption Trends, Efforts, and Benefits Analysis
ESEM Journal-First Papers
Dhia Elhaq Rzig University of Michigan - Dearborn, Foyzul Hassan University of Michigan - Dearborn, Marouane Kessentini Oakland University
12:00
15m
Industry talk
The Perspective of Software Professionals on Algorithmic Racism
ESEM IGC