Does Configuration Encoding Matter in Learning Software Performance? An Empirical Study on Encoding Schemes
Learning and predicting the performance of a configurable software system helps to provide better quality assurance. One important engineering decision therein is how to encode the configuration into the model being built. Despite the presence of different encoding schemes, there is still little understanding of which is better and under what circumstances, as the community often relies on general beliefs that inform the decision in an ad hoc manner. To bridge this gap, in this paper, we empirically compared the widely used encoding schemes for software performance learning, namely label, scaled label, and one-hot encoding. The study covers five systems, seven models, and three encoding schemes, leading to 105 cases of investigation. Our key findings reveal that: (1) conducting trial-and-error to find the best encoding scheme in a case-by-case manner can be rather expensive, requiring up to 400+ hours on some models and systems; (2) the one-hot encoding often leads to the most accurate results, while the scaled label encoding is generally weak on accuracy across different models; (3) conversely, the scaled label encoding tends to result in the fastest training time across the models/systems, while the one-hot encoding is the slowest; (4) for all models studied, label and scaled label encoding often lead to less biased outcomes between accuracy and training time, i.e., a more balanced trade-off, but the best-paired model varies with the system.
We discuss actionable suggestions derived from our findings, hoping to provide the community with a better understanding of this topic. To promote open science, the data and code of this work are publicly available at https://doi.org/10.5281/zenodo.5884197.
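To make the three schemes concrete, the sketch below shows how a small set of configuration samples could be encoded under each scheme. This is not the paper's artifact; the option names, the toy values, and the use of scikit-learn are illustrative assumptions.

```python
# Minimal sketch of the three encoding schemes compared in the study,
# applied to hypothetical configuration samples. NOT the paper's code;
# the options and values are made up for illustration.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Each row is one configuration: [cache_size, compression_level]
configs = np.array([[10, 0],
                    [20, 1],
                    [40, 2]])

# (1) Label encoding: each option value is fed to the model directly
#     as a numeric feature.
label_encoded = configs.astype(float)

# (2) Scaled label encoding: label encoding followed by normalizing
#     each option to the range [0, 1].
scaled_label = MinMaxScaler().fit_transform(configs)

# (3) One-hot encoding: one binary column per distinct value of each
#     option. (Use sparse=False instead of sparse_output=False on
#     scikit-learn < 1.2.)
one_hot = OneHotEncoder(sparse_output=False).fit_transform(configs)

print("label:\n", label_encoded)
print("scaled label:\n", scaled_label)
print("one-hot:\n", one_hot)
```

Note how, in this toy example, one-hot encoding expands two option columns into six binary columns; this growth in dimensionality is consistent with the finding that one-hot encoding tends to incur the slowest training time.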
Thu 19 May (displayed time zone: Eastern Time, US & Canada)
11:00 - 11:50 | Session 11: Machine Learning & Information Retrieval (Technical Papers) | Main room - odd hours
Chair(s): Phuong T. Nguyen (University of L'Aquila)

11:00 | 4m | Short-paper | On the Naturalness of Fuzzer Generated Code
Rajeswari Hita Kambhamettu (Carnegie Mellon University), John Billos (Wake Forest University), Carolyn "Tomi" Oluwaseun-Apo (Pennsylvania State University), Benjamin Gafford (Carnegie Mellon University), Rohan Padhye (Carnegie Mellon University), Vincent J. Hellendoorn (Carnegie Mellon University)

11:04 | 7m | Talk | Does Configuration Encoding Matter in Learning Software Performance? An Empirical Study on Encoding Schemes
DOI | Pre-print | Media Attached

11:11 | 7m | Talk | Multimodal Recommendation of Messenger Channels
Ekaterina Koshchenko (JetBrains Research), Egor Klimov (JetBrains Research), Vladimir Kovalenko (JetBrains Research)

11:18 | 7m | Talk | Senatus: A Fast and Accurate Code-to-Code Recommendation Engine
Fran Silavong (JP Morgan Chase & Co.), Sean Moran (JP Morgan Chase & Co.), Antonios Georgiadis (JP Morgan Chase & Co.), Rohan Saphal (JP Morgan Chase & Co.), Robert Otter (JP Morgan Chase & Co.)
DOI | Pre-print | Media Attached

11:25 | 7m | Talk | Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study
Tatiana Castro Vélez (City University of New York (CUNY) Graduate Center), Raffi Khatchadourian (City University of New York (CUNY) Hunter College), Mehdi Bagherzadeh (Oakland University), Anita Raja (City University of New York (CUNY) Hunter College)
Pre-print | Media Attached

11:32 | 7m | Talk | GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses
Wei Ma (SnT, University of Luxembourg), Mengjie Zhao (LMU Munich), Ezekiel Soremekun (SnT, University of Luxembourg), Qiang Hu (University of Luxembourg), Jie M. Zhang (King's College London), Mike Papadakis (University of Luxembourg), Maxime Cordy (University of Luxembourg), Xiaofei Xie (Singapore Management University), Yves Le Traon (University of Luxembourg)
Pre-print

11:39 | 11m | Live Q&A | Discussions and Q&A