On the Naturalness of Fuzzer Generated Code (MSR 2022 - Technical Papers)

Who

Rajeswari Hita Kambhamettu, John Billos, Carolyn "Tomi" Oluwaseun-Apo, Benjamin Gafford, Rohan Padhye, Vincent J. Hellendoorn

Track

MSR 2022 Technical Papers

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Thu 19 May 2022 11:00 - 11:04 at MSR Main room - odd hours - Session 11: Machine Learning & Information Retrieval Chair(s): Phuong T. Nguyen
Mon 23 May 2022 14:23 - 14:31 at Room 315+316 - Blended Technical Session 2 (Machine Learning and Information Retrieval) Chair(s): Preetha Chatterjee

Abstract

Compiler fuzzing tools such as Csmith have uncovered many bugs in compilers by randomly sampling programs from a generative model. The success of these tools is often attributed to their ability to generate unexpected corner case inputs that developers tend to overlook during manual testing. At the same time, their chaotic nature makes fuzzer-generated test cases notoriously hard to interpret, which has lead to the creation of input simplification tools such as C-Reduce (for C compiler bugs). In until now unrelated work, researchers have also shown that human-written software tends to be rather repetitive and predictable to language models. Studies show that developers deliberately write more predictable code, whereas code with bugs is relatively unpredictable. In this study, we ask the natural questions of whether this high predictability property of code also, and perhaps counter-intuitively, applies to fuzzer-generated code. That is, we investigate whether fuzzer-generated compiler inputs are deemed unpredictable by a language model built on human-written code and surprisingly conclude that \emph{it is not}. To the contrary, Csmith fuzzer-generated programs are \emph{more} predictable on a per-token basis than human-written C programs. Furthermore, bug-triggering tended to be more predictable still than random inputs, and the C-Reduce minimization tool did not substantially increase this predictability. Rather, we find that bug-triggering inputs are unpredictable relative to \emph{Csmith’s own} generative model. This is encouraging; our results suggest promising research directions on incorporating predictability metrics in the fuzzing and reduction tools themselves.

Rajeswari Hita Kambhamettu

Carnegie Mellon University

John Billos

Wake Forest University

Carolyn "Tomi" Oluwaseun-Apo

Pennsylvania State University

Benjamin Gafford

Carnegie Mellon University

Rohan Padhye

Carnegie Mellon University

Vincent J. Hellendoorn

Carnegie Mellon University

United States

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

Display full programSpecify a time band

Save

Session Program

Thu 19 May
Displayed time zone: Eastern Time (US & Canada) change

11:00 - 11:50	Session 11: Machine Learning & Information RetrievalTechnical Papers at MSR Main room - odd hours Chair(s): Phuong T. Nguyen University of L’Aquila

11:00 4m Short-paper		On the Naturalness of Fuzzer Generated Code Technical Papers Rajeswari Hita Kambhamettu Carnegie Mellon University, John Billos Wake Forest University, Carolyn "Tomi" Oluwaseun-Apo Pennsylvania State University, Benjamin Gafford Carnegie Mellon University, Rohan Padhye Carnegie Mellon University, Vincent J. Hellendoorn Carnegie Mellon University
11:04 7m Talk		Does Configuration Encoding Matter in Learning Software Performance? An Empirical Study on Encoding Schemes Technical Papers Jingzhi Gong Loughborough University, Tao Chen Loughborough University DOI Pre-print Media Attached
11:11 7m Talk		Multimodal Recommendation of Messenger Channels Technical Papers Ekaterina Koshchenko JetBrains Research, Egor Klimov JetBrains Research, Vladimir Kovalenko JetBrains Research
11:18 7m Talk		Senatus: A Fast and Accurate Code-to-Code Recommendation Engine Technical Papers Fran Silavong JP Morgan Chase & Co., Sean Moran JP Morgan Chase & Co., Antonios Georgiadis JP Morgan Chase & Co., Rohan Saphal JP Morgan Chase & Co., Robert Otter JP Morgan Chase & Co. DOI Pre-print Media Attached
11:25 7m Talk		Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study Technical Papers Tatiana Castro Vélez City University of New York (CUNY) Graduate Center, Raffi Khatchadourian City University of New York (CUNY) Hunter College, Mehdi Bagherzadeh Oakland University, Anita Raja City University of New York (CUNY) Hunter College Pre-print Media Attached
11:32 7m Talk		GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses Technical Papers Wei Ma SnT, University of Luxembourg, Mengjie Zhao LMU Munich, Ezekiel Soremekun SnT, University of Luxembourg, Qiang Hu University of Luxembourg, Jie M. Zhang King's College London, Mike Papadakis University of Luxembourg, Luxembourg, Maxime Cordy University of Luxembourg, Luxembourg, Xiaofei Xie Singapore Management University, Singapore, Yves Le Traon University of Luxembourg, Luxembourg Pre-print
11:39 11m Live Q&A		Discussions and Q&A Technical Papers

Mon 23 May
Displayed time zone: Eastern Time (US & Canada) change

13:30 - 15:00	Blended Technical Session 2 (Machine Learning and Information Retrieval) Technical Papers / Data and Tool Showcase Track at Room 315+316 Chair(s): Preetha Chatterjee Drexel University, USA

13:30 15m Talk		Methods for Stabilizing Models across Large Samples of Projects(with case studies on Predicting Defect and Project Health) Technical Papers Suvodeep Majumder North Carolina State University, Tianpei Xia North Carolina State University, Rahul Krishna North Carolina State University, Tim Menzies North Carolina State University Pre-print Media Attached
13:45 15m Talk		GraphCode2Vec: Generic Code Embedding via Lexical and Program Dependence Analyses Technical Papers Wei Ma SnT, University of Luxembourg, Mengjie Zhao LMU Munich, Ezekiel Soremekun SnT, University of Luxembourg, Qiang Hu University of Luxembourg, Jie M. Zhang King's College London, Mike Papadakis University of Luxembourg, Luxembourg, Maxime Cordy University of Luxembourg, Luxembourg, Xiaofei Xie Singapore Management University, Singapore, Yves Le Traon University of Luxembourg, Luxembourg Pre-print
14:00 15m Talk		Senatus: A Fast and Accurate Code-to-Code Recommendation Engine Technical Papers Fran Silavong JP Morgan Chase & Co., Sean Moran JP Morgan Chase & Co., Antonios Georgiadis JP Morgan Chase & Co., Rohan Saphal JP Morgan Chase & Co., Robert Otter JP Morgan Chase & Co. DOI Pre-print Media Attached
14:15 8m Short-paper		Comments on Comments: Where Code Review and Documentation Meet Technical Papers Nikitha Rao Carnegie Mellon University, Jason Tsay IBM Research, Martin Hirzel IBM Research, Vincent J. Hellendoorn Carnegie Mellon University DOI Pre-print File Attached
14:23 8m Short-paper		On the Naturalness of Fuzzer Generated Code Technical Papers Rajeswari Hita Kambhamettu Carnegie Mellon University, John Billos Wake Forest University, Carolyn "Tomi" Oluwaseun-Apo Pennsylvania State University, Benjamin Gafford Carnegie Mellon University, Rohan Padhye Carnegie Mellon University, Vincent J. Hellendoorn Carnegie Mellon University
14:31 8m Talk		SOSum: A Dataset of Stack Overflow Post Summaries Data and Tool Showcase Track Bonan Kou Purdue University, Yifeng Di Purdue University, Muhao Chen University of Southern California, Tianyi Zhang Purdue University
14:39 21m Live Q&A		Discussions and Q&A Technical Papers

Information for Participants

Thu 19 May 2022 11:00 - 11:50 at MSR Main room - odd hours - Session 11: Machine Learning & Information Retrieval Chair(s): Phuong T. Nguyen

Info for room MSR Main room - odd hours:

Click here to go to the room on Midspace