Varangian: A Git Bot for Augmented Static Analysis (MSR 2022 - Industry Track)

Who

Saurabh Pujar, Yunhui Zheng, Luca Buratti, Burn Lewis, Alessandro Morari, Jim A. Laredo, Kevin Postlethwait, Christoph Görn

Track

MSR 2022 Industry Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Fri 20 May 2022 14:18 - 14:25 at MSR Main room - even hours - Session 16: Non-functional Properties (Availability, Security, Legal Aspects) Chair(s): Maxime Lamothe, Jin L.C. Guo

Abstract

The complexity and scale of modern software programs often lead to overlooked programming errors and security vulnerabilities. Developers often rely on automatic tools, like static analysis tools, to look for bugs and vulnerabilities. Static analysis tools are widely used because they can understand nontrivial program behaviors, scale to millions of lines of code, and detect subtle bugs. However, they are known to generate an excess of false alarms which hinder their utilization as it is counterproductive for developers to go through a long list of reported issues, only to find a few true positives. One of the ways proposed to suppress false positives is to use machine learning to identify them. However, training machine learning models requires good quality labeled datasets. For this purpose, we developed D2A , a differential analysis based approach that uses the commit history of a code repository to create a labeled dataset of Infer static analysis output.

The data generated by D2A can be used to train AI Models like Voting, Stacking Ensembles and C-BERT , which learn to identify False Positives. Ensembles are built on top of the Boosting and Tree based classifiers which use hand-crafted features for classifying static analyzer output as True Positives. C-BERT is a BERT-base language model pretrained from scratch on C source code extracted from 10,000 C repositories, thus leveraging “big code”. This model is then fine-tuned on D2A labeled data. This approach views source code as language and automates the extraction of features. The output of the models is a prioritized list of defects ordered by the likelihood of being True Positive.

One way to use the Augmented Static Analyzer to improve developer productivity would be to insert the application in the developer workflow so that its use would seem natural. With this idea in mind, we created Varangian, which is a Git bot that automatically creates issues on the repository based on defects prioritized by the Augmented Static Analyzer. Models trained on the D2A dataset created from the same repository are used to create the prioritized list.

The inference pipeline which produces the prioritized list first applies Infer static analyzer on the latest commit of the repository. Relevant source code is then extracted based on bug report for each defect highlighted by the static analyzer. Bug reports and relevant source code extracted from repositories are used to generate features for AI Models, which then assign a likelihood of being a True Positive to each defect. The git bot then takes the defects most likely to be True Positive and creates issues for each defect. The issue created by Varangian has a lot of information the developer can use for debugging.

The most recent run of Varangian bot prototype on the latest commit of an opensource project created 5 issues out of which 1 was a TP. This gives us an FP/TP ratio of 4/1 which is five times better than the 20/1 ratio we observe for Infer on the test set of the same project. In this presentation, we will showcase Varangian, compare different model training approaches, discuss the challenges involved in building a training and inference pipeline based on code repository, and the impact on performance when moving from the test set to the latest commit.

Saurabh Pujar

IBM Research

United States

Yunhui Zheng

IBM Research

United States

Luca Buratti

IBM Research

United States

Burn Lewis

IBM Research

United States

Alessandro Morari

IBM Research

United States

Jim A. Laredo

IBM Research

United States

Kevin Postlethwait

Red Hat

Christoph Görn

Red Hat

Germany

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

Use conference time zone: (GMT-04:00) Eastern Time (US & Canada)Select other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

Display full programSpecify a time band

Save

Session Program

Fri 20 May
Displayed time zone: Eastern Time (US & Canada) change

14:00 - 15:00	Session 16: Non-functional Properties (Availability, Security, Legal Aspects)Industry Track / Technical Papers / Registered Reports / Data and Tool Showcase Track at MSR Main room - even hours Chair(s): Maxime Lamothe Polytechnique Montreal, Montreal, Canada, Jin L.C. Guo McGill University

14:00 7m Talk		A Deep Study of the Effects and Fixes of Server-Side Request Races in Web Applications Technical Papers Zhengyi Qiu North Carolina State University, Shudi Shao North Carolina State University, Qi Zhao North Carolina State University, Hassan Ali Khan North Carolina State University, Xinning Hui North Carolina State University, Guoliang Jin North Carolina State University Media Attached
14:07 4m Talk		A Large-scale Dataset of (Open Source) License Text VariantsData and Tool Showcase Award Data and Tool Showcase Track Stefano Zacchiroli Télécom Paris, Polytechnic Institute of Paris DOI Pre-print
14:11 7m Talk		SECOM: Towards a convention for security commit messagesFOSS Impact Paper Award Industry Track Sofia Reis Instituto Superior Técnico, U. Lisboa & INESC-ID, Rui Abreu Faculty of Engineering, University of Porto, Portugal, Hakan Erdogmus Carnegie Mellon University, Corina S. Păsăreanu Carnegie Mellon University Pre-print
14:18 7m Talk		Varangian: A Git Bot for Augmented Static Analysis Industry Track Saurabh Pujar IBM Research, Yunhui Zheng IBM Research, Luca Buratti IBM Research, Burn Lewis IBM Research, Alessandro Morari IBM Research, Jim A. Laredo IBM Research, Kevin Postlethwait Red Hat, Christoph Görn Red Hat
14:25 7m Talk		Detecting Privacy-Sensitive Code Changes with Language Modeling Industry Track Gökalp Demirci Meta Platforms, Inc., Vijayaraghavan Murali Meta Platforms, Inc., Imad Ahmad Meta Platforms, Inc., Rajeev Rao Meta Platforms, Inc., Gareth Ari Aye Meta Platforms, Inc.
14:32 4m Talk		Is GitHub's Copilot as Bad As Humans at Introducing Vulnerabilities in Code? Registered Reports Owura Asare University of Waterloo, Mei Nagappan University of Waterloo, N. Asokan University of Waterloo Pre-print
14:36 7m Talk		Finding the Fun in Fundraising: Public Issues and Pull Requests in VC-backed Open-Core Companies Industry Track Kevin Xu GitHub
14:43 17m Live Q&A		Discussions and Q&A Technical Papers

Information for Participants

Fri 20 May 2022 14:00 - 15:00 at MSR Main room - even hours - Session 16: Non-functional Properties (Availability, Security, Legal Aspects) Chair(s): Maxime Lamothe, Jin L.C. Guo

Info for room MSR Main room - even hours:

Click here to go to the room on Midspace