BotHunter: An Approach to Detect Software Bots in GitHub (MSR 2022 - Technical Papers)

Who

Ahmad Abdellatif, Mairieli Wessel, Igor Steinmacher, Marco Gerosa, Emad Shihab

Track

MSR 2022 Technical Papers

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 17 May 2022 22:22 - 22:29 at MSR Main room - even hours - Session 1 Chair(s): Hongyu Zhang, Masud Rahman

Abstract

Bots have become popular in software projects as they play critical roles, from running tests to fixing bugs/vulnerabilities. However, the large number of software bots requires practitioners and researchers to spend extra effort distinguishing human accounts from bot accounts in order to avoid bias in data-driven studies. Researchers have developed several approaches to identify bots at specific activity levels (issue/pull request or commit), considering a single repository and disregarding features that were shown to be effective in other domains. To address this gap, we propose a machine-learning-based approach to identify bot accounts regardless of their activity level. We extracted 19 features related to the account’s profile information, activities, and comment similarity. Then, we evaluated the performance of five machine learning classifiers using a dataset of more than 5,000 GitHub accounts. Our results show that the Random Forest classifier performs the best, with an F1-score of 92.4% and an AUC of 98.7%. Furthermore, account profile information (e.g., the account login) provides the most important features for identifying the account type. Finally, we compare the performance of the Random Forest classifier to the state-of-the-art approaches, and our results show that our Random Forest model outperforms the state-of-the-art techniques in identifying the account types regardless of their activity level.
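
As a rough illustration of what a comment-similarity feature could look like, the sketch below computes the mean pairwise Jaccard similarity over an account's comments. The abstract does not say which similarity measure BotHunter uses, so the choice of Jaccard over word sets, the function names, and the sample comments are all assumptions made for this example and not the authors' implementation.

```python
from itertools import combinations
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the lowercase word sets of two comments."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty comments are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def mean_comment_similarity(comments: list[str]) -> float:
    """Hypothetical feature: mean pairwise Jaccard similarity of an account's comments.

    Returns 0.0 when the account has fewer than two comments.
    """
    pairs = list(combinations(comments, 2))
    if not pairs:
        return 0.0
    return mean(jaccard(a, b) for a, b in pairs)


# Illustrative (made-up) comments: templated bot output vs. free-form human text.
bot_comments = ["Build passed for commit abc", "Build passed for commit def"]
human_comments = ["Thanks, looks good to me", "Could you add a test here?"]

bot_score = mean_comment_similarity(bot_comments)
human_score = mean_comment_similarity(human_comments)
```

The intuition is that accounts posting templated messages tend to score higher on such a feature than accounts writing free-form text. A value like this would be only one input among the 19 features the abstract mentions.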

Link to Preprint

http://das.encs.concordia.ca/uploads/Abdellatif2022MSR.pdf

Ahmad Abdellatif

Concordia University

Canada

Mairieli Wessel

Delft University of Technology

Netherlands

Igor Steinmacher

Northern Arizona University

Brazil

Marco Gerosa

Northern Arizona University, USA

United States

Emad Shihab

Concordia University

Canada

Session Program

Tue 17 May
Displayed time zone: Eastern Time (US & Canada)

22:00 - 22:50	Session 1: Technical Papers / Registered Reports at MSR Main room - even hours. Chair(s): Hongyu Zhang (University of Newcastle), Masud Rahman (Dalhousie University)

22:00 4m Short-paper	An Empirical Evaluation of GitHub Copilot’s Code Suggestions (Technical Papers). Nhan Nguyen (University of Alberta), Sarah Nadi (University of Alberta)
22:04 4m Short-paper	Comments on Comments: Where Code Review and Documentation Meet (Technical Papers). Nikitha Rao (Carnegie Mellon University), Jason Tsay (IBM Research), Martin Hirzel (IBM Research), Vincent J. Hellendoorn (Carnegie Mellon University)
22:08 7m Talk	Does This Apply to Me? An Empirical Study of Technical Context in Stack Overflow (Technical Papers). Akalanka Galappaththi (University of Alberta), Sarah Nadi (University of Alberta), Christoph Treude (University of Melbourne)
22:15 7m Talk	Towards Reliable Agile Iterative Planning via Predicting Documentation Changes of Work Items (Technical Papers). Jirat Pasuksmit (University of Melbourne), Patanamon Thongtanunam (University of Melbourne), Shanika Karunasekera (The University of Melbourne)
22:22 7m Talk	BotHunter: An Approach to Detect Software Bots in GitHub (Technical Papers). Ahmad Abdellatif (Concordia University), Mairieli Wessel (Delft University of Technology), Igor Steinmacher (Northern Arizona University), Marco Gerosa (Northern Arizona University, USA), Emad Shihab (Concordia University)
22:29 7m Talk	Recommending Code Improvements Based on Stack Overflow Answer Edits (Registered Reports). Chaiyong Ragkhitwetsagul (Mahidol University, Thailand), Matheus Paixao (University of Fortaleza)
22:36 14m Live Q&A	Discussions and Q&A (Technical Papers)
