GPTSniffer: A CodeBERT-based classifier to detect source code written by ChatGPT (ESEIW 2024 - ESEM Journal-First Papers)

Who

Phuong T. Nguyen, Juri Di Rocco, Claudio Di Sipio, Riccardo Rubei, Davide Di Ruscio, Massimiliano Di Penta

Track

ESEIW 2024 ESEM Journal-First Papers

Time Zone

The program is currently displayed in (GMT+02:00) Brussels, Copenhagen, Madrid, Paris.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Thu 24 Oct 2024 17:15 - 17:30 at Telensenyament (B3 Building - 1st Floor) - Machine learning for software engineering Chair(s): Luigi Quaranta

Abstract

Since its launch in November 2022, ChatGPT has gained popularity among users, especially programmers who use it to solve development issues. However, while it offers a practical solution to programming problems, ChatGPT should be used primarily as a supporting tool (e.g., in software education) rather than as a replacement for humans. Automatically detecting source code generated by ChatGPT is therefore necessary, and tools for identifying AI-generated content need to be adapted to work effectively with code. This paper presents GPTSniffer, a novel approach to detecting AI-written source code, built on top of CodeBERT. We conducted an empirical study to investigate the feasibility of automatically identifying AI-generated code and the factors that influence this ability. The results show that GPTSniffer can accurately classify whether code is human-written or AI-generated, outperforming two baselines, GPTZero and OpenAI Text Classifier. The study also shows that similar training data or a classification context with paired snippets helps boost prediction performance. We conclude that GPTSniffer can be leveraged in different contexts, e.g., in software engineering education, where teachers use the tool to detect cheating and plagiarism, or in development, where AI-generated code may require specific quality assurance activities.

Link to Publication

https://www.sciencedirect.com/science/article/pii/S0164121224001043

Link to Preprint

https://www.researchgate.net/publication/379879523_GPTSniffer_A_CodeBERT-based_classifier_to_detect_source_code_written_by_ChatGPT

DOI

https://doi.org/10.1016/j.jss.2024.112059

Phuong T. Nguyen

University of L'Aquila

Italy

Juri Di Rocco

University of L'Aquila

Italy

Claudio Di Sipio

University of L'Aquila

Italy

Riccardo Rubei

University of L'Aquila

Italy

Davide Di Ruscio

University of L'Aquila

Italy

Massimiliano Di Penta

University of Sannio

Italy

Session Program

Thu 24 Oct
Displayed time zone: Brussels, Copenhagen, Madrid, Paris

16:00 - 17:30	Machine learning for software engineering
ESEM Technical Papers / ESEM Emerging Results, Vision and Reflection Papers Track / ESEM Journal-First Papers, at Telensenyament (B3 Building - 1st Floor)
Chair(s): Luigi Quaranta (University of Bari, Italy)

16:00	20m	Full-paper	A Transformer-based Approach for Augmenting Software Engineering Chatbots Datasets (ESEM Technical Papers)
Ahmad Abdellatif (University of Calgary), Khaled Badran (Concordia University, Canada), Diego Elias Costa (Concordia University, Canada), Emad Shihab (Concordia University)

16:20	20m	Full-paper	Unsupervised and Supervised Co-learning for Comment-based Codebase Refining and its Application in Code Search (ESEM Technical Papers)
Gang Hu (School of Information Science & Engineering, Yunnan University), Xiaoqin Zeng (School of Information Science & Engineering, Yunnan University), Wanlong Yu, Min Peng, YUAN Mengting (School of Computer Science, Wuhan University, Wuhan, China), Liang Duan

16:40	20m	Full-paper	Good things come in three: Generating SO Post Titles with Pre-Trained Models, Self Improvement and Post Ranking (ESEM Technical Papers)
Duc Anh Le (Hanoi University of Science and Technology), Anh M. T. Bui (Hanoi University of Science and Technology), Phuong T. Nguyen (University of L'Aquila), Davide Di Ruscio (University of L'Aquila)

17:00	15m	Vision and Emerging Results	PromptLink: Multi-template prompt learning with adversarial training for issue-commit link recovery (ESEM Emerging Results, Vision and Reflection Papers Track)
Yang Deng (The School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China), Bangchao Wang (Wuhan Textile University), Zhiyuan Zou (The School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China), Luyao Ye (The School of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China)

17:15	15m	Journal Early-Feedback	GPTSniffer: A CodeBERT-based classifier to detect source code written by ChatGPT (ESEM Journal-First Papers)
Phuong T. Nguyen (University of L'Aquila), Juri Di Rocco (University of L'Aquila), Claudio Di Sipio (University of L'Aquila), Riccardo Rubei (University of L'Aquila), Davide Di Ruscio (University of L'Aquila), Massimiliano Di Penta (University of Sannio, Italy)