A Needle is an Outlier in a Haystack: Hunting Malicious PyPI Packages with Code Clustering (ASE 2023 - Research Papers)

Who

Wentao Liang, Xiang Ling, Jingzheng Wu, Tianyue Luo, Yanjun Wu

Track

ASE 2023 Research Papers

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 12 Sep 2023 13:30 - 13:42 at Room E - Vulnerability and Security 1 Chair(s): Fatemeh Hendijani Fard

Abstract

As the most popular Python software repository, PyPI has become an indispensable part of the Python ecosystem. Regrettably, the open nature of PyPI exposes end users to substantial security risks from malicious packages. Consequently, the timely and effective identification of malware among the vast number of newly uploaded PyPI packages has become a pressing concern. Existing detection methods depend on explicit knowledge that is difficult to obtain, such as taint sources, sinks, and malicious code patterns, which makes them prone to overlooking emergent malicious packages.

In this paper, we present MPHunter, a lightweight and effective method for detecting malicious packages without requiring any explicit prior knowledge. MPHunter builds on two observations. First, malicious packages are considerably rarer than benign ones. Second, the functionality of installation scripts in malicious packages diverges significantly from that of benign packages, whose installation scripts frequently form clusters. MPHunter therefore clusters the installation scripts of PyPI packages and identifies the outliers. It then ranks the outliers by their outlierness and by their distance to known malicious instances, thereby highlighting potentially malicious packages.
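The cluster-then-rank idea can be illustrated with a minimal sketch. The abstract does not name the features, the clustering algorithm, or the ranking formula, so everything here is an assumption: installation scripts are taken to be already embedded as numeric vectors, a simple density test (too few neighbours within a radius `eps`) stands in for the clustering step, and the score `outlierness / (1 + distance to nearest known malicious vector)` is a hypothetical way to combine the two ranking signals. It is not MPHunter's implementation.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def find_outliers(vectors, eps, min_neighbors):
    """Return indices of vectors with fewer than `min_neighbors` other
    vectors within radius `eps` (a stand-in for a real clustering step)."""
    outliers = []
    for i, v in enumerate(vectors):
        neighbors = sum(
            1 for j, w in enumerate(vectors) if j != i and dist(v, w) <= eps
        )
        if neighbors < min_neighbors:
            outliers.append(i)
    return outliers


def rank_outliers(vectors, outlier_idx, known_malicious, k=2):
    """Order outliers from most to least suspicious.

    Outlierness is the distance to the k-th nearest other vector. The score
    (hypothetical) grows with outlierness and shrinks with the distance to
    the nearest known malicious vector.
    """
    scored = []
    for i in outlier_idx:
        others = sorted(
            dist(vectors[i], w) for j, w in enumerate(vectors) if j != i
        )
        outlierness = others[min(k, len(others)) - 1] if others else 0.0
        d_mal = min(
            (dist(vectors[i], m) for m in known_malicious), default=0.0
        )
        scored.append((outlierness / (1.0 + d_mal), i))
    scored.sort(reverse=True)
    return [i for _, i in scored]


# Toy data (made up): two benign clusters and two isolated scripts.
vectors = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),  # benign cluster A
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),              # benign cluster B
    (2.5, 2.5),                                      # isolated script
    (10.0, 0.0),                                     # isolated script
]
known_malicious = [(10.0, 1.0)]  # embedding of a previously confirmed sample

outliers = find_outliers(vectors, eps=0.5, min_neighbors=2)
ranking = rank_outliers(vectors, outliers, known_malicious)
print(outliers, ranking)
```

On this toy data the two isolated vectors (indices 7 and 8) are flagged, and index 8 is ranked first because it is both far from every cluster and close to the known malicious sample.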

With MPHunter, we identified 60 previously unknown malicious packages from a pool of 31,329 newly uploaded packages over a two-month period. All of them have been officially confirmed by PyPI. Moreover, a manual analysis shows that MPHunter recognizes all potentially malicious installation scripts, achieving a recall of 100% across all analyzed packages. We believe that MPHunter offers a valuable supplement to existing detection techniques, augmenting the arsenal of software supply chain security analysis.

File attachments

Speech_Slice (ASE-Speech-v1.9.pdf)	2.14MiB

Wentao Liang

Institute of Software, Chinese Academy of Sciences

Xiang Ling

Institute of Software, Chinese Academy of Sciences

China

Jingzheng Wu

Institute of Software, The Chinese Academy of Sciences

Tianyue Luo

Institute of Software, Chinese Academy of Sciences

Yanjun Wu

Institute of Software, Chinese Academy of Sciences

Session Program

Tue 12 Sep
Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna

13:30 - 15:00	Vulnerability and Security 1 (Research Papers / Journal-first Papers) at Room E Chair(s): Fatemeh Hendijani Fard (University of British Columbia)

13:30 12m Talk	A Needle is an Outlier in a Haystack: Hunting Malicious PyPI Packages with Code Clustering	Research Papers	Wentao Liang (Institute of Software, Chinese Academy of Sciences), Xiang Ling (Institute of Software, Chinese Academy of Sciences), Jingzheng Wu (Institute of Software, The Chinese Academy of Sciences), Tianyue Luo (Institute of Software, Chinese Academy of Sciences), Yanjun Wu (Institute of Software, Chinese Academy of Sciences)
13:42 12m Talk	Merge-Replay: Efficient IFDS-Based Taint Analysis by Consolidating Equivalent Value Flows	Research Papers	Yujiang Gui (UNSW Sydney), Dongjie He (UNSW), Jingling Xue (UNSW)
13:54 12m Talk	Learning to Locate and Describe Vulnerabilities	Research Papers	Jian Zhang (Nanyang Technological University), Shangqing Liu (Nanyang Technological University), Xu Wang (Beihang University), Li Tianlin (Nanyang Technological University), Yang Liu (Nanyang Technological University)
14:06 12m Talk	When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection	Research Papers	Xin-Cheng Wen (Harbin Institute of Technology), Xinchen Wang (Harbin Institute of Technology), Cuiyun Gao (Harbin Institute of Technology), Shaohua Wang (New Jersey Institute of Technology), Yang Liu (Nanyang Technological University), Zhaoquan Gu (Harbin Institute of Technology)
14:18 12m Talk	The Secret Life of Software Vulnerabilities: A Large-Scale Empirical Study	Journal-first Papers	Emanuele Iannone (University of Salerno), Roberta Guadagni (University of Salerno), Filomena Ferrucci (University of Salerno), Andrea De Lucia (University of Salerno), Fabio Palomba (University of Salerno)
14:30 12m Talk	SCPatcher: Mining Crowd Security Discussions to Enrich Secure Coding Practices	Research Papers	Ziyou Jiang (Institute of Software at Chinese Academy of Sciences), Lin Shi (Beihang University), Guowei Yang (University of Queensland), Qing Wang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences)