A Needle is an Outlier in a Haystack: Hunting Malicious PyPI Packages with Code Clustering
As the most popular Python software repository, PyPI has become an indispensable part of the Python ecosystem. Regrettably, the open nature of PyPI exposes end-users to substantial security risks stemming from malicious packages. Consequently, the timely and effective identification of malware among the vast number of newly uploaded PyPI packages has emerged as a pressing concern. Existing detection methods depend on explicit knowledge that is difficult to obtain, such as taint sources, sinks, and malicious code patterns, which makes them liable to overlook emergent malicious packages.
In this paper, we present a lightweight and effective method, namely MPHunter, to detect malicious packages without requiring any explicit prior knowledge. MPHunter is founded upon two observations. First, malicious packages are considerably rarer than benign ones. Second, the functionality of installation scripts in malicious packages diverges significantly from that of benign packages, and the benign scripts frequently form clusters. Consequently, MPHunter uses clustering techniques to group the installation scripts of PyPI packages and identify outliers. MPHunter then ranks the outliers according to their outlierness and their distance to known malicious instances, thereby highlighting potentially malicious packages.
With MPHunter, we identified 60 previously unknown malicious packages from a pool of 31,329 newly uploaded packages over a two-month period. All of them have been confirmed by PyPI officials. Moreover, a manual analysis shows that MPHunter recognizes all potentially malicious installation scripts, achieving a recall of 100% across the analyzed packages. We believe that MPHunter offers a valuable supplement to existing detection techniques, augmenting the arsenal of software supply chain security analysis.
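The cluster-then-rank idea described in the abstract can be illustrated with a minimal sketch. This is not MPHunter's implementation: the bag-of-identifiers vectors, the cosine distance, the simplified density rule (in the spirit of DBSCAN noise detection), the additive score, and all names and parameters (`vectorize`, `find_outliers`, `rank_outliers`, `eps`, `min_neighbors`, `k`) are assumptions chosen only to keep the example self-contained. The sample scripts are made up for illustration.

```python
import math
import re
from collections import Counter


def vectorize(script: str) -> Counter:
    """Bag of identifiers; a simple stand-in for a real code embedding."""
    return Counter(re.findall(r"[A-Za-z_]\w*", script))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def find_outliers(vectors, eps=0.5, min_neighbors=2):
    """Indices with fewer than min_neighbors other points within eps.

    Simplified density rule: unlike full DBSCAN, border points are not
    rescued by a nearby core point.
    """
    outliers = []
    for i, v in enumerate(vectors):
        neighbors = sum(
            1 for j, w in enumerate(vectors)
            if j != i and cosine_distance(v, w) <= eps
        )
        if neighbors < min_neighbors:
            outliers.append(i)
    return outliers


def rank_outliers(vectors, outliers, known_malicious, k=2):
    """Sort outliers, most suspicious first.

    Assumed score: mean distance to the k nearest other scripts
    (outlierness) plus closeness to the nearest known malicious sample.
    """
    def score(i):
        dists = sorted(
            cosine_distance(vectors[i], w)
            for j, w in enumerate(vectors) if j != i
        )
        outlierness = sum(dists[:k]) / k
        closeness = 1.0 - min(
            cosine_distance(vectors[i], m) for m in known_malicious
        )
        return outlierness + closeness

    return sorted(outliers, key=score, reverse=True)


# Hypothetical installation scripts: three typical ones and two unusual ones.
scripts = [
    "from setuptools import setup\nsetup(name='a', version='1.0', packages=['a'])",
    "from setuptools import setup, find_packages\n"
    "setup(name='b', version='0.2', packages=find_packages())",
    "from setuptools import setup\n"
    "setup(name='c', version='2.1', install_requires=['requests'])",
    "import os, base64\nos.system(base64.b64decode(payload).decode())",
    "from distutils.core import Extension\n"
    "ext = Extension('fast', sources=['fast.c'])",
]
known_bad = [vectorize("import os, base64\nexec(base64.b64decode(blob))")]

vectors = [vectorize(s) for s in scripts]
outliers = find_outliers(vectors)
ranked = rank_outliers(vectors, outliers, known_bad)
print("outliers:", outliers)
print("ranked:", ranked)
```

In this toy data, the three ordinary `setup()` scripts fall within `eps` of one another, while the two unusual scripts have no close neighbors. The ranking step then separates the unusual but harmless script from the one that resembles the known malicious sample.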
Tue 12 Sep | Displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna
13:30 - 15:00 | Vulnerability and Security 1 | Research Papers / Journal-first Papers | Room E | Chair(s): Fatemeh Hendijani Fard (University of British Columbia)
13:30 | 12m | Talk | A Needle is an Outlier in a Haystack: Hunting Malicious PyPI Packages with Code Clustering | Research Papers | Wentao Liang (Institute of Software, Chinese Academy of Sciences); Xiang Ling (Institute of Software, Chinese Academy of Sciences); Jingzheng Wu (Institute of Software, Chinese Academy of Sciences); Tianyue Luo (Institute of Software, Chinese Academy of Sciences); Yanjun Wu (Institute of Software, Chinese Academy of Sciences)
13:42 | 12m | Talk | Merge-Replay: Efficient IFDS-Based Taint Analysis by Consolidating Equivalent Value Flows | Research Papers
13:54 | 12m | Talk | Learning to Locate and Describe Vulnerabilities | Research Papers | Jian Zhang (Nanyang Technological University); Shangqing Liu (Nanyang Technological University); Xu Wang (Beihang University); Li Tianlin (Nanyang Technological University); Yang Liu (Nanyang Technological University)
14:06 | 12m | Talk | When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection | Research Papers | Xin-Cheng Wen (Harbin Institute of Technology); Xinchen Wang (Harbin Institute of Technology); Cuiyun Gao (Harbin Institute of Technology); Shaohua Wang (New Jersey Institute of Technology); Yang Liu (Nanyang Technological University); Zhaoquan Gu (Harbin Institute of Technology)
14:18 | 12m | Talk | The Secret Life of Software Vulnerabilities: A Large-Scale Empirical Study | Journal-first Papers | Emanuele Iannone (University of Salerno); Roberta Guadagni (University of Salerno); Filomena Ferrucci (University of Salerno); Andrea De Lucia (University of Salerno); Fabio Palomba (University of Salerno)
14:30 | 12m | Talk | SCPatcher: Mining Crowd Security Discussions to Enrich Secure Coding Practices | Research Papers | Ziyou Jiang (Institute of Software at Chinese Academy of Sciences); Lin Shi (Beihang University); Guowei Yang (University of Queensland); Qing Wang (Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences)