A Needle is an Outlier in a Haystack: Hunting Malicious PyPI Packages with Code Clustering
As the most popular Python software repository, PyPI has become an indispensable part of the Python ecosystem. Regrettably, the open nature of PyPI exposes end-users to substantial security risks stemming from malicious packages. Consequently, the timely and effective identification of malware among the vast number of newly uploaded PyPI packages has emerged as a pressing concern. Existing detection methods depend on explicit knowledge that is difficult to obtain, such as taint sources, sinks, and malicious code patterns, which makes them prone to overlooking emerging malicious packages.
In this paper, we present MPHunter, a lightweight and effective method for detecting malicious packages without requiring any explicit prior knowledge. MPHunter is founded upon two observations. First, malicious packages are considerably rarer than benign ones. Second, the functionality of installation scripts in malicious packages diverges significantly from that of benign packages, whose installation scripts frequently form clusters. Accordingly, MPHunter uses clustering techniques to group the installation scripts of PyPI packages and identifies the outliers. MPHunter then ranks the outliers according to their outlierness and their distance to known malicious instances, thereby highlighting potentially malicious packages.
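The cluster-then-rank idea described above can be illustrated with a small, self-contained sketch. This is not MPHunter's implementation: the abstract does not specify the embedding, the clustering algorithm, or the ranking formula, so everything below is an assumption made for illustration. The bag-of-tokens `embed` stands in for a real code embedding, mean k-nearest-neighbour distance stands in for a clustering-based outlier score, and the `threshold` value and the additive ranking score are hypothetical choices.

```python
import math
import re
from collections import Counter


def embed(script: str) -> Counter:
    """Toy embedding (assumption): a bag of identifier-like tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*", script))


def cosine_distance(a: Counter, b: Counter) -> float:
    """1 - cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def outlierness(vectors, k=2):
    """Mean distance to the k nearest neighbours; larger means more isolated."""
    scores = []
    for i, v in enumerate(vectors):
        dists = sorted(
            cosine_distance(v, w) for j, w in enumerate(vectors) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


def rank_suspects(scripts, known_malicious, k=2, threshold=0.5):
    """Return names of outlying scripts, most suspicious first."""
    names = list(scripts)
    vectors = [embed(scripts[n]) for n in names]
    bad = [embed(s) for s in known_malicious]
    ranked = []
    for name, vec, score in zip(names, vectors, outlierness(vectors, k)):
        if score < threshold:
            continue  # sits inside a dense cluster, assumed benign
        nearest_bad = min((cosine_distance(vec, b) for b in bad), default=1.0)
        # Hypothetical combination: isolated scripts that also resemble
        # known malware are ranked highest.
        ranked.append((score + (1.0 - nearest_bad), name))
    return [name for _, name in sorted(ranked, reverse=True)]


# Made-up installation scripts: three look-alike benign ones and two outliers.
scripts = {
    "alpha": "from setuptools import setup\nsetup(name='alpha', version='1.0', packages=['alpha'])",
    "beta": "from setuptools import setup\nsetup(name='beta', version='1.0', packages=['beta'])",
    "gamma": "from setuptools import setup\nsetup(name='gamma', version='1.0', packages=['gamma'])",
    "stealer": "import os, base64\nexec(base64.b64decode(os.environ['PAYLOAD']))\nsetup(name='x')",
    "cext": "from distutils.core import setup, Extension\nsetup(ext_modules=[Extension('fast', sources=['fast.c'])])",
}
known_malicious = ["import os, base64\nexec(base64.b64decode(os.environ['TOKEN']))"]

suspects = rank_suspects(scripts, known_malicious)
print(suspects)
```

In this toy data set the three near-identical `setuptools` scripts fall below the threshold and are dropped, while the two unusual scripts survive as outliers; the one that resembles the known malicious sample is placed ahead of the merely unusual C-extension build script. A realistic pipeline would replace each stand-in with a proper code representation and clustering method.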
With MPHunter, we identified 60 previously unknown malicious packages from a pool of 31,329 newly uploaded packages over a two-month period. All of them have been confirmed by PyPI officials. Moreover, a manual analysis shows that MPHunter recognizes all potentially malicious installation scripts, achieving a recall of 100% across all analyzed packages. We believe that MPHunter offers a valuable supplement to existing detection techniques, augmenting the arsenal of software supply chain security analysis.
Tue 12 Sep (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
13:30 - 15:00
|A Needle is an Outlier in a Haystack: Hunting Malicious PyPI Packages with Code Clustering|
Wentao Liang Institute of Software, Chinese Academy of Sciences, Xiang Ling Institute of Software, Chinese Academy of Sciences, Jingzheng Wu Institute of Software, The Chinese Academy of Sciences, Tianyue Luo Institute of Software, Chinese Academy of Sciences, Yanjun Wu Institute of Software, Chinese Academy of Sciences
|Merge-Replay: Efficient IFDS-Based Taint Analysis by Consolidating Equivalent Value Flows|
Research Papers
|Learning to Locate and Describe Vulnerabilities|
|When Less is Enough: Positive and Unlabeled Learning Model for Vulnerability Detection|
|The Secret Life of Software Vulnerabilities: A Large-Scale Empirical Study|
Emanuele Iannone University of Salerno, Roberta Guadagni University of Salerno, Filomena Ferrucci University of Salerno, Andrea De Lucia University of Salerno, Fabio Palomba University of Salerno
|SCPatcher: Mining Crowd Security Discussions to Enrich Secure Coding Practices|
Ziyou Jiang Institute of Software at Chinese Academy of Sciences, Lin Shi Beihang University, Guowei Yang University of Queensland, Qing Wang Institute of Software at Chinese Academy of Sciences; University of Chinese Academy of Sciences