1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection (ASE 2024 - Research Papers)

Who

Xiaobing Sun, Xingan Gao, Sicong Cao, Lili Bo, Xiaoxue Wu, Kaifeng Huang

Track

ASE 2024 Research Papers

When

Thu 31 Oct 2024 16:00 - 16:15 at Gardenia - Malicious code and package Chair(s): Curtis Atkisson

Abstract

PyPI, the official package registry for Python, has become a major channel for adversaries to distribute malicious packages and carry out open source supply chain attacks. In recent years, PyPI has seen a surge in malicious package uploads. Given the huge number of new packages uploaded every day, it is unrealistic to manually inspect and label each package as malicious or benign. Therefore, automated malicious PyPI package detection is critical to safeguarding the security of open source software. Existing approaches use static feature extraction and metadata to detect malicious packages. However, the feature extraction relies on rules predefined by experts, and attackers who know the rules can evade detection. In addition, existing approaches use only partial information from the metadata, such as the package name and version, which limits the effectiveness of malicious package detection.

In this paper, we propose EA4MP, a novel approach that integrates deep code behavior features with metadata features to detect malicious packages on PyPI. Specifically, EA4MP extracts code behavior sequences from all script files and fine-tunes a BERT model to learn deep semantic features of malicious code. In addition, EA4MP extracts metadata features from the PKG-INFO file based on a group of predefined expert rules and trains an ML model on them. Finally, EA4MP constructs an ensemble classifier based on the AdaBoost algorithm to detect malicious packages. We evaluated EA4MP against VirusTotal, OSSGadget, and Bandit4Mal on a newly constructed dataset. The experimental results show that EA4MP improves precision by 6.9%-24.6% and recall by 10.5%-18.4%. We also monitored 46,573 software packages uploaded to PyPI between March 28 and April 18, 2024, of which EA4MP flagged 119 as malicious. We reported these packages to PyPI officials, and 82 of them have been removed.
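
To make the metadata step more concrete, here is a minimal sketch of turning a PKG-INFO file into numeric features. It is not EA4MP's implementation: the abstract does not list the expert rules, so the function name `metadata_features` and every feature in the returned dictionary are illustrative assumptions. The sketch relies only on the fact that PKG-INFO uses email-style headers, which Python's standard `email` module can parse.

```python
from email import message_from_string


def metadata_features(pkg_info_text: str) -> dict:
    """Map PKG-INFO text to simple numeric features.

    All features here are hypothetical examples, not the rules used by EA4MP.
    """
    msg = message_from_string(pkg_info_text)

    def field(name: str) -> str:
        # Missing headers come back as None. Treating the literal "UNKNOWN"
        # as missing is an assumption about placeholder values.
        value = msg.get(name)
        text = "" if value is None else str(value).strip()
        return "" if text.upper() == "UNKNOWN" else text

    return {
        "has_author": int(bool(field("Author"))),
        "has_author_email": int(bool(field("Author-email"))),
        "has_home_page": int(bool(field("Home-page"))),
        "has_license": int(bool(field("License"))),
        "summary_length": len(field("Summary")),
        "version_is_zero": int(field("Version") in {"0", "0.0", "0.0.0"}),
        # Repeated headers such as Classifier are collected with get_all.
        "num_classifiers": len(msg.get_all("Classifier") or []),
    }


# A made-up PKG-INFO for illustration only.
SAMPLE = (
    "Metadata-Version: 2.1\n"
    "Name: example-pkg\n"
    "Version: 0.0.0\n"
    "Summary: test\n"
    "Home-page: UNKNOWN\n"
    "Classifier: Programming Language :: Python :: 3\n"
)

features = metadata_features(SAMPLE)
```

A feature dictionary like this could be fed to any tabular classifier. How EA4MP combines its metadata model with the fine-tuned BERT model inside the AdaBoost ensemble is described in the paper, not in this sketch.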

Xiaobing Sun

Yangzhou University

China

Xingan Gao

Yangzhou University

Sicong Cao

Yangzhou University

China

Lili Bo

Yangzhou University

Xiaoxue Wu

Yangzhou University

Kaifeng Huang

Tongji University

China

Session Program

Thu 31 Oct
Displayed time zone: Pacific Time (US & Canada)

15:30 - 16:30	Malicious code and package (Research Papers / Industry Showcase) at Gardenia. Chair(s): Curtis Atkisson (UW)

15:30 15m Talk		RMCBench: Benchmarking Large Language Models' Resistance to Malicious Code. Research Papers. Jiachi Chen (Sun Yat-sen University), Qingyuan Zhong (Sun Yat-sen University), Yanlin Wang (Sun Yat-sen University), Kaiwen Ning (Sun Yat-sen University), Yongkun Liu (Sun Yat-sen University), Zenan Xu (Tencent AI Lab), Zhe Zhao (Tencent AI Lab), Ting Chen (University of Electronic Science and Technology of China), Zibin Zheng (Sun Yat-sen University)
15:45 15m Talk		SpiderScan: Practical Detection of Malicious NPM Packages Based on Graph-Based Behavior Modeling and Matching. Research Papers. Yiheng Huang (Fudan University), Ruisi Wang (Fudan University), Wen Zheng (Fudan University), Zhuotong Zhou (Fudan University, China), Susheng Wu (Fudan University), Shulin Ke (Fudan University), Bihuan Chen (Fudan University), Shan Gao (Huawei), Xin Peng (Fudan University)
16:00 15m Talk		1+1>2: Integrating Deep Code Behaviors with Metadata Features for Malicious PyPI Package Detection. Research Papers. Xiaobing Sun (Yangzhou University), Xingan Gao (Yangzhou University), Sicong Cao (Yangzhou University), Lili Bo (Yangzhou University), Xiaoxue Wu (Yangzhou University), Kaifeng Huang (Tongji University)
16:15 15m Talk		Models Are Codes: Towards Measuring Malicious Code Poisoning Attacks on Pre-trained Model Hubs. Industry Showcase. Jian Zhao (Huazhong University of Science and Technology), Shenao Wang (Huazhong University of Science and Technology), Yanjie Zhao (Huazhong University of Science and Technology), Xinyi Hou (Huazhong University of Science and Technology), Kailong Wang (Huazhong University of Science and Technology), Peiming Gao (MYbank, Ant Group), Yuanchao Zhang (MYbank, Ant Group), Chen Wei (MYbank, Ant Group), Haoyu Wang (Huazhong University of Science and Technology)