DexBERT: Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode
Recent advances in Machine Learning (ML) have broadened the scope for automating diverse software engineering tasks through effective representation learning of software artifacts. Traditional methods often rely on manually selected, task-specific features, which can be imprecise and incomplete. In contrast, representation learning techniques, which let the model itself determine the most relevant features, offer a more scalable and generalizable approach. However, in the Android domain, existing models are either limited to coarse-grained, whole-app level tasks, as with apk2vec, or too specific to a single task, as with smali2vec.
Our research contributes to this field by proposing DexBERT, a novel BERT-like model specifically developed for DEX bytecode, the core binary format of Android applications. Inspired by the success of universal language models in natural language processing, DexBERT aims to abstract and encode deep semantic information from bytecode, so that it can be applied to a variety of fine-grained, class-level software engineering tasks. We evaluate DexBERT’s effectiveness in modeling the DEX language and its performance across three distinct tasks: Malicious Code Localization, Defect Prediction, and Component Type Classification. Our results indicate that DexBERT provides substantial accuracy improvements over existing approaches and generalizes across multiple tasks.
Furthermore, DexBERT addresses the challenge of variable application sizes and demonstrates robust performance even with apps of vastly different scales. This adaptability is critical for practical deployment in real-world scenarios where application size can vary greatly.
In summary, DexBERT not only advances the state of the art in Android app analysis but also sets a new standard for the development of fine-grained, task-agnostic models in software engineering. Our contribution is significant, as it enables the development of more versatile and efficient tools for software analysis, reducing the reliance on costly manual feature engineering and repetitive model training.
Tue 29 Oct. Displayed time zone: Pacific Time (US & Canada)
13:30 - 15:00 | Android | Journal-first Papers / Research Papers / Industry Showcase at Magnolia | Chair(s): Ziyao He (University of California, Irvine)
13:30 | 15m | Talk | How Does Code Optimization Impact Third-party Library Detection for Android Applications? | Research Papers | Zifan Xie (Huazhong University of Science and Technology), Ming Wen (Huazhong University of Science and Technology), Tinghan Li (Huazhong University of Science and Technology), Yiding Zhu (Huazhong University of Science and Technology), Qinsheng Hou (Shandong University; Qi An Xin Group Corp.), Hai Jin (Huazhong University of Science and Technology)
13:45 | 15m | Talk | MaskDroid: Robust Android Malware Detection with Masked Graph Representations | Research Papers | Jingnan Zheng (National University of Singapore), Jiahao Liu (National University of Singapore), An Zhang, Jun ZENG (Huawei), Ziqi Yang (Zhejiang University), Zhenkai Liang (National University of Singapore), Tat-Seng Chua (National University of Singapore)
14:00 | 15m | Talk | A Longitudinal Analysis Of Replicas in the Wild Wild Android | Research Papers | Syeda Mashal Abbas Zaidi (University of Waterloo), Shahpar Khan (University of Waterloo), Parjanya Vyas (University of Waterloo), Yousra Aafer (University of Waterloo)
14:15 | 15m | Talk | Android Malware Family Labeling: Perspectives from the Industry | Industry Showcase | Liu Wang (Beijing University of Posts and Telecommunications), Haoyu Wang (Huazhong University of Science and Technology), Tao Zhang (Macau University of Science and Technology), Haitao Xu (Zhejiang University), Guozhu Meng (Institute of Information Engineering, Chinese Academy of Sciences), Peiming Gao (MYbank, Ant Group), Chen Wei (MYbank, Ant Group), Yi Wang
14:30 | 15m | Talk | DexBERT: Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode | Journal-first Papers | Tiezhu Sun (University of Luxembourg), Kevin Allix (Independent Researcher), Kisub Kim (Singapore Management University, Singapore), Xin Zhou (Singapore Management University, Singapore), Dongsun Kim (Korea University), David Lo (Singapore Management University), Tegawendé F. Bissyandé (University of Luxembourg), Jacques Klein (University of Luxembourg)
14:45 | 15m | Talk | Same App, Different Behaviors: Uncovering Device-specific Behaviors in Android Apps | Industry Showcase | Zikan Dong (Beijing University of Posts and Telecommunications), Yanjie Zhao (Huazhong University of Science and Technology), Tianming Liu (Monash University), Chao Wang (University of Southern California), Guosheng Xu (Beijing University of Posts and Telecommunications), Guoai Xu (Harbin Institute of Technology, Shenzhen), Lin Zhang (The National Computer Emergency Response Team/Coordination Center of China (CNCERT/CC)), Haoyu Wang (Huazhong University of Science and Technology)