EMC: A Semantic-Enhanced Malware Classification Method with Robustness and Scalability
Driven by substantial financial incentives, malware continues to evolve, posing persistent and growing threats. Malware classification has long been a vital research area, and numerous classification approaches have been proposed. However, existing methods still face limitations in performance and often lack robustness in complex analysis scenarios involving concept drift. In this paper, we propose a robust and scalable malware classification framework based on program semantic analysis and feature enhancement, and implement a prototype system named \textit{EMC}. Our approach focuses on behavior-oriented semantic understanding of programs: it constructs a more effective feature space and eliminates spurious correlations between features in a fine-grained manner to enhance robustness. \textit{EMC} extracts behavior-oriented binary opcode sequences and employs a BERT-based sliding-window mechanism for semantic understanding and feature space construction. Furthermore, it combines random Fourier features with weighted resampling to remove dependencies between features, and leverages mutual information to purify the features. These enhancements enable the classification model to capture the intrinsic characteristics of malicious programs more accurately, thereby improving both accuracy and robustness. Compared with nine mainstream malware classification methods, \textit{EMC} achieves F1 score improvements ranging from 1.51% to 12.06% under standard conditions, and from 17.36% to 50.35% in scenarios involving concept drift.
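To make two of the ingredients named above concrete, the following is a minimal sketch and not the EMC implementation. It shows (i) how a long opcode sequence might be cut into overlapping windows so that each piece fits a fixed-length encoder such as BERT, and (ii) a plug-in estimate of the mutual information between one discretized feature and the class labels, which could serve as a feature-purification score. The function names, the window and stride sizes, and the toy opcodes are all assumptions made for illustration; the abstract does not specify them. The BERT encoding step, random Fourier features, and weighted resampling are omitted.

```python
import math
from collections import Counter


def sliding_windows(tokens, window=512, stride=256):
    """Split a token sequence into overlapping windows.

    `window` and `stride` are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks


def mutual_information(feature, labels):
    """Plug-in mutual information (in bits) between two discrete sequences.

    This is a simple count-based estimator chosen for illustration;
    the paper may use a different estimator.
    """
    if len(feature) != len(labels):
        raise ValueError("feature and labels must have equal length")
    n = len(feature)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y).
        mi += p_xy * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Toy opcode trace (hypothetical), windowed with a small window for readability.
opcodes = ["push", "mov", "call", "xor", "jmp", "mov", "ret", "push", "call", "ret"]
windows = sliding_windows(opcodes, window=4, stride=2)

# A feature that perfectly tracks the label versus one that is independent of it.
labels = [0, 0, 1, 1]
informative = mutual_information([0, 0, 1, 1], labels)
uninformative = mutual_information([0, 1, 0, 1], labels)
```

In a pipeline of this kind, each window would be encoded separately and the per-window embeddings pooled into one vector per program; features whose mutual information with the label is near zero would be candidates for removal.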