R3-Bench: Reproducible Real-world Reverse Engineering Dataset for Symbol Recovery
Symbol recovery in reverse engineering is crucial for restoring variable and data structure information in compiled binaries. While learning-based methods have shown promise in recovering both semantic information (names and types) and syntactic information (shapes), they require comprehensive datasets in which expressions in binary code are precisely aligned with their source code equivalents. Current techniques for generating such alignments struggle with complex data access patterns, which leaves training data incomplete and in turn hampers model performance and recovery accuracy. We present AST-Align, a novel technique that unifies the alignment of variables and struct access expressions across multiple architectures (x86 and ARM) and languages (C/C++/Rust). AST-Align significantly increases the amount of generated ground truth, capturing four times more struct fields than previous methods. Using this algorithm, we develop R3-Bench, a metadata-rich, extensible dataset with explicit project inclusion criteria and a reproducible processing pipeline, comprising over 10 million functions across multiple architectures. Our evaluation establishes baseline performance by testing approaches ranging from n-gram models to Large Language Models (LLMs). The results show that while general-purpose LLMs initially perform poorly, their effectiveness improves dramatically when they are given proper demonstrations. R3-Bench provides a robust foundation for assessing model capabilities and serves as a valuable reference for future symbol recovery research.