RepoReasoner: Evaluating Repository-Level Code Reasoning Ability of Long-Context Language Models
Recent advances in large language models (LLMs) have significantly improved their ability to handle complex software engineering tasks at the repository level. However, existing benchmarks for evaluating code reasoning ability operate almost exclusively at the function level, where all necessary information is provided within a single, localized context. This approach fails to capture the complexity of real-world software development, where developers must reason about dependencies and logic scattered across entire repositories. The result is a critical gap between current evaluation methodologies and real-world challenges. To bridge this gap, we introduce \bench, a benchmark designed to evaluate repository-level code reasoning ability. Moving beyond self-contained code snippets, \bench assesses LLMs’ ability to navigate, retrieve, and synthesize information distributed across multiple files through two tasks: (1) \textbf{Output Prediction}, which evaluates fine-grained, stateful reasoning by requiring the model to trace complex, cross-file execution paths and predict a function’s final output, and (2) \textbf{Call Chain Prediction}, which assesses high-level architectural understanding by requiring the model to identify the correct sequence of files involved in an execution from noisy context. We construct our benchmark using a multi-stage pipeline that leverages dynamic analysis of pytest execution traces to capture true, runtime-dependent call chains and employs LLM-based I/O rewriting to create logically equivalent instances that prevent memorization. Our extensive evaluation of seven state-of-the-art LLMs reveals that repository-level reasoning remains a fundamental challenge. Even with perfect context, the best-performing model (DeepSeek-R1) achieves only 69.1% Pass@1 on Output Prediction, indicating that complex reasoning, not just retrieval, is the primary bottleneck.
LLMs also struggle with high-level architectural understanding, exhibiting high precision but low recall (F1 < 0.51) in call chain reconstruction. Counterintuitively, simply increasing context length can be counterproductive, as the added noise sometimes outweighs the benefits of additional information. All models show significant performance drops on our I/O-rewritten data, confirming partial reliance on memorization rather than genuine reasoning. These findings highlight the need for future research to focus on enhancing architectural comprehension and robust reasoning capabilities in repository-level code analysis.
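To make the precision/recall pattern reported for Call Chain Prediction concrete, the following is a minimal sketch of how such scores could be computed. It assumes a set-based comparison between predicted and reference files, which ignores ordering and may differ from the benchmark's actual scoring. The function name `call_chain_scores` and the file paths are hypothetical.

```python
def call_chain_scores(predicted, reference):
    """Return (precision, recall, F1) for a predicted call chain.

    Assumption: files are compared as unordered sets. The benchmark
    itself may score ordered sequences differently.
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)

    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0

    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical example: every predicted file is correct, but half of the
# reference chain is missed, giving high precision and low recall.
predicted = ["app/main.py", "app/service.py"]
reference = ["app/main.py", "app/service.py", "app/db.py", "app/utils.py"]
precision, recall, f1 = call_chain_scores(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In this illustrative case the model names only files it is sure of, so precision is perfect while recall is halved. This is the kind of conservative behavior that can yield a modest F1 despite few incorrect predictions.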