ICPC 2025
Sun 27 - Mon 28 April 2025 Ottawa, Ontario, Canada
co-located with ICSE 2025

Test automation plays a crucial role in ensuring software security. It heavily relies on formal specifications to validate the correctness of the system behavior. However, the main approach to defining these formal specifications is through manual analysis of software documents, which requires a significant amount of engineering effort from experienced researchers and engineers. Meanwhile, system update further increases the human labor cost to maintain a corresponding formal specification, making the manual analysis approach a time-consuming and error-prone task.

Recent advances in Large Language Models (LLMs) have demonstrated promising capabilities in natural language understanding. Yet, the feasibility of using LLMs to automate the extraction of formal specifications from documents remains unexplored. We conduct an empirical study by constructing a comprehensive dataset comprising 603 specifications from 37 documents across three representative open-source software. We then evaluate the most recent LLMs’ capabilities in extracting formal specifications from documents in an end-to-end fashion, including GPT-4o, Claude, and Llama.

Our study demonstrates the application of LLMs in formal specification extraction tasks while identifying two major limitations: specification oversimplification and specification fabrication. We attribute these deficiencies to the LLMs’ inherent limitations in processing and expressive capabilities, as well as their tendency to fabricate fictional information. Inspired by human cognitive processes, we propose a novel two-stage method, annotation-then-conversion, to address these challenges. Our method decomposes the task into sentence annotation and temporal logic conversion, reducing the demands on LLMs’ processing and expressive capabilities for each subtask. Furthermore, by generating verifiable sentence-specification pairs, our method enables effective fact-checking, thereby mitigating hallucination effects. Our method demonstrates significant improvements over the end-to-end method, with a 29.2% increase in the number of correctly extracted specifications and a 14.0% improvement in average accuracy. In particular, our best-performing LLM achieves an accuracy of 71.6%.