
As Large Language Models (LLMs) continue to advance in their ability to process natural and programming languages, they show promise for complex translation tasks in domains with strict compliance requirements. However, ensuring consistency in legally critical domains remains challenging due to inherent limitations such as natural-language ambiguity and LLMs' tendency to hallucinate. This paper explores an \emph{agentic approach} that leverages LLMs for legal-critical software development. We use U.S. \emph{federal tax preparation software} as a representative case study, where natural-language tax code must be translated precisely into executable logic with high fidelity to regulatory requirements.
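To make the translation task concrete, consider a minimal sketch of how a single natural-language provision might be rendered as executable logic. This example is ours, not drawn from the paper's artifact; the filing statuses and dollar amounts are simplified for exposition.

\begin{verbatim}
# Hypothetical rendering of a provision such as: "single filers may claim
# a standard deduction of $13,850; married couples filing jointly, $27,700"
# (2023-style figures, used here only for illustration).
STANDARD_DEDUCTION = {"single": 13_850, "married_joint": 27_700}

def taxable_income(gross_income: float, filing_status: str) -> float:
    """Gross income less the standard deduction, floored at zero."""
    return max(0.0, gross_income - STANDARD_DEDUCTION[filing_status])
\end{verbatim}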

A fundamental challenge in developing legally critical software from specifications is test-case generation, which suffers from the \emph{oracle problem}: determining the correct output for a given scenario often requires interpreting legal statutes through a dialogic process, drawing on legal experts or independent reviewers. Prior research has proposed \emph{metamorphic testing} as a potential solution, evaluating equivalence across similarly situated individuals. This paper adopts and extends that approach by introducing LLM agents specialized in generating metamorphic test cases. A key innovation of our work is a higher-order generalization of metamorphic tests, motivated by our case study of tax preparation software, in which system outputs are compared across controlled shifts applied to similar individuals. Because manually deriving such higher-order relations is tedious and error-prone, our agentic paradigm is well suited to automating test-case generation.
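To illustrate the distinction, the sketch below contrasts a first-order metamorphic relation with a higher-order one. The relation definitions are our own illustrative reading of the idea, not the paper's exact formulation; the function and field names are hypothetical, and the higher-order relation assumes a purely progressive tax schedule for the sake of exposition.

\begin{verbatim}
# tax_fn maps a taxpayer record to the tax owed (hypothetical interface).

def first_order_relation(tax_fn, person, delta_income):
    """Monotonicity: raising income alone must not lower the tax owed."""
    richer = {**person, "income": person["income"] + delta_income}
    return tax_fn(richer) >= tax_fn(person)

def higher_order_relation(tax_fn, person_a, person_b, delta_income):
    """Comparative shift: under a purely progressive schedule (illustrative
    assumption), the same raise costs the lower earner no more additional
    tax than it costs the higher earner."""
    low, high = sorted((person_a, person_b), key=lambda p: p["income"])
    def shift(person):
        perturbed = {**person, "income": person["income"] + delta_income}
        return tax_fn(perturbed) - tax_fn(person)
    return shift(low) <= shift(high)
\end{verbatim}

The first relation constrains a single output; the higher-order relation constrains how outputs \emph{shift} across a pair of similar individuals under the same perturbation.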

We design and implement cohorts of LLM-based agents, each simulating a role found in real-world software development teams that handle legal documents. Our framework includes a \emph{metamorphic testing agent} that supplies counterexamples as the tax code is translated into executable software logic. Our findings indicate that our agentic approach, built on smaller language models (e.g., \emph{GPT-4o-mini}), outperforms frontier models (e.g., \emph{GPT-4o} and \emph{Claude-3.5}) on complex tax code generation scenarios, achieving a worst-case pass rate of 45% versus 9%–15%. Furthermore, our evaluations reveal that incorporating higher-order metamorphic testing improves the pass rate in the most challenging scenarios by up to 50%. Our results make a compelling case for agentic LLM-driven methodologies that generate robust and trustworthy legal-critical software from natural language specifications.
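As a rough sketch of how such a cohort might interact, the loop below pairs a developer agent with a metamorphic testing agent. The \texttt{query\_llm} and \texttt{compile\_to\_function} helpers, the role names, and the loop structure are hypothetical stand-ins for exposition, not the framework's actual API.

\begin{verbatim}
import random

def find_counterexample(tax_fn, relation, trials=1000):
    # Core of a metamorphic testing agent: sample taxpayer records and
    # report the first input on which the relation fails (None otherwise).
    for _ in range(trials):
        person = {"income": random.uniform(0, 200_000), "status": "single"}
        delta = random.uniform(0, 10_000)
        if not relation(tax_fn, person, delta):
            return person, delta
    return None

def develop_tax_module(spec_text, relation, max_rounds=5):
    # A developer agent drafts code; the testing agent pushes back with
    # counterexamples until the relation holds on all sampled inputs.
    code = query_llm("developer", "Implement this tax rule:\n" + spec_text)
    for _ in range(max_rounds):
        tax_fn = compile_to_function(code)   # hypothetical sandboxed eval
        failure = find_counterexample(tax_fn, relation)
        if failure is None:
            return code
        code = query_llm("developer",
                         f"Revise: a metamorphic relation fails on input "
                         f"{failure!r}.\n{code}")
    return code
\end{verbatim}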