
As Large Language Models (LLMs) continue to advance in their ability to process natural and programming languages, they show promise for complex translation tasks in domains with strict compliance requirements. However, ensuring consistency in legally critical domains remains challenging due to inherent limitations such as natural-language ambiguity and LLMs' tendency to hallucinate. This paper explores an \emph{agentic approach} that leverages LLMs for legal-critical software development. We use U.S. \emph{federal tax preparation software} as a representative case study, where natural-language tax code must be translated precisely into executable logic with high fidelity to regulatory requirements.
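To make the translation task concrete, consider a minimal sketch of how a single natural-language provision might be rendered as executable logic. This example is ours, not drawn from the paper's artifact; the filing statuses and dollar amounts are simplified for exposition.

\begin{verbatim}
# Hypothetical rendering of a provision such as: "single filers may claim
# a standard deduction of $13,850; married couples filing jointly, $27,700"
# (2023-style figures, used here only for illustration).
STANDARD_DEDUCTION = {"single": 13_850, "married_joint": 27_700}

def taxable_income(gross_income: float, filing_status: str) -> float:
    """Gross income less the standard deduction, floored at zero."""
    return max(0.0, gross_income - STANDARD_DEDUCTION[filing_status])
\end{verbatim}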

A fundamental challenge in developing legally critical software from specifications is test-case generation, which suffers from the \emph{oracle problem}: determining the correct output for a given scenario often requires interpreting legal statutes through a dialogic process, drawing on legal experts or independent reviewers. Prior research has proposed \emph{metamorphic testing} as a potential solution, evaluating equivalence across similarly situated individuals. This paper adopts and extends that approach by introducing LLM agents specialized in generating metamorphic test cases. A key innovation of our work is a higher-order generalization of metamorphic tests, motivated by our case study of tax preparation software, in which system outputs are compared across controlled shifts applied to similar individuals. Because manually deriving such higher-order relations is tedious and error-prone, our agentic paradigm is well suited to automating test-case generation.
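To illustrate the distinction, the sketch below contrasts a first-order metamorphic relation with a higher-order one. The relation definitions are our own illustrative reading of the idea, not the paper's exact formulation; the function and field names are hypothetical, and the higher-order relation assumes a purely progressive tax schedule for the sake of exposition.

\begin{verbatim}
# tax_fn maps a taxpayer record to the tax owed (hypothetical interface).

def first_order_relation(tax_fn, person, delta_income):
    """Monotonicity: raising income alone must not lower the tax owed."""
    richer = {**person, "income": person["income"] + delta_income}
    return tax_fn(richer) >= tax_fn(person)

def higher_order_relation(tax_fn, person_a, person_b, delta_income):
    """Comparative shift: under a purely progressive schedule (illustrative
    assumption), the same raise costs the lower earner no more additional
    tax than it costs the higher earner."""
    low, high = sorted((person_a, person_b), key=lambda p: p["income"])
    def shift(person):
        perturbed = {**person, "income": person["income"] + delta_income}
        return tax_fn(perturbed) - tax_fn(person)
    return shift(low) <= shift(high)
\end{verbatim}

The first relation constrains a single output; the higher-order relation constrains how outputs \emph{shift} across a pair of similar individuals under the same perturbation.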

We design and implement cohorts of LLM-based agents, each simulating a role found in real-world software development teams that handle legal documents. Our framework includes a \emph{metamorphic testing agent} that supplies counterexamples as the tax code is translated into executable software logic. Our findings indicate that our agentic approach, built on smaller language models (e.g., \emph{GPT-4o-mini}), outperforms frontier models (e.g., \emph{GPT-4o} and \emph{Claude-3.5}) on complex tax code generation scenarios, achieving a worst-case pass rate of 45% versus 9%–15%. Furthermore, our evaluations reveal that incorporating higher-order metamorphic testing improves the pass rate in the most challenging scenarios by up to 50%. Our results make a compelling case for agentic LLM-driven methodologies that generate robust and trustworthy legal-critical software from natural language specifications.
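As a rough sketch of how such a cohort might interact, the loop below pairs a developer agent with a metamorphic testing agent. The \texttt{query\_llm} and \texttt{compile\_to\_function} helpers, the role names, and the loop structure are hypothetical stand-ins for exposition, not the framework's actual API.

\begin{verbatim}
import random

def find_counterexample(tax_fn, relation, trials=1000):
    # Core of a metamorphic testing agent: sample taxpayer records and
    # report the first input on which the relation fails (None otherwise).
    for _ in range(trials):
        person = {"income": random.uniform(0, 200_000), "status": "single"}
        delta = random.uniform(0, 10_000)
        if not relation(tax_fn, person, delta):
            return person, delta
    return None

def develop_tax_module(spec_text, relation, max_rounds=5):
    # A developer agent drafts code; the testing agent pushes back with
    # counterexamples until the relation holds on all sampled inputs.
    code = query_llm("developer", "Implement this tax rule:\n" + spec_text)
    for _ in range(max_rounds):
        tax_fn = compile_to_function(code)   # hypothetical sandboxed eval
        failure = find_counterexample(tax_fn, relation)
        if failure is None:
            return code
        code = query_llm("developer",
                         f"Revise: a metamorphic relation fails on input "
                         f"{failure!r}.\n{code}")
    return code
\end{verbatim}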