Exploring LLMs for Stakeholder-Specific Insight Generation from Software Contracts
Context: Software contracts are legally binding agreements that outline the terms and conditions governing the development, licensing, use, or distribution of software and related services. A clear understanding of these terms and conditions ensures compliance between parties, aligns expectations, and helps developers navigate scope, timelines, and responsibilities, all of which are crucial for maintaining the quality of the software being developed. However, the intricate nature and length of contractual clauses often impede comprehension, thereby reducing their readability. While clause summarization and/or simplification may seem like a viable solution, a single contractual clause often outlines action items for multiple stakeholders across various departments within an organization. As a result, a generic summary may not adequately capture the specific responsibilities pertinent to each stakeholder mentioned in a contractual clause. Given the proven effectiveness of large language models (LLMs) in processing and analyzing complex text, this study explores their potential for generating stakeholder-specific insights from contractual clauses.
Aim: Building upon this context, we empirically evaluate various state-of-the-art LLMs to determine their effectiveness in generating stakeholder-specific insights from complex contractual clauses. We also examine the efficacy of supplying contextual information to generate these insights, as opposed to relying on generic summary generation. To the best of our knowledge, no prior research in the software engineering paradigm has explored context-aware summarization that produces stakeholder-specific insights from complex text such as software contracts.
Method: We investigate zero-shot prompting, few-shot prompting, and fine-tuning across various state-of-the-art LLMs, comparing the best results obtained from each approach across several models, including T5, Llama 3.1 and 3.2, PEGASUS, BART, Mistral, Gemma 3, and Qwen 2.5. Enterprise-hosted models such as OpenAI’s ChatGPT and Anthropic’s Claude fall outside the scope of this study, as our focus is on deployable models that can be integrated with proprietary datasets while complying with the data privacy policies of the parties and organizations involved. Accordingly, we restrict our evaluation to open-source models that offer performance comparable to commercial alternatives. We conducted our experiments on a proprietary contractual dataset comprising 4,000 clauses. We validated the generated results using quantitative metrics such as ROUGE, METEOR, and BLEU scores, as well as human-evaluation metrics such as fluency, coherence, informativeness, and relevance, to ensure the quality of the generated insights.
Results and Conclusions: Based on both quantitative and qualitative metric scores, we identified fine-tuning as the most reliable and effective technique for generating stakeholder-specific insights, achieving improvements in the range of 150–200% over the other techniques. Among the evaluated models, the fine-tuned Llama 3.2 model emerged as the best performer, outperforming the other models by (a) scoring over 0.9 on the quantitative scale, (b) consistently being rated ‘High’ on the quality index for all four qualitative metrics, and (c) being among the fastest to generate insights (less than one second on average).
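For illustration, the minimal sketch below shows how quantitative scores of the kind reported above (ROUGE, METEOR, and BLEU) can be computed for a generated insight against a human-written reference using the Hugging Face evaluate library; the prediction and reference texts are hypothetical placeholders and are not drawn from the proprietary dataset used in this study.

    # Minimal sketch: scoring a generated stakeholder-specific insight against a
    # human-written reference with ROUGE, METEOR, and BLEU (Hugging Face `evaluate`).
    # The texts below are illustrative placeholders, not items from the study's dataset.
    import evaluate

    rouge = evaluate.load("rouge")
    meteor = evaluate.load("meteor")
    bleu = evaluate.load("bleu")

    prediction = ["The vendor's legal team must renew the software license 30 days before expiry."]
    reference = ["Legal must initiate license renewal at least 30 days prior to expiration."]

    print(rouge.compute(predictions=prediction, references=reference))   # ROUGE-1/2/L F-scores
    print(meteor.compute(predictions=prediction, references=reference))  # METEOR score
    print(bleu.compute(predictions=prediction, references=[reference]))  # BLEU score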
To demonstrate the practical applicability of our approach, we integrated the fine-tuned Llama 3.2 model into the Software Contracts Governance System (SCGS) of a major IT vendor organization.
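To give a concrete sense of how a locally deployable, fine-tuned model of this kind can be invoked for insight generation, the sketch below uses the Hugging Face transformers text-generation pipeline; the checkpoint path, prompt template, stakeholder name, and clause text are assumptions for illustration only and do not reproduce the actual SCGS integration.

    # Illustrative sketch only: generating a stakeholder-specific insight from a clause
    # with a locally hosted, fine-tuned Llama 3.2 checkpoint via Hugging Face transformers.
    # The checkpoint path, prompt template, and clause are hypothetical placeholders.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="./llama-3.2-3b-contracts-finetuned",  # hypothetical local checkpoint
        device_map="auto",
    )

    clause = (
        "The Supplier shall remediate any critical security vulnerability within 72 hours "
        "of notification and shall report the remediation to the Client's compliance office."
    )
    prompt = (
        "Extract the action items for the stakeholder 'Security Engineering' "
        f"from the following contractual clause:\n{clause}\nInsight:"
    )

    result = generator(prompt, max_new_tokens=128, do_sample=False)
    print(result[0]["generated_text"])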