ESEIW 2025
Mon 29 September - Fri 3 October 2025

This program is tentative and subject to change.

Context: Software contracts are legally binding agreements that outline the terms and conditions governing the development, licensing, use, or distribution of software and related services. A clear understanding of these terms and conditions ensures compliance between parties, aligns expectations, and helps developers navigate scope, timelines, and responsibilities, all of which are crucial for maintaining the quality of the software being developed. However, the intricate nature and length of contractual clauses often impede comprehension, thereby reducing their readability. While clause summarization and/or simplification may seem like a viable solution, a single contractual clause often outlines action items for multiple stakeholders across various departments within an organization. As a result, a generic summary may not adequately capture the specific responsibilities pertinent to each stakeholder mentioned in a contractual clause. Given the proven effectiveness of large language models (LLMs) in processing and analyzing complex text, this study explores their potential in generating stakeholder-specific insights from contractual clauses.

Aim: Building upon the context, we empirically evaluate various state-of-the-art LLMs to determine their effectiveness in generating stakeholder-specific insights from complex contractual clauses. We also examine the efficacy of supplying contextual information to generate these insights, as opposed to relying on generic summary generation. To the best of our knowledge, no prior research in the software engineering paradigm has explored context-aware summarization that produces stakeholder-specific insights from complex text such as software contracts.
Method: We investigate the application of zero-shot prompting, few-shot prompting, and fine-tuning techniques across various state-of-the-art LLMs, comparing the best results obtained from each approach across several models including T5, Llama 3.1 & 3.2, PEGASUS, BART, Mistral, Gemma 3, and Qwen 2.5. Enterprise-hosted models such as OpenAI's ChatGPT and Anthropic's Claude fall outside the scope of this study, as our focus is on deployable models that can be integrated with proprietary datasets while complying with the data privacy policies of the involved parties and/or organizations. Accordingly, we restrict our evaluation to open-source models that offer performance comparable to commercial alternatives. We conducted our experiments on a proprietary contractual dataset comprising 4000 clauses. We validated the generated results using quantitative metrics such as ROUGE, METEOR, and BLEU scores, and through human-evaluation metrics such as fluency, coherence, informativeness, and relevance, to ensure the quality of the generated insights.

Results and Conclusions: Based on both quantitative and qualitative metric scores, we identified fine-tuning as the most reliable and effective technique for generating stakeholder-specific insights, achieving improvements in the range of 150-200% over other techniques. Among the evaluated models, the fine-tuned Llama 3.2 model emerged as the most optimal one, as it outperformed the other models by (a) gaining a score of over 0.9 on the quantitative scale, (b) consistently being rated 'High' on the quality index for all four qualitative metrics, and (c) being among the fastest to generate insights (in less than one second on average). To demonstrate the practical applicability of our approach, we integrated the fine-tuned Llama 3.2 model into the Software Contracts Governance System (SCGS) of a major IT vendor organization.
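The prompting and evaluation setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the prompt templates, function names, and the simplified unigram-overlap metric (in the spirit of ROUGE-1, one of the metrics the study reports) are all hypothetical, shown only to make the zero-shot vs. few-shot distinction concrete.

```python
# Illustrative sketch only; all names and templates are hypothetical,
# not taken from the study's implementation.

def zero_shot_prompt(clause: str, stakeholder: str) -> str:
    """Zero-shot: ask the model directly, with no demonstrations."""
    return (
        f"From the contractual clause below, list the action items that "
        f"apply specifically to the {stakeholder} team.\n\nClause: {clause}"
    )

def few_shot_prompt(clause: str, stakeholder: str, examples) -> str:
    """Few-shot: prepend (clause, stakeholder, insight) demonstrations."""
    demos = "\n\n".join(
        f"Clause: {c}\nStakeholder: {s}\nInsight: {i}" for c, s, i in examples
    )
    return demos + "\n\n" + zero_shot_prompt(clause, stakeholder)

def unigram_f1(candidate: str, reference: str) -> float:
    """Rough ROUGE-1-style F1: unigram overlap between a generated
    insight and a reference insight (a simplification of the real metric)."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if not overlap:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Fine-tuning, by contrast, would update the model weights on (clause, stakeholder, insight) pairs rather than packing demonstrations into the prompt, which is consistent with the large gains the abstract reports for that technique.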


Thu 2 Oct

Displayed time zone: Hawaii

13:50 - 14:50
13:50
15m
Talk
Contribution History as a Key Feature in OSS Task Recommendation: an LLM-Based Empirical Study
ESEM - Emerging Results and Vision Track
Md Abdul Hannan Colorado State University, Mohammad Habibullah Rakib Colorado State University, Khondaker Masfiq Reza Colorado State University, Fabio Marcos De Abreu Santos Colorado State University, USA
14:05
15m
Talk
Exploring LLMs for Stakeholder-Specific Insight Generation from Software Contracts
ESEM - Industry, Government, and Community Track
Jyoti Shukla TCS Research, Aditya Kahol TCS Research, Mohit Chaudhary TCS Research, Preethu Rose Anish TCS Research
14:20
15m
Talk
Benchmarking large language models for automated labeling: The case of issue report classification
ESEM - Journal First Track
Giuseppe Colavito University of Bari, Italy, Filippo Lanubile University of Bari, Nicole Novielli University of Bari
Link to publication
14:35
15m
Talk
Secret Breach Detection in Source Code with Large Language Models
ESEM - Technical Track
Md Nafiu Rahman Bangladesh University of Engineering and Technology, Sadif Ahmed Bangladesh University of Engineering and Technology, Zahin Wahab The University of British Columbia, S. M. Sohan Google Inc, Rifat Shahriyar Bangladesh University of Engineering and Technology Dhaka, Bangladesh
Pre-print