Can Large Language Models (LLMs) compete with Human Requirement Reviewers? - Replication of an Inspection Experiment on Requirements Documents
The applications of large language models (LLMs) in software engineering are growing, especially for code: typically for generating code or for detecting and fixing quality problems. As software requirements are commonly written in natural language, it seems promising to leverage the capabilities of LLMs for detecting requirements issues. We replicated an inspection experiment in which computer science students searched for defects in requirements documents using different reading techniques. In our replication, we used GPT-4-Turbo instead of human reviewers. Additionally, we considered GPT-3.5-Turbo, Nous-Hermes-2-Mixtral-8x7B-DPO, and Phi-3-medium-128k-instruct. We focused on single-prompt approaches and refrained from more complex ones (e.g., stepwise or agent-based). We proceeded in two phases. First, we explored the general feasibility of using LLMs for requirements inspection on a practice document and examined different prompts. Second, we applied selected approaches to two requirements documents and compared them to each other and to the human reviewers. The approaches vary in reading technique (ad-hoc, perspective-based, checklist-based), LLM, and the instructions and material provided. We found that the LLMs (a) report only a limited number of defects despite having sufficient tokens available, (b) report defects that vary little across the different prompts, (c) seldom match the sample solution, and (d) provide useful insights only to a small degree.
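To illustrate what a single-prompt, checklist-based inspection setup could look like, the sketch below sends one request to GPT-4-Turbo via the OpenAI Chat Completions API. This is an assumption-laden illustration, not the prompts used in the study: the checklist items, prompt wording, and function name are hypothetical placeholders.

```python
# Minimal sketch (not the authors' actual prompt): a single-prompt,
# checklist-based requirements inspection using GPT-4-Turbo.
# Checklist items and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

CHECKLIST = [
    "Is every requirement unambiguous?",
    "Is every requirement verifiable?",
    "Are any requirements contradictory or incomplete?",
]

def inspect_requirements(document_text: str) -> str:
    """Ask the model, in a single prompt, to list suspected defects."""
    prompt = (
        "You are reviewing a software requirements document.\n"
        "Work through the checklist below and report every defect you find, "
        "one per line, citing the affected requirement or passage.\n\n"
        "Checklist:\n" + "\n".join(f"- {item}" for item in CHECKLIST) +
        "\n\nRequirements document:\n" + document_text
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # reduce variation across repeated runs
    )
    return response.choices[0].message.content

# Example usage:
# print(inspect_requirements(open("requirements_document.txt").read()))
```

A perspective-based or ad-hoc variant would differ only in the prompt text (e.g., instructing the model to read as a tester or user, or omitting the checklist), which is what makes single-prompt variations easy to compare against each other and against human reviewers.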