Empirical evaluation of tools for hairy requirements engineering tasks
Context  A hairy requirements engineering (RE) task involving natural language (NL) documents is (1) a non-algorithmic task to find all relevant answers in a set of documents, that is (2) not inherently difficult for NL-understanding humans on a small scale, but is (3) unmanageable on a large scale. In performing a hairy RE task, humans need more help finding all the relevant answers than they do in recognizing that an answer is irrelevant. Therefore, a hairy RE task demands the assistance of a tool that focuses more on achieving high recall, i.e., finding all relevant answers, than on achieving high precision, i.e., finding only relevant answers. Recall as close to 100% as possible is needed, particularly when the task is applied in the development of a high-dependability system. In this case, a hairy-RE-task tool that falls short of close to 100% recall may even be useless, because to find the missing information, a human has to do the entire task manually anyway. On the other hand, too much imprecision, i.e., too many irrelevant answers in the tool's output, means that manually vetting the tool's output to eliminate the irrelevant answers may be too burdensome. The reality is that all that can realistically be expected and validated is that the recall of a hairy-RE-task tool is higher than the recall of a human doing the task manually.

Objective  Therefore, the evaluation of any hairy-RE-task tool must consider the context in which the tool is used, and it must compare the performance of a human applying the tool to do the task with the performance of a human doing the task entirely manually, in the same context. The context of a hairy-RE-task tool includes the characteristics of the documents being subjected to the task and the purposes of subjecting the documents to the task. Traditionally, however, many a hairy-RE-task tool has been evaluated by considering only (1) how high its precision is, or (2) how high its F-measure is, which weights recall and precision equally, ignoring the context and possibly leading to incorrect, often underestimated, conclusions about how effective the tool is.

Method  To evaluate a hairy-RE-task tool, this article offers an empirical procedure that takes into account not only (1) the performance of the tool, but also (2) the context in which the task is being done, (3) the performance of humans doing the task manually, and (4) the performance of those vetting the tool's output. The empirical procedure uses, (I) on the one hand, (1) the recall and precision of the tool, (2) a contextually weighted F-measure for the tool, (3) a new measure called summarization of the tool, and (4) the time required for vetting the tool's output, and, (II) on the other hand, (1) the recall and precision achievable by, and (2) the time required by, a human doing the task.

Results  The use of the procedure is shown for a variety of different contexts, including that of successive attempts to improve the recall of an imagined hairy-RE-task tool. The procedure is shown to be context dependent, in that the actual next step of the procedure followed in any context depends on the values that have emerged in previous steps.

Conclusion  Any recommendation that a hairy-RE-task tool achieve close to 100% recall comes with caveats and may be required only in specific high-dependability contexts. Appendix C applies parts of this procedure to several hairy-RE-task tools reported in the literature, using data published about them. The surprising finding is that some of the previously evaluated tools are actually better than they were thought to be when they were evaluated using mainly precision or an unweighted F-measure.
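To make the measures named in the abstract concrete, the following minimal sketch computes recall, precision, the unweighted F1, and a recall-weighted F-beta for a hypothetical tool run. The numbers, the use of the standard F-beta formula as a stand-in for the article's contextually weighted F-measure, and the reading of summarization as the fraction of candidate items a vetter does not have to examine are all illustrative assumptions, not the article's exact definitions.

```python
# Illustrative metrics for evaluating a hairy-RE-task tool.
# Assumptions (not from the article): the tool examines N candidate items,
# returns `selected`, and `relevant` is the gold-standard answer set.

def precision_recall(selected, relevant):
    tp = len(selected & relevant)                       # true positives
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

def f_beta(precision, recall, beta=1.0):
    """Weighted F-measure; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def summarization(n_candidates, n_selected):
    """One plausible reading of 'summarization' (assumed here): the fraction of
    the candidate items that a human vetting the tool's output can skip."""
    return 1.0 - n_selected / n_candidates

# Hypothetical numbers for illustration only.
N = 1000                                                # candidate items in the documents
relevant = set(range(50))                               # 50 truly relevant answers
selected = set(range(45)) | set(range(100, 400))        # tool finds 45 of them, plus 300 false positives

p, r = precision_recall(selected, relevant)
print(f"precision = {p:.3f}, recall = {r:.3f}")
print(f"F1 = {f_beta(p, r, beta=1.0):.3f}")             # weights recall and precision equally
print(f"F2 = {f_beta(p, r, beta=2.0):.3f}")             # a context that values recall more
print(f"summarization = {summarization(N, len(selected)):.3f}")
```

In this made-up run, the tool's low precision drags its F1 down, while its 90% recall and the roughly two-thirds reduction in material to vet are exactly what a recall-weighted measure and a summarization figure surface; this is the kind of underestimation the abstract warns about when precision or an unweighted F-measure is used alone.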
Daniel M. Berry got his Ph.D. in Computer Science from Brown University in 1974. He was on the faculty of the Computer Science Department at the University of California, Los Angeles, USA, from 1972 until 1987. He was on the Computer Science Faculty at the Technion, Israel, from 1987 until 1999. From 1990 until 1994, he worked for half of each year at the Software Engineering Institute at Carnegie Mellon University, USA, where he was part of a group that built CMU's Master of Software Engineering program. During the 1998-1999 academic year, he visited the Computer Systems Group at the University of Waterloo in Waterloo, Ontario, Canada. In 1999, Berry moved to what is now the Cheriton School of Computer Science at the University of Waterloo. Between 2008 and 2013, Berry held an Industrial Research Chair in Requirements Engineering sponsored by Scotiabank and the Natural Sciences and Engineering Research Council of Canada (NSERC). Berry's current research interests are software engineering in general, and requirements engineering and electronic publishing in particular.
Wed 17 Aug (displayed time zone: Hobart)
19:20 - 20:10 | Session: Natural Language Processing for RE (RE@Next! Papers / Journal-First) at Dibbler
                Chair(s): Tong Li (Beijing University of Technology)
19:20 (20m) Talk | Back to the Roots: Linking User Stories to Requirements Elicitation Conversations (RE@Next! Papers)
                Tjerk Spijkman (Utrecht University), Fabiano Dalpiaz (Utrecht University), Sjaak Brinkkemper (Utrecht University)
19:40 (30m) Talk | Empirical evaluation of tools for hairy requirements engineering tasks (Journal-First)
                Dan Berry (University of Waterloo)