AthenaLLM: Supporting Experiments with Large Language Models in Software Development
Existing studies on the use of Large Language Models (LLMs) in software development rely on methodologies that limit their scalability and require intensive manual data collection and analysis, for example, due to the use of video data or think-aloud protocols. We propose the use of a specialized tool capable of automatically collecting fine-grained, relevant data during experiments and case studies. It enables researchers to understand, for example, how often participants accept or reject suggestions made by LLMs and what kinds of prompts are more likely to trigger accepted suggestions, even in studies targeting a large number of participants. We implement this idea as a Visual Studio Code plugin named AthenaLLM. It mimics the functionality of GitHub Copilot, integrates seamlessly with OpenAI API models such as GPT-4 and GPT-3.5, and is compatible with other models that expose an OpenAI-compatible API, e.g., Vicuña. It automatically collects data at a fine level of granularity, covering both the interactions of developers with their IDE and the products of those interactions. Thus, the proposed approach also reduces bias that the experimental process itself may introduce, e.g., due to the need for participants to verbalize their thoughts. In this paper, we discuss the limitations of previous studies and how AthenaLLM could enable researchers to go both broader (in terms of the number of participants) and deeper (in terms of the kinds of research questions that can be tackled).
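To make the notion of "OpenAI-compatible API" and fine-grained interaction logging concrete, the following is a minimal TypeScript sketch, not AthenaLLM's actual implementation: it assumes a configurable base URL (so the same request code can target the OpenAI API or a local server hosting a model such as Vicuña) and a hypothetical event record capturing whether each suggestion was accepted or rejected.

```typescript
// Illustrative sketch only: endpoint, model names, and the logging schema
// below are assumptions, not AthenaLLM's real code.

interface SuggestionEvent {
  prompt: string;      // prompt sent to the model
  suggestion: string;  // completion returned by the model
  accepted: boolean;   // whether the developer accepted the suggestion
  timestamp: number;   // Unix time in milliseconds
}

const events: SuggestionEvent[] = [];

// Request a completion from any server exposing the OpenAI chat-completions route.
// Passing a different baseUrl (e.g., a local Vicuña server) is all that changes.
async function requestSuggestion(
  prompt: string,
  apiKey: string,
  baseUrl = "https://api.openai.com/v1",
  model = "gpt-4"
): Promise<string> {
  const response = await fetch(`${baseUrl}/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model,
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content as string;
}

// Record acceptance or rejection of a suggestion, so that prompts can later
// be related to acceptance rates without video data or think-aloud protocols.
function logSuggestion(prompt: string, suggestion: string, accepted: boolean): void {
  events.push({ prompt, suggestion, accepted, timestamp: Date.now() });
}
```

Under these assumptions, the same request path serves both hosted and self-hosted models, and the event log provides the kind of fine-grained, automatically collected data the abstract argues for.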