Improving API Knowledge Comprehensibility: A Context-Dependent Entity Detection and Context Completion Approach using LLM
Extracting API knowledge from Stack Overflow has become a crucial way to assist developers in using APIs. Existing research has primarily focused on extracting relevant API-related knowledge at the sentence level to enhance API documentation.
However, sentence-level extraction can lose crucial context, especially when a sentence contains context-dependent entities (i.e., entities that can only be understood by referring to the surrounding context), which may hinder developers’ understanding. To investigate this issue, we conducted an empirical study of 384 Stack Overflow posts and found that (1) approximately one-third of API functionality sentences contain context-dependent entities, and (2) these entities fall into two categories: Referential Context-Dependent Entities and Local Variable Context-Dependent Entities. In response, we developed a novel method, CEDCC, which detects context-dependent entities by combining an entity filtering strategy, informed by our empirical study, with a large language model (LLM) that constructs coreference chains. It then uses the LLM in a step-by-step manner to complete the context needed to understand these entities. To evaluate CEDCC, we constructed a dataset of 1,023 API knowledge sentences, including 567 context-dependent entities and their required contexts. The results demonstrate that CEDCC is effective at both context-dependent entity detection and context completion, achieving an F1-score of 0.865 and a BERTScore of 0.373, respectively, and significantly surpassing the baseline methods. Human evaluations further confirmed that CEDCC effectively improves the comprehensibility of API knowledge sentences.
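To make the detection idea concrete, the following minimal sketch shows one plausible way coreference chains could be used to flag context-dependent entities. It is an illustration under stated assumptions, not CEDCC's implementation: the `Mention` type, the `context_dependent_entities` function, the rule "a mention is context-dependent if its chain has an earlier mention outside the target sentence", and the example chains are all hypothetical, and the chains are written by hand here rather than produced by an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One mention of an entity (hypothetical type for this sketch)."""
    text: str
    sentence_index: int  # position of the containing sentence in the post


def context_dependent_entities(chains, target_index):
    """Return mentions in the target sentence whose coreference chain
    also contains a mention in an earlier sentence.

    Assumption: such a mention needs the earlier context to be understood.
    """
    found = []
    for chain in chains:
        has_earlier = any(m.sentence_index < target_index for m in chain)
        if not has_earlier:
            continue
        found.extend(m.text for m in chain if m.sentence_index == target_index)
    return found


# Hand-written chains standing in for LLM output (assumed data).
# Sentence 0: "Use the HashMap.put method to add entries."
# Sentence 1: "It returns null if the key was absent."
chains = [
    [Mention("the HashMap.put method", 0), Mention("It", 1)],
    [Mention("null", 1)],
]

print(context_dependent_entities(chains, target_index=1))
```

In this toy example, "It" in sentence 1 is flagged because its chain reaches back to sentence 0, whereas "null" is not, since its chain stays within the target sentence. A context completion step would then replace or annotate "It" with its antecedent.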