Code Summarization Beyond Function Level (Virtual Talk)
Code summarization, a critical task in natural language processing and software engineering, aims to generate concise natural-language descriptions of source code. Recent advances have improved the quality of these summaries, enhancing code readability and maintainability. This study investigates the effectiveness of code summarization models beyond the function level, exploring the impact of class- and repository-level context on summary quality. It involves revising benchmarks for evaluating models at the class and repository levels, assessing baseline models, and evaluating LLMs with in-context learning to determine whether additional context improves summary quality. The findings reveal that the fine-tuned state-of-the-art \textit{CodeT5+ base} model excels at code summarization, and that structured context, such as few-shot examples and code chunks retrieved via RAG, significantly enhances LLM performance on this task. Notably, the \textit{Deepseek Coder 1.3B} and \textit{Starcoder2 15B} models demonstrate substantial improvements in BLEURT, METEOR, and BLEU$_4$ at both the class and repository levels. Repository-level summarization shows promising potential but requires significant computational resources, and it benefits from the inclusion of structured context. We also employ the novel SIDE metric for code summarization in our evaluation. This work contributes to refining strategies for prompt engineering, few-shot learning, and RAG, and addresses gaps in benchmarks for code summarization beyond the function level. Finally, we publish all study details, code, datasets, and evaluation results in a GitHub repository available at URL.
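To make the structured-context setup concrete, the sketch below is a minimal illustration, not the study's actual pipeline: all names are hypothetical, and the lexical retriever is a toy stand-in for a real RAG retriever. It shows how few-shot examples and retrieved repository code chunks can be assembled into a single summarization prompt for an LLM.

\begin{verbatim}
from difflib import SequenceMatcher

def retrieve_chunks(query_code, repo_chunks, k=2):
    """Rank repository code chunks by lexical similarity
    to the query (toy retriever, stands in for RAG)."""
    ranked = sorted(repo_chunks,
                    key=lambda c: SequenceMatcher(None, query_code, c).ratio(),
                    reverse=True)
    return ranked[:k]

def build_prompt(query_code, few_shot, retrieved):
    """Assemble a few-shot prompt augmented with retrieved
    repository context."""
    parts = ["Summarize the following code in one sentence.\n"]
    for code, summary in few_shot:      # few-shot demonstrations
        parts.append(f"Code:\n{code}\nSummary: {summary}\n")
    for chunk in retrieved:             # structured repository context
        parts.append(f"Related repository code:\n{chunk}\n")
    parts.append(f"Code:\n{query_code}\nSummary:")
    return "\n".join(parts)

# Example usage with placeholder snippets.
shots = [("def add(a, b):\n    return a + b", "Adds two numbers.")]
chunks = ["def dot(u, v):\n    return sum(a * b for a, b in zip(u, v))"]
query = "def norm(v):\n    return sum(x * x for x in v) ** 0.5"
print(build_prompt(query, shots, retrieve_chunks(query, chunks)))
\end{verbatim}

The resulting prompt interleaves demonstrations with repository context before the target code, which is one common way to supply the class- and repository-level signals discussed above.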