EditSum: A Retrieve-and-Edit Framework for Source Code Summarization
Existing studies show that code summaries help developers understand and maintain source code. Unfortunately, these summaries are often mismatched, missing, or outdated in software projects. Code summarization aims to automatically generate brief and accurate natural language descriptions of source code. According to Gros et al., code summaries are highly structured and contain many repetitive patterns; for example, they often begin with patterns like “return true if…” and “create a new…”. The promising results obtained by previous approaches also indicate the existence of these patternized words. Besides the patternized words, a code summary also contains important keywords, which are the key to reflecting the functionality of the code. However, state-of-the-art code summarization approaches perform poorly at predicting these keywords, which causes the generated summaries to lose informativeness. To alleviate this problem, this paper proposes a novel retrieve-and-edit approach named EditSum for code summarization. Specifically, EditSum first retrieves a similar code snippet from a pre-defined corpus and treats its summary as a prototype summary from which to learn the pattern. Then, EditSum automatically edits the prototype to combine the pattern in the prototype with the semantic information of the input code. Our motivation is that the retrieved prototype provides a good starting point for post-generation because the summaries of similar code snippets often share the same pattern. The post-editing process further reuses the patternized words in the prototype and generates keywords based on the semantic information of the code. We conduct experiments on a large-scale Java corpus containing about 2M samples, and the experimental results demonstrate that EditSum outperforms the state-of-the-art approaches by a substantial margin. The human evaluation also shows that the summaries generated by EditSum are more informative and useful.
We also verify that EditSum performs well at predicting both the patternized words and the keywords. The code and data will be open-sourced.
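
To make the retrieval step concrete, the following is a minimal sketch of how a prototype summary could be selected from a corpus of (code, summary) pairs. It is an illustration under stated assumptions, not the EditSum implementation: the token-level Jaccard similarity, the function names (`tokenize`, `jaccard`, `retrieve_prototype`), and the two-entry toy corpus are all hypothetical choices made here, and the abstract does not specify the retrieval method. The learned editing step, which rewrites the prototype using the semantics of the input code, is not shown.

```python
import re


def tokenize(code):
    """Return the set of lowercase word tokens in a code snippet.

    Identifiers are split on camelCase boundaries, so "isEmpty"
    yields {"is", "empty"}. Digits and punctuation are ignored.
    """
    tokens = set()
    for word in re.findall(r"[A-Za-z]+", code):
        # Split each alphabetic run into camelCase parts or acronyms.
        for part in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word):
            tokens.add(part.lower())
    return tokens


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def retrieve_prototype(query_code, corpus):
    """Return the summary of the corpus snippet most similar to query_code.

    corpus is a list of (code, summary) pairs. Jaccard similarity is an
    assumption of this sketch, standing in for whatever retrieval
    method the actual system uses.
    """
    query_tokens = tokenize(query_code)
    best_code, best_summary = max(
        corpus, key=lambda pair: jaccard(query_tokens, tokenize(pair[0]))
    )
    return best_summary


# Hypothetical toy corpus, for illustration only.
corpus = [
    ("public boolean isEmpty() { return size == 0; }",
     "return true if the list is empty"),
    ("public Node createNode(String name) { return new Node(name); }",
     "create a new node with the given name"),
]

query = "public boolean isFull() { return size == capacity; }"
prototype = retrieve_prototype(query, corpus)
print(prototype)
```

In this toy case the retrieved prototype supplies the pattern “return true if…”, while the keyword describing what is actually checked (fullness rather than emptiness) would still have to be produced by the editing step from the input code.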