Cell2Doc: ML Pipeline for Generating Documentation in Computational Notebooks (ASE 2023 - Research Papers)

Who

Tamal Mondal, Scott Barnett, Akash Lal, Jyothi Vedurada

Track

ASE 2023 Research Papers

Time Zone

The program is currently displayed in (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna.

Use conference time zone: (GMT+02:00) Amsterdam, Berlin, Bern, Rome, Stockholm, ViennaSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Tue 12 Sep 2023 16:06 - 16:18 at Room C - Testing AI Systems 3 Chair(s): Mike Papadakis

Abstract

Computational notebooks have become the go-to way for solving data-science problems. While they are designed to combine code and documentation, prior work shows that documentation is largely ignored by the developers because of the manual effort. Automated documentation generation can help, but existing techniques fail to capture algorithmic details and developers often end up editing the generated text to provide more explanation and sub-steps. This paper proposes a novel machine-learning pipeline, Cell2Doc, for code cell documentation in Python data science notebooks. Our approach works by identifying different logical contexts within a code cell, generating documentation for them separately, and finally combining them to arrive at the documentation for the entire code cell. Cell2Doc takes advantage of the capabilities of existing pre-trained language models and improves their efficiency for code cell documentation. We also provide a new benchmark dataset for this task, along with a data- preprocessing pipeline that can be used to create new datasets. We also investigate an appropriate input representation for this task. Our automated evaluation suggests that our best input representation improves the pre-trained model’s performance by 2.5x on average. Further, Cell2Doc achieves 1.33x improvement during human evaluation in terms of correctness, informativeness, and readability against the corresponding standalone pre-trained model.

Link to Preprint

https://github.com/jyothivedurada/jyothivedurada.github.io/blob/main/papers/Cell2Doc.pdf

Tamal Mondal

IIT Hyderabad

India

Scott Barnett

Deakin University

Akash Lal

Microsoft Research

Jyothi Vedurada