Mining the Characteristics of Jupyter Notebooks in Data Science Projects (MSR 2023 - Registered Reports)

Who

Morakot Choetkiertikul, Apirak Hoonlor, Chaiyong Ragkhitwetsagul, Siripen Pongpaichet, Thanwadee Sunetnanta, Tasha Settewong, Raula Gaikovina Kula

Track

MSR 2023 Registered Reports

Time Zone

The program is currently displayed in (GMT+11:00) Hobart.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Tue 16 May 2023 12:08 - 12:14 at Meeting Room 109 - Development Tools & Practices II Chair(s): Banani Roy

Abstract

Many industries now have exceptional demand for data science skills such as data analysis, data mining, and machine learning. The computational notebook (e.g., Jupyter notebook) is a well-known data science tool adopted in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practice, and collaboration. While tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter notebooks receive a high number of votes from the community. A high-voted notebook is considered to be well documented, easy to understand, and to apply the best data science and software engineering practices. In this research, we aim to understand the characteristics of high-voted Jupyter notebooks on Kaggle and of popular Jupyter notebooks for data science projects on GitHub. We plan to mine and analyze the Jupyter notebooks on both platforms. We will perform exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon completion of this research, the discovered insights can be applied as training guidelines for aspiring data scientists and machine learning practitioners looking to improve their performance, from a novice-ranked Jupyter notebook on Kaggle to a deployable project on GitHub.
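As a rough illustration of what "understanding the overall structure" of a notebook could involve, the sketch below computes a few simple structural features from a notebook's JSON. It is not the authors' method: the function name `notebook_features` and the feature names are hypothetical, and the only assumption about the input is the standard notebook JSON layout (a top-level `cells` list whose entries have a `cell_type` and a `source`).

```python
import json


def notebook_features(nb_json: str) -> dict:
    """Compute simple structural features from a notebook's JSON text.

    Hypothetical helper for illustration; the features are assumptions,
    not the ones used in the study.
    """
    nb = json.loads(nb_json)
    cells = nb.get("cells", [])

    def source_text(cell) -> str:
        # "source" may be stored as a single string or as a list of strings.
        src = cell.get("source", "")
        return "".join(src) if isinstance(src, list) else src

    code = [source_text(c) for c in cells if c.get("cell_type") == "code"]
    markdown = [source_text(c) for c in cells if c.get("cell_type") == "markdown"]

    return {
        "n_code_cells": len(code),
        "n_markdown_cells": len(markdown),
        "code_lines": sum(len(s.splitlines()) for s in code),
        # Share of cells that are documentation rather than code.
        "markdown_ratio": len(markdown) / len(cells) if cells else 0.0,
    }


# A tiny hand-written notebook, used only to exercise the function.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": "import pandas as pd\ndf = pd.DataFrame()"},
    ],
})

features = notebook_features(example)
print(features)
```

Features of this kind, computed per notebook and paired with vote counts, are one plausible input to the exploratory and feature importance analyses the abstract mentions.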

Morakot Choetkiertikul

Mahidol University

Thailand

Apirak Hoonlor

Mahidol University

Thailand

Chaiyong Ragkhitwetsagul

Mahidol University

Thailand

Siripen Pongpaichet

Mahidol University

Thailand

Thanwadee Sunetnanta

Mahidol University

Thailand

Tasha Settewong

Mahidol University

Thailand

Raula Gaikovina Kula

Nara Institute of Science and Technology

Japan

Session Program

Tue 16 May
Displayed time zone: Hobart

11:50 - 12:35	Development Tools & Practices II (Data and Tool Showcase Track / Industry Track / Technical Papers / Registered Reports) at Meeting Room 109. Chair(s): Banani Roy (University of Saskatchewan)

11:50 12m Talk	Automating Arduino Programming: From Hardware Setups to Sample Source Code Generation. Technical Papers. Imam Nur Bani Yusuf (Singapore Management University, Singapore), Diyanah Binte Abdul Jamal (Singapore Management University), Lingxiao Jiang (Singapore Management University). Pre-print available.
12:02 6m Talk	A Dataset of Bot and Human Activities in GitHub. Data and Tool Showcase Track. Natarajan Chidambaram (University of Mons), Alexandre Decan (University of Mons; F.R.S.-FNRS), Tom Mens (University of Mons).
12:08 6m Talk	Mining the Characteristics of Jupyter Notebooks in Data Science Projects. Registered Reports. Morakot Choetkiertikul (Mahidol University, Thailand), Apirak Hoonlor (Mahidol University), Chaiyong Ragkhitwetsagul (Mahidol University, Thailand), Siripen Pongpaichet (Mahidol University), Thanwadee Sunetnanta (Mahidol University), Tasha Settewong (Mahidol University), Raula Gaikovina Kula (Nara Institute of Science and Technology).
12:14 6m Talk	Optimizing Duplicate Size Thresholds in IDEs. Industry Track. Konstantin Grotov (JetBrains Research, Constructor University), Sergey Titov (JetBrains Research), Alexandr Suhinin (JetBrains), Yaroslav Golubev (JetBrains Research), Timofey Bryksin (JetBrains Research). Pre-print available.
12:20 12m Talk	Boosting Just-in-Time Defect Prediction with Specific Features of C Programming Languages in Code Changes. Technical Papers. Chao Ni (Zhejiang University), xiaodanxu (College of Computer Science and Technology, Zhejiang University), Kaiwen Yang (Zhejiang University), David Lo (Singapore Management University).