The Limits of the Identifiable: Challenges in Python Version Identification with Deep Learning (SANER 2024 - Reproducibility Studies and Negative Results (RENE) Track)

Who

Marcus Gerhold, Lola Solovyeva, Vadim Zaytsev

Track

SANER 2024 Reproducibility Studies and Negative Results (RENE) Track

Time Zone

The program is currently displayed in (GMT+02:00) Athens.

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Wed 13 Mar 2024 14:00 - 14:15 at LAPPI - API and Dependency Analysis Chair(s): Martin Monperrus

Abstract

The evolution of Python requires accurate version identification to facilitate compatibility and ongoing support. We extend previous work on deep learning models for Python version identification, in which LSTM and CodeBERT achieved 92% accuracy on short code snippets. We expand these results to larger, realistic files, utilising code segmentation techniques for varying input granularities, ranging from per-line analysis to larger code segments. Our findings show that while LSTM with CodeBERT embeddings maintained high accuracy on short snippets, performance drops significantly on longer segments, particularly when balancing information retention against misclassification risk. Notably, import-statement analysis, despite being the most intuitive indicator of version requirements, reached only 30% accuracy, which exposes the limitations of our approach when encountering rare or user-defined modules. Overall, the findings expose the limitations of deep learning for language version identification and suggest that alternative approaches may be necessary for high accuracy on larger datasets.
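The input granularities mentioned in the abstract (per-line, larger segments, import statements only) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the fixed chunk size, and the regular expression are assumptions made for illustration. The sketch works on raw text rather than a parsed syntax tree, on the assumption that files written for an older Python version may not parse under the interpreter running the analysis.

```python
import re
from typing import List

# Hypothetical pattern: matches "import x" and "from x import y" lines.
IMPORT_RE = re.compile(r"^\s*(?:import\s+\S|from\s+\S+\s+import\b)")


def per_line_segments(source: str) -> List[str]:
    """Finest granularity: one segment per non-blank line."""
    return [line for line in source.splitlines() if line.strip()]


def chunk_segments(source: str, lines_per_chunk: int) -> List[str]:
    """Coarser granularity: consecutive non-blank lines grouped into chunks."""
    lines = per_line_segments(source)
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def import_segments(source: str) -> List[str]:
    """Import-only view: keep just the import statements."""
    return [line.strip() for line in source.splitlines() if IMPORT_RE.match(line)]


# A tiny Python 2 style sample (illustrative only).
SAMPLE = 'import urllib2\nfrom os import path\n\nprint "hello"\nx = 1\n'

lines = per_line_segments(SAMPLE)
chunks = chunk_segments(SAMPLE, 3)
imports = import_segments(SAMPLE)
```

Each resulting segment would then be passed to a classifier separately. How per-segment predictions are combined into a file-level version label is left open here, since the abstract does not specify it.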

Link to Preprint

https://grammarware.net/text/2024/identification.pdf

Marcus Gerhold

University of Twente, Netherlands

Lola Solovyeva

University of Twente, Netherlands

Vadim Zaytsev

University of Twente, Netherlands

Session Program

Wed 13 Mar
Displayed time zone: Athens

14:00 - 15:30	API and Dependency Analysis (Research Papers / Reproducibility Studies and Negative Results (RENE) Track) at LAPPI. Chair(s): Martin Monperrus, KTH Royal Institute of Technology

14:00 15m Talk		The Limits of the Identifiable: Challenges in Python Version Identification with Deep Learning Reproducibility Studies and Negative Results (RENE) Track Marcus Gerhold University of Twente, The Netherlands, Lola Solovyeva University of Twente, Vadim Zaytsev University of Twente, Netherlands Pre-print
14:15 15m Talk		Exploring Dependencies Among Inconsistencies to Enhance the Consistency Maintenance of Models Research Papers Luciano Marchezan Johannes Kepler Universität Linz, Wesley K.G. Assunção North Carolina State University, Edvin Herac, Saad Shafiq University of Southern California, Alexander Egyed Johannes Kepler University Linz
14:30 15m Talk		BUMP: A Benchmark of Reproducible Breaking Dependency Updates Research Papers Frank Reyes Garcia KTH Royal Institute of Technology, Yogya Gamage KTH Royal Institute of Technology, Gabriel Skoglund KTH Royal Institute of Technology, Benoit Baudry KTH, Martin Monperrus KTH Royal Institute of Technology
14:45 15m Talk		APIGen: Generative API Method Recommendation Research Papers Yujia Chen Harbin Institute of Technology, Shenzhen, Cuiyun Gao Harbin Institute of Technology, Muyijie Zhu Harbin Institute of Technology, Shenzhen, Qing Liao Harbin Institute of Technology, Yong Wang Anhui Polytechnic University, Guoai Xu Harbin Institute of Technology, Shenzhen
15:00 15m Talk		A Multi-Metric Ranking with Label Correlations Approach for Library Migration Recommendations Research Papers Jiancheng Zhang Southwest Petroleum University, Peng Wu Sichuan Tourism University, Qin Luo Southwest Petroleum University
15:15 15m Talk		Adaptoring: Adapter Generation to Provide an Alternative API for a Library Research Papers Lars Reimann University of Bonn, Günter Kniesel-Wünsche University of Bonn Pre-print