Analyzing Duplication in JavaScript (BenchWork 2018)

Sun 15 - Sat 21 July 2018 Amsterdam, Netherlands

co-located with ECOOP and ISSTA 2018

Who

Petr Maj, Celeste Hollenbeck, Shabbir Hussain, Jan Vitek

Track

BenchWork 2018

When

Wed 18 Jul 2018 14:30 - 14:50 at Hanoi - JavaScript & Dynamic Behaviour

Abstract

When analyzing any corpus of programs, care must be taken to ensure that the corpus is truly representative of the entire ecosystem; otherwise the observed features might be far from reality. A naive approach is to increase the size of the dataset, thus diminishing the chance that an interesting feature will be left out. However, such an approach can easily overemphasize features that appear often in the corpus but are not frequently executed.

To tackle this issue, the code duplication patterns in the corpus and the ecosystem must be understood and correlated with the actual frequency of the code in the wild.

In our work we concentrate on the widely used JavaScript language. Originally the language of the web, JavaScript has recently been pushed to server-side and even desktop applications thanks to the node.js framework. Our analysis included all non-forked JavaScript repositories on GitHub, and we looked for different levels of file and project similarity. While clone ratios in popular languages can be reasonably high (40% for Java, 73% for Python), we have found that in JavaScript a staggering 94% of files are identical to the remaining 6%. At the level of whole projects the situation is similar, with almost half of JavaScript projects having over 50% of their files found in other projects.
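To make the two measurements above concrete, here is a minimal sketch of how file-level and project-level duplication could be computed. It is an illustration under stated assumptions, not the tooling used in this work: it treats two files as clones only when their bytes are identical, and the function names (`file_hash`, `clone_stats`) and the in-memory `projects` layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def file_hash(content: bytes) -> str:
    """Hash raw file bytes; byte-identical files get identical digests."""
    return hashlib.sha256(content).hexdigest()


def clone_stats(projects):
    """Compute duplication figures for a corpus.

    `projects` maps a project name to {path: file bytes}. This layout is
    an assumption made for the sketch, not a format from the study.
    Returns (duplicate_ratio, shared), where duplicate_ratio is the share
    of files that are copies of some other file in the corpus, and
    shared[name] is the share of that project's files whose content also
    appears in at least one other project.
    """
    owners = defaultdict(set)  # digest -> names of projects holding it
    total_files = 0
    for name, files in projects.items():
        for content in files.values():
            owners[file_hash(content)].add(name)
            total_files += 1

    # Every distinct digest counts once as an "original"; the rest are copies.
    duplicate_ratio = 1 - len(owners) / total_files if total_files else 0.0

    shared = {}
    for name, files in projects.items():
        if not files:
            continue  # skip empty projects to avoid dividing by zero
        hits = sum(
            1 for content in files.values()
            if len(owners[file_hash(content)]) > 1
        )
        shared[name] = hits / len(files)
    return duplicate_ratio, shared
```

Under this definition, a corpus where 94% of files are copies would yield a `duplicate_ratio` of 0.94, and a project counts toward the "almost half" figure when its entry in `shared` exceeds 0.5. Exact byte hashing misses near-duplicates such as minified or reformatted copies, so a real analysis would likely need coarser similarity levels as well.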

Deeper analysis revealed that the vast majority of the duplication is due to a few extremely popular frameworks (such as jQuery and express.js), and that in terms of files, our dataset was dominated by projects using node.js (over 70% of total files). Our ongoing investigation examines the JavaScript ecosystem in greater depth, with special attention paid to node.js applications and to the life cycle of the copied code. Ultimately, we want to create and maintain a library of JavaScript sources and their relationships that others can use to curate their own datasets.

Petr Maj

Czech Technical University

Celeste Hollenbeck

Northeastern University, USA

Shabbir Hussain

Northeastern University

Jan Vitek