The Language of Programming: On the Vocabulary of Names (APSEC 2022 - Technical Track)

Who

Nitsan Amit, Dror Feitelson

Track

APSEC 2022 Technical Track

When

Thu 8 Dec 2022 15:20 - 15:40 at Room2 - Empirical Studies 2 Chair(s): Yusuf Sulistyo Nugroho

Abstract

Most of the text in a computer program is composed of the names of variables and functions. These names are selected by one developer, and need to be understood by others. This is similar to the role of words written in natural language. But there are several marked differences between the names in a program and the words in a book. First, names are frequently composed of multiple existing words, in an attempt to capture nuanced meanings and intents. Second, because of the use of multiple words, names can be rather long. Third, conventions may also allow names to be very short, and many single-letter names are used. But despite these differences, the general statistics of names are rather similar to the statistics of words. Like words, the distribution of names is close to a Zipf distribution. Also, popular names tend to be shorter than rarely used names. However, the underlying vocabulary is different. The composition of words leads to a more diverse vocabulary that can grow without bounds. But if we look at the individual words used in compound names, we find a rather limited vocabulary. These properties help explain the predictability of software, and how it can coincide with the large variability of names. They also suggest that it may be beneficial to model programs at the level of individual words rather than at the level of source code tokens.
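The distinction between the vocabulary of whole names and the vocabulary of the words inside them can be illustrated with a small sketch. It is not the authors' tooling: the splitting rule (underscores plus camelCase boundaries), the helper names `split_name` and `vocabulary_sizes`, and the sample identifiers are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_name(name):
    """Split a compound identifier into lowercase words.

    Assumed rule (hypothetical, for illustration): split on underscores,
    digits and other non-letters, then on camelCase boundaries, keeping
    acronym runs such as "HTTP" together.
    """
    words = []
    for part in re.split(r"[_\W\d]+", name):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def vocabulary_sizes(names):
    """Return (distinct whole names, distinct constituent words)."""
    name_counts = Counter(names)
    word_counts = Counter(w for n in names for w in split_name(n))
    return len(name_counts), len(word_counts)


# Toy sample of identifiers; not data from the paper.
sample = [
    "maxFileSize", "minFileSize", "file_size", "fileName",
    "max_size", "min_size", "file_name", "i",
]

n_names, n_words = vocabulary_sizes(sample)
print(f"distinct names: {n_names}, distinct words: {n_words}")
```

On this toy sample the same few words are recombined into a larger set of names, which is the kind of effect the abstract describes; a real measurement would need a large code corpus and a more careful splitter.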

Nitsan Amit

Hebrew University

Israel

Dror Feitelson