In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to embed sub-tokens from 120,000 GitHub repositories in 200 languages into a dense space. Then, we cluster the embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens in the 16 most popular programming languages, computes the project's cluster distribution, and finds the projects with the closest distributions in the search base. We labeled the sub-token clusters with short descriptions to enable Sosed to produce interpretable output.
Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/identifiers-extractor/.
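The comparison step described in the abstract can be illustrated with a minimal sketch. This is not Sosed's implementation: the sub-token splitter, the toy `TOKEN_TO_CLUSTER` mapping, the three cluster labels, the project names, and the choice of cosine similarity are all assumptions made for illustration (the abstract does not state which distance measure is used, and the real clusters come from fastText embeddings rather than a hand-written table).

```python
import math
import re

# Hypothetical toy mapping from sub-token to cluster id. In the tool, clusters
# are derived from fastText embeddings; here they are written by hand.
TOKEN_TO_CLUSTER = {
    "http": 0, "request": 0, "response": 0, "socket": 0,
    "matrix": 1, "vector": 1, "tensor": 1,
    "button": 2, "window": 2, "click": 2,
}
CLUSTER_LABELS = ["networking", "linear algebra", "GUI"]  # illustrative labels


def split_subtokens(identifier):
    """Split an identifier on snake_case and camelCase boundaries, lowercased."""
    subtokens = []
    for part in re.split(r"[^A-Za-z]+", identifier):
        matches = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part)
        subtokens.extend(m.lower() for m in matches)
    return subtokens


def cluster_distribution(identifiers, n_clusters=len(CLUSTER_LABELS)):
    """Return the normalized share of each cluster among a project's sub-tokens."""
    counts = [0.0] * n_clusters
    for identifier in identifiers:
        for token in split_subtokens(identifier):
            cluster = TOKEN_TO_CLUSTER.get(token)
            if cluster is not None:  # sub-tokens outside every cluster are skipped
                counts[cluster] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def cosine(a, b):
    """Cosine similarity between two distributions (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, search_base, k=1):
    """Return the names of the k projects whose distributions are closest to query."""
    ranked = sorted(search_base.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]


# Hypothetical search base: project name -> identifiers found in its code.
SEARCH_BASE = {
    "web-server": cluster_distribution(["handleHttpRequest", "send_response", "openSocket"]),
    "math-lib": cluster_distribution(["matrixMultiply", "vector_norm", "TensorShape"]),
    "ui-kit": cluster_distribution(["onButtonClick", "MainWindow"]),
}

query = cluster_distribution(["parseHTTPResponse", "request_timeout", "socketPool"])
print(most_similar(query, SEARCH_BASE))
```

In this toy setup, the cluster labels also hint at how interpretable output could be produced: the largest entries of a project's distribution point to the labels that best describe it.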
Tue 22 Sep (displayed time zone: UTC, Coordinated Universal Time)
16:00 - 17:00
|Subdomain-Based Generality-Aware Debloating|
|Revisiting the relationship between fault detection, test adequacy criteria, and test set size.|
Yiqun Chen University of Washington, Rahul Gopinath CISPA Helmholtz Center for Information Security, Anita Tadakamalla George Mason University, USA, Michael D. Ernst University of Washington, USA, Reid Holmes University of British Columbia, Gordon Fraser University of Passau, Paul Ammann George Mason University, USA, René Just University of Washington, USA
|WASim: Understanding WebAssembly Applications through Classification|
|Sosed: a tool for finding similar software projects|