Skill over Scale: The Case for Medium, Domain-Specific Models for SE
Recent advancements in AI have sparked a trend toward large, generalist language models that handle a multitude of tasks, including many code-related ones. While these models are expensive to train and are often closed-source, they have enjoyed broad adoption because they tend to outperform smaller, domain-specific models of code. In this work, we argue that this is not a foregone conclusion. We show that modestly sized domain-specific models can outperform much larger ones on code labeling tasks, provided they are trained to the same standards. Concretely, we focus on StackOverflow (SO), which offers large volumes of aligned code and text data. We align established best practices for pre-training large language models with the properties of StackOverflow as a data source, notably a large context window (2,048 tokens), and use a powerful toolkit (Megatron-LM) to train two models: SOBertBase, with 125M parameters, and SOBertLarge, with 762M parameters, at a cost of just $374 and $1,600, respectively. We compare the performance of our models with a prior domain-specific model that did not adopt many of these practices (BERTOverflow), as well as two general-purpose BERT models (BERTBase and BERTLarge) and two models from OpenAI's GPT series (GPT-3.5 and GPT-4). We study four labeling tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction. The final task is a new benchmark we introduce, on which we additionally compare SOBert with fine-tuned CodeLlama and StackLlama models (which have roughly 10x more parameters than SOBertLarge). Our models, including the smaller one, consistently outperform all baselines. In contrast, BERTOverflow is outperformed by the generalist models on most tasks. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source, general-purpose models. Both models are released to the public on Hugging Face, where they have been downloaded over 500 times in the last month alone.
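Since the models are released on Hugging Face and are evaluated on labeling tasks, the sketch below illustrates how one might load a SOBert checkpoint for a binary classification task such as question quality prediction. This is a minimal example, not the authors' exact pipeline; the model identifier `mmukh/SOBertBase` and the two-label setup are assumptions and should be checked against the official release.

```python
# Minimal sketch: loading a released SOBert checkpoint for a StackOverflow
# labeling task (e.g., question quality prediction). The classification head
# is freshly initialized here and would still need fine-tuning on task data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "mmukh/SOBertBase"  # assumed Hugging Face hub name; verify before use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,  # e.g., high- vs. low-quality question
)

# SOBert was pre-trained with a 2,048-token context window, so long posts
# (natural-language text plus embedded code) can be kept largely intact.
post = "How do I reverse a list in Python?\n\n<code>my_list[::-1]</code>"
inputs = tokenizer(post, truncation=True, max_length=2048, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```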