ICST 2023
Sun 16 - Thu 20 April 2023 Dublin, Ireland
Thu 20 Apr 2023 14:30 - 15:00 at Macken - Session 3

Any software development activity, small or large, benefits from fast turnaround times. Our unique distributed compilation framework at SAP enables us to build large projects, such as HANA, using all available resources. The key goal of our proposed method is to reduce build times, speed up the development cycle, and cut hardware costs by using our infrastructure efficiently. Compile jobs are inherently complex, non-linear graph-transformation tasks whose memory usage and run time are hard to predict. The resulting memory pressure can cause out-of-memory situations. To address this issue, we present a machine-learning-based method that predicts the memory consumption of compile jobs and maximizes the number of parallel compile jobs on the available hardware.

Typically, the number of compile jobs per host is tuned manually based on expert domain knowledge. The maximal memory consumption per job is determined once for all compile jobs, and the number of jobs per host is derived from the available CPU cores and installed memory. Such a static approach necessarily both over- and underestimates the true memory usage and cannot adapt dynamically when memory requirements change.
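For illustration, a minimal sketch of such a static heuristic; all constants here are hypothetical examples, not our actual production values:

```python
# Static heuristic: one fixed memory estimate applied to every compile job.
# All values are hypothetical illustrations.
CORES_PER_HOST = 64
MEMORY_PER_HOST_GB = 256
ASSUMED_JOB_MEMORY_GB = 4  # single worst-case estimate for all jobs

def static_jobs_per_host() -> int:
    """Derive the parallel job count from cores and installed memory."""
    by_memory = MEMORY_PER_HOST_GB // ASSUMED_JOB_MEMORY_GB
    return min(CORES_PER_HOST, by_memory)

print(static_jobs_per_host())  # 64 cores vs. 64 memory slots -> 64 jobs
```

If the single estimate is too high, cores sit idle; if it is too low, parallel jobs can exhaust memory together.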

The hypothesis of our work is that the true memory usage depends on the source code in the files being compiled. Based on that hypothesis, we present a novel CI/CD task that predicts the memory consumption of compile jobs solely from the content of the source files. Furthermore, we use this information to schedule the maximal number of parallel jobs, which allows us to reliably utilize our hardware to the fullest.
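One way such prediction-driven scheduling could look is sketched below; the greedy admission policy and all names are assumptions for illustration, not our exact scheduler:

```python
def admit_jobs(pending, predicted_mem_gb, host_mem_gb):
    """Greedily admit as many parallel compile jobs as the predicted
    memory budget allows. predicted_mem_gb maps each source file to the
    upper bound of its predicted memory class (hypothetical interface)."""
    admitted, used = [], 0.0
    for job in pending:
        need = predicted_mem_gb[job]
        if used + need <= host_mem_gb:
            admitted.append(job)
            used += need
    return admitted

jobs = ["a.cpp", "b.cpp", "c.cpp"]
pred = {"a.cpp": 2.0, "b.cpp": 11.0, "c.cpp": 2.0}
print(admit_jobs(jobs, pred, host_mem_gb=8.0))  # ['a.cpp', 'c.cpp']
```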

The driving constraints of our approach are to develop an understandable, observable, compute-efficient, and portable learning pipeline that can be integrated into our existing distributed compilation framework. For this purpose, we extract token n-grams, weight them with term frequency-inverse document frequency (TF-IDF), and feed them into a multinomial naive Bayes classifier.
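A minimal sketch of such a pipeline, assuming scikit-learn's TfidfVectorizer and MultinomialNB; the tokenization settings and training data are placeholders:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Token n-grams over source text; a real setup would use a
# source-code-aware tokenizer instead of the default word analyzer.
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
model = make_pipeline(vectorizer, MultinomialNB())

# Hypothetical training data: file contents and their memory class (0-4).
sources = ["#include <map>\nint main() { return 0; }",
           "template <typename T> struct Heavy { T data[1024]; };"]
labels = [0, 4]
model.fit(sources, labels)
print(model.predict(["#include <vector>\nint main() { return 0; }"]))
```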

Memory usage is a continuous target, which we must divide into discrete bins to use a Bayesian classifier. We discretize its value into five classes and predict the target memory class for each source file. The learned parameters of our pipeline are thus probabilities indicating which tokens contribute most to which memory class.
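The discretization step might look as follows; the bin edges below are hypothetical, since the actual class boundaries are not stated here:

```python
import numpy as np

# Hypothetical bin edges in GB separating the five memory classes.
BIN_EDGES_GB = np.array([1.0, 2.0, 4.0, 8.0])

def memory_class(peak_gb: float) -> int:
    """Map a continuous peak-memory measurement to one of five classes."""
    return int(np.digitize(peak_gb, BIN_EDGES_GB))

print([memory_class(x) for x in (0.5, 1.5, 3.0, 6.0, 11.0)])  # [0, 1, 2, 3, 4]
```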

To prevent out-of-memory crashes, it is crucial to identify source files with high memory consumption during compilation. We assign the source files with the largest memory usage (11 GB) to their correct memory class with an accuracy of 89%.