Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection (ICSE 2025 - Research Track)

Who

Qiaolin Qin, Heng Li, Ettore Merlo, Maxime Lamothe

Track

ICSE 2025 Research Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Fri 2 May 2025 14:00 - 14:15 at 210 - Security and Analysis 3 Chair(s): Adriana Sejfia

Abstract

With the advent of data-centric and machine learning (ML) systems, data quality is playing an increasingly critical role in ensuring the overall quality of software systems. Alas, data preparation, an essential step towards high data quality, is known to be a highly effort-intensive process. Although prior studies have dealt with one of the most impactful issues, data pattern violations, we observe that these studies usually require data-specific configurations (i.e., parameterized) or a certain set of fully curated data as learning examples (i.e., supervised). Both approaches require domain knowledge and depend on users’ deep understanding of their data, and are often effort-intensive. In this paper, we introduce RIOLU: Regex Inferencer autO-parameterized Learning with Uncleaned data. RIOLU is fully automated, is automatically parameterized, and does not need labeled samples. We observe that RIOLU can generate precise patterns from datasets in various domains, with a high F1 score of 97.2%, exceeding the state-of-the-art baseline. In addition, according to our experiment on five datasets with anomalies, RIOLU can automatically estimate a data column’s error rate, draw normal patterns, and predict anomalies from unlabeled data with higher performance (up to 800.4% improvement in terms of F1) than the state-of-the-art baseline. Furthermore, RIOLU can even outperform ChatGPT in terms of both accuracy (12.3% higher F1) and efficiency (10% less inference time). With user involvement, a variation (a guided version) of RIOLU can further boost its precision (up to 37.4% improvement in terms of F1). Our evaluation in an industrial setting further demonstrates the practical benefits of RIOLU.
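
To make the notion of a "data pattern violation" concrete, here is a minimal Python sketch. It is not RIOLU's algorithm, which this page does not describe. It assumes a naive scheme in which each value is generalized into a regex template by character class, the most frequent template in a column is treated as the normal pattern, and any value that does not fully match it is flagged. The function names `char_class`, `to_template`, and `flag_violations` and the sample column are hypothetical.

```python
import re
from collections import Counter
from itertools import groupby

# Tokens that stand for a whole class of characters (an illustrative choice).
CLASS_TOKENS = (r"\d", "[A-Za-z]")


def char_class(ch: str) -> str:
    """Map one character to a coarse regex token."""
    if ch.isascii() and ch.isdigit():
        return r"\d"
    if ch.isascii() and ch.isalpha():
        return "[A-Za-z]"
    return re.escape(ch)  # keep punctuation and other symbols as literals


def to_template(value: str) -> str:
    """Generalize a string into a regex, collapsing runs of one class."""
    parts = []
    for token, run in groupby(char_class(ch) for ch in value):
        count = len(list(run))
        # Digit and letter runs become "one or more"; literals repeat as-is.
        parts.append(token + "+" if token in CLASS_TOKENS else token * count)
    return "".join(parts)


def flag_violations(column: list[str]) -> tuple[str, list[str]]:
    """Return the dominant template and the values that violate it.

    Assumes a non-empty column in which the normal pattern is the majority.
    """
    templates = Counter(to_template(value) for value in column)
    dominant, _ = templates.most_common(1)[0]
    pattern = re.compile(dominant)
    violations = [v for v in column if not pattern.fullmatch(v)]
    return dominant, violations


if __name__ == "__main__":
    dates = ["2025-05-02", "2024-12-09", "2023-1-15", "02/05/2025", "N/A"]
    dominant, violations = flag_violations(dates)
    print(dominant)
    print(violations)
```

In this sketch the only implicit parameter is the majority assumption; the abstract states that RIOLU estimates a column's error rate automatically, which a simple majority vote does not do.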

Link to Preprint

https://arxiv.org/abs/2412.05240

Qiaolin Qin

Polytechnique Montréal

Heng Li

Polytechnique Montréal

Canada

Ettore Merlo

Polytechnique Montréal

Canada

Maxime Lamothe

Polytechnique Montréal

Canada

Session Program

Fri 2 May
Displayed time zone: Eastern Time (US & Canada)

14:00 - 15:30	Security and Analysis 3, Research Track / SE In Practice (SEIP), at 210. Chair(s): Adriana Sejfia (University of Edinburgh)

14:00 15m Talk		Automated, Unsupervised, and Auto-parameterized Inference of Data Patterns and Anomaly Detection [Security, Research Track] Qiaolin Qin (Polytechnique Montréal), Heng Li (Polytechnique Montréal), Ettore Merlo (Polytechnique Montréal), Maxime Lamothe (Polytechnique Montréal)
14:15 15m Talk		On Prescription or Off Prescription? An Empirical Study of Community-prescribed Security Configurations for Kubernetes [Security, Research Track] Shazibul Islam Shamim (Auburn University), Hanyang Hu (Company A), Akond Rahman (Auburn University)
14:30 15m Talk		Similar but Patched Code Considered Harmful -- The Impact of Similar but Patched Code on Recurring Vulnerability Detection and How to Remove Them [Security, Research Track] Zixuan Tan (Zhejiang University), Jiayuan Zhou (Huawei), Xing Hu (Zhejiang University), Shengyi Pan (Zhejiang University), Kui Liu (Huawei), Xin Xia (Huawei)
14:45 15m Talk		TIVER: Identifying Adaptive Versions of C/C++ Third-Party Open-Source Components Using a Code Clustering Technique [Security, Research Track] Youngjae Choi (Korea University), Seunghoon Woo (Korea University)
15:00 15m Talk		A scalable, effective and simple Vulnerability Tracking approach for heterogeneous SAST setups based on Scope+Offset [Security, SE In Practice (SEIP)] James Johnson (--), Julian Thome (GitLab Inc.), Lucas Charles (GitLab Inc.), Hua Yan (GitLab Inc.), Jason Leasure (GitLab Inc.)
15:15 15m Talk		"ImmediateShortTerm3MthsAfterThatLOL": Developer Secure-Coding Sentiment, Practice and Culture in Organisations [Security, SE In Practice (SEIP)] Ita Ryan (University College Cork), Utz Roedig (University College Cork), Klaas-Jan Stol (Lero; University College Cork; SINTEF Digital)