SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions (ICSE 2022 - Technical Track)

Sun 8 - Fri 27 May 2022

Who

Ripon Saha, Akira Ura, Sonal Mahajan, Chenguang Zhu, Linyi Li, Yang Hu, Hiroaki Yoshida, Sarfraz Khurshid, Mukul Prasad

Track

ICSE 2022 Technical Track

Time Zone

The program is currently displayed in (GMT-04:00) Eastern Time (US & Canada).

The GMT offsets shown reflect the offsets at the moment of the conference.

When

Mon 9 May 2022 22:05 - 22:10 at ICSE room 5 - Synthesis and Performance Chair(s): John Grundy
Wed 11 May 2022 13:20 - 13:25 at ICSE room 4 - Synthesis and Reverse Engineering Chair(s): Reed Milewicz

Abstract

Automatic machine learning, or AutoML, holds the promise of truly democratizing the use of machine learning (ML) by substantially automating the work of data scientists. However, the huge combinatorial search space of candidate pipelines means that current AutoML techniques generate sub-optimal pipelines, or none at all, especially on large, complex datasets. In this work, we propose an AutoML technique, SapientML, that can learn from a corpus of existing datasets and their human-written pipelines, and efficiently generate a high-quality pipeline for a predictive task on a new dataset. To combat the search space explosion of AutoML, SapientML employs a novel divide-and-conquer strategy, realized as a three-stage program synthesis approach that reasons on successively smaller search spaces. The first stage uses a machine-learned model to predict a set of plausible ML components to constitute a pipeline. In the second stage, this set is refined into a small pool of viable concrete pipelines using syntactic constraints derived from the corpus and the machine-learned model. In the third stage, dynamically evaluating these few pipelines provides the best solution. We instantiate SapientML as part of a fully automated toolchain that creates a cleaned, labeled learning corpus by mining Kaggle, learns from it, and then uses the learned models to synthesize pipelines for new predictive tasks. We have created a training corpus of 1094 pipelines spanning 170 datasets, and evaluated SapientML on a set of 41 benchmark datasets, including 10 new, large, real-world datasets from Kaggle, against 3 state-of-the-art AutoML tools and 2 baselines. Our evaluation shows that SapientML produces the best or comparable accuracy on 27 of the benchmarks, while the second-best tool fails to even produce a pipeline on 9 of the instances. This difference is amplified on the 10 most challenging benchmarks, where SapientML wins on 9 instances, with the other tools failing to produce pipelines on 4 or more benchmarks.
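The three stages described in the abstract can be illustrated with a toy sketch. This is not SapientML's implementation: the component names, the hand-written rule standing in for the machine-learned model in stage 1, the single ordering constraint in stage 2, and the tiny dataset are all assumptions made for illustration, and only the Python standard library is used.

```python
from itertools import permutations
from statistics import mean

# Hypothetical component catalogue; the names are illustrative only.
PREPROCESSORS = {"impute_mean", "scale_minmax"}
MODELS = {"majority_class", "nearest_centroid"}
# Hypothetical ordering constraint: imputation must come before scaling.
ORDER = [("impute_mean", "scale_minmax")]


def predict_components(X):
    """Stage 1 stand-in: a hand-written rule replaces the learned model."""
    chosen = set(MODELS)
    if any(v is None for row in X for v in row):
        chosen.add("impute_mean")
    ranges = []
    for col in zip(*X):
        known = [v for v in col if v is not None]
        ranges.append(max(known) - min(known))
    if max(ranges) > 10 * max(min(ranges), 1e-9):
        chosen.add("scale_minmax")
    return chosen


def enumerate_pipelines(components):
    """Stage 2: keep only step orders that satisfy the ordering constraints."""
    pre = sorted(components & PREPROCESSORS)
    orders = [p for p in permutations(pre)
              if all(p.index(a) < p.index(b)
                     for a, b in ORDER if a in p and b in p)]
    return [(o, m) for o in orders for m in sorted(components & MODELS)]


def fit_preprocessor(name, X):
    cols = list(zip(*X))
    if name == "impute_mean":
        means = [mean(v for v in c if v is not None) for c in cols]
        return lambda row: [m if v is None else v for v, m in zip(row, means)]
    lows = [min(c) for c in cols]  # scale_minmax assumes no missing values
    spans = [(max(c) - min(c)) or 1.0 for c in cols]
    return lambda row: [(v - lo) / s for v, lo, s in zip(row, lows, spans)]


def fit_model(name, X, y):
    if name == "majority_class":
        top = max(set(y), key=y.count)
        return lambda row: top
    centroids = {label: [mean(c) for c in zip(*[r for r, t in zip(X, y) if t == label])]
                 for label in set(y)}
    return lambda row: min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(row, centroids[lab])))


def evaluate(pipeline, train, holdout):
    """Stage 3 helper: fit on train, return accuracy on the hold-out split."""
    steps, model_name = pipeline
    (X_tr, y_tr), (X_ho, y_ho) = train, holdout
    for step in steps:
        f = fit_preprocessor(step, X_tr)
        X_tr, X_ho = [f(r) for r in X_tr], [f(r) for r in X_ho]
    model = fit_model(model_name, X_tr, y_tr)
    return sum(model(r) == t for r, t in zip(X_ho, y_ho)) / len(y_ho)


def synthesize(train, holdout):
    """Run the three stages and return the best-scoring candidate pipeline."""
    candidates = enumerate_pipelines(predict_components(train[0]))
    return max(candidates, key=lambda p: evaluate(p, train, holdout))


# Made-up data: one missing value and very different column scales.
train = ([[1.0, 100.0], [2.0, None], [3.0, 300.0], [8.0, 900.0], [9.0, 1000.0]],
         [0, 0, 0, 1, 1])
holdout = ([[1.5, 150.0], [8.5, 950.0]], [0, 1])
best = synthesize(train, holdout)
print(best, evaluate(best, train, holdout))
```

The point of the sketch is the shrinking search space: stage 1 narrows the catalogue to a few components, stage 2 turns them into a handful of valid orderings, and only those few candidates are actually executed in stage 3.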

Link to Preprint

https://arxiv.org/pdf/2202.10451.pdf

Ripon Saha
Akira Ura, Fujitsu Ltd.
Sonal Mahajan, Uber Technologies Inc., United States
Chenguang Zhu, University of Texas at Austin, United States
Linyi Li, University of Illinois at Urbana-Champaign, United States
Yang Hu, The University of Texas at Austin, United States
Hiroaki Yoshida, AMD, United States
Sarfraz Khurshid, The University of Texas at Austin
Mukul Prasad, Fujitsu Research of America, United States

Session Program

Mon 9 May
Displayed time zone: Eastern Time (US & Canada)

22:00 - 23:00	Synthesis and Performance (Technical Track / SEIP - Software Engineering in Practice) at ICSE room 5. Chair(s): John Grundy (Monash University)

5m Talk	Toward Among-Device AI from On-Device AI with Stream Pipelines (SEIP - Software Engineering in Practice)
	MyungJoo Ham (Samsung Electronics), Sangjung Woo (Samsung Electronics), Jaeyun Jung (Samsung Electronics), Wook Song (Samsung Electronics), Gichan Jang (Samsung Electronics), Yongjoo Ahn (Samsung Electronics), Hyoungjoo Ahn (Samsung Electronics)
5m Talk	SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions (Technical Track)
	Ripon Saha, Akira Ura (Fujitsu Ltd.), Sonal Mahajan (Uber Technologies Inc.), Chenguang Zhu (University of Texas at Austin), Linyi Li (University of Illinois at Urbana-Champaign), Yang Hu (The University of Texas at Austin), Hiroaki Yoshida (AMD), Sarfraz Khurshid (The University of Texas at Austin), Mukul Prasad (Fujitsu Research of America)
5m Talk	Automatic Detection of Performance Bugs in Database Systems using Equivalent Queries (Technical Track)
	Xinyu Liu (Georgia Institute of Technology), Qi Zhou (Facebook), Joy Arulraj (Georgia Institute of Technology), Alessandro Orso (Georgia Tech)

Wed 11 May
Displayed time zone: Eastern Time (US & Canada)

13:00 - 14:00	Synthesis and Reverse Engineering (Technical Track / Journal-First Papers) at ICSE room 4. Chair(s): Reed Milewicz (Sandia National Laboratories)

5m Talk	Learning to Find Usages of Library Functions in Optimized Binaries (Journal-First Papers)
	Toufique Ahmed (University of California at Davis), Prem Devanbu (Department of Computer Science, University of California, Davis), Anand Ashok Sawant (University of California, Davis)
5m Talk	Dynamic Update for Synthesized GR(1) Controllers (Technical Track)
	Gal Amram (Tel Aviv University), Shahar Maoz (Tel Aviv University, Israel), Itai Segall (Nokia Bell-Labs), Matan Yossef (Tel Aviv University)
5m Talk	Push-Button Synthesis of Watch Companions for Android Apps (Technical Track)
	Cong Li (Nanjing University), Yanyan Jiang (Nanjing University), Chang Xu (Nanjing University)
5m Talk	Jigsaw: Large Language Models meet Program Synthesis (Technical Track)
	Naman Jain (Microsoft Research), Skanda Vaidyanath (Stanford), Arun Iyer (Microsoft Research, India), Nagarajan Natarajan (Microsoft Research, India), Suresh Parthasarathy (Microsoft Research, India), Sriram Rajamani (Microsoft Research), Rahul Sharma (Microsoft Research)
5m Talk	SapientML: Synthesizing Machine Learning Pipelines by Learning from Human-Written Solutions (Technical Track)
	Ripon Saha, Akira Ura (Fujitsu Ltd.), Sonal Mahajan (Uber Technologies Inc.), Chenguang Zhu (University of Texas at Austin), Linyi Li (University of Illinois at Urbana-Champaign), Yang Hu (The University of Texas at Austin), Hiroaki Yoshida (AMD), Sarfraz Khurshid (The University of Texas at Austin), Mukul Prasad (Fujitsu Research of America)
5m Talk	Static Stack-Preserving Intra-Procedural Slicing of WebAssembly Binaries (Technical Track, Best Artifact Award)
	Quentin Stiévenart (Vrije Universiteit Brussel), David Binkley (Loyola University Maryland), Coen De Roover (Vrije Universiteit Brussel)

Information for Participants

Mon 9 May 2022 22:00 - 23:00 at ICSE room 5 - Synthesis and Performance Chair(s): John Grundy

Wed 11 May 2022 13:00 - 14:00 at ICSE room 4 - Synthesis and Reverse Engineering Chair(s): Reed Milewicz