ICSE 2023
Sun 14 - Sat 20 May 2023 Melbourne, Australia

DeepTest is a high-quality workshop for research at the intersection of Machine Learning (ML) and software engineering (SE). ML is widely adopted in modern software systems, including safety-critical domains such as autonomous cars, medical diagnosis, and aircraft collision avoidance systems. Thus, it is crucial to rigorously test such applications to ensure high dependability. However, standard notions of software quality and reliability fall short for ML systems, due to their non-deterministic nature and the lack of a transparent understanding of the models’ semantics. ML is also expected to revolutionize software development. Indeed, ML is being applied to devise novel program analysis and software testing techniques for malware detection, fuzz testing, bug finding, and type checking.

The workshop will combine academia and industry in a quest for well-founded practical solutions. The aim is to bring together an international group of researchers and practitioners with both ML and SE backgrounds to discuss their research, share datasets, and generally help the field build momentum. The workshop will consist of invited talks, presentations based on research paper submissions, and one or more panel discussions, where all participants are invited to share their insights and ideas.


Mon 15 May

Displayed time zone: Hobart

11:00 - 12:30
Session 1: DeepTest at Meeting Room 209


Day opening

Testing Autonomous Driving Systems
Baishakhi Ray, Columbia University
13:45 - 15:15
Metamorphic Testing of Machine Translation Models using Back Translation
Wentao Gao, University of Melbourne; Jiayuan He, RMIT University; Van-Thuan Pham, Monash University
A Method of Identifying Causes of Prediction Errors to Accelerate MLOps
Keita Sakuma, NEC Corporation; Ryuta Matsuno, NEC Corporation; Yoshio Kameda, NEC Corporation
DeepSHAP Summary for Adversarial Example Detection
Yi-Ching Lin, National Chengchi University; Fang Yu, National Chengchi University
DeepPatch: A Patching-Based Method for Repairing Deep Neural Networks
Hao Bu, Peking University; Meng Sun, Peking University
15:45 - 17:15
Testing Generative Large Language Model: Mission Impossible or Where Lies the Path?
Zhenchang Xing, CSIRO’s Data61; Australian National University

Day closing

Call for Papers

NOTICE (09 Jan 2023): Only those who submitted by the original deadline (January 13, 2023) will be given one more week to update their submitted papers. Please note that no new submissions will be accepted after the original deadline.

DeepTest is an interdisciplinary workshop targeting research at the intersection of software engineering and deep learning. This workshop will explore issues related to:

  • Deep Learning applied to Software Engineering (DL4SE)
  • Software Engineering applied to Deep Learning (SE4DL)

Although the main focus is on Deep Learning, we also encourage submissions that are more broadly related to Machine Learning, as well as submissions related to (Deep) Reinforcement Learning.

Topics of Interest

We welcome submissions introducing technology (e.g., frameworks, libraries, program analyses, and tool evaluations) for testing DL-based applications, as well as DL-based solutions to open research problems (e.g., what constitutes a bug in a DL/RL model). Relevant topics include, but are not limited to:

  • High-quality benchmarks for evaluating DL/RL approaches
  • Surveys and case studies using DL/RL technology
  • Techniques to improve the interpretability of DL/RL models
  • Techniques to improve the design of reliable DL/RL models
  • DL/RL-aided software development approaches
  • DL/RL for fault prediction, localization and repair
  • Fuzzing DL/RL systems
  • Metamorphic testing as software quality assurance
  • Fault localization and anomaly detection
  • Use of DL for analyzing natural language-like artefacts such as code, or user reviews
  • DL/RL techniques to support automated software testing
  • DL/RL to aid program comprehension, program transformation, and program generation
  • Safety and security of DL/RL based systems
  • New approaches to estimate and measure uncertainty in DL/RL models
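
As a concrete illustration of one topic above, metamorphic testing checks relations between the outputs of related inputs instead of requiring a ground-truth oracle. A minimal sketch follows; the model and the metamorphic relation are hypothetical stand-ins chosen for illustration, not from any DeepTest paper:

```python
# Toy metamorphic test: for a model that should be permutation-invariant
# over its input features, reordering the features must not change the
# output. No ground-truth label is needed -- only the relation is checked.

def score(features):
    """Stand-in for a learned, permutation-invariant model (here: the mean)."""
    return sum(features) / len(features)

def metamorphic_permutation_test(model, features):
    """Return True if the model output is stable under a feature permutation."""
    original = model(features)
    follow_up = model(list(reversed(features)))  # follow-up (metamorphic) input
    return abs(original - follow_up) < 1e-9

# The relation holds for the toy model above:
assert metamorphic_permutation_test(score, [0.2, 0.5, 0.9])
```

The same pattern scales to real DL systems by swapping in a semantics-preserving input transformation (e.g., paraphrasing for NLP models) as the follow-up input generator.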

Types of Submissions

We accept two types of submissions:

  • Full research papers of up to 8 pages (including references) describing original and unpublished results related to the workshop topics;
  • Short papers of up to 4 pages (including references) describing preliminary work, new insights into previous work, or demonstrations of testing-related tools and prototypes.

All submissions must conform to the ICSE 2023 formatting instructions. All submissions must be in PDF. The page limit is strict.

Submissions must conform to the IEEE Conference Proceedings Formatting Guidelines (title in 24pt font and full text in 10pt; LaTeX users must use \documentclass[10pt,conference]{IEEEtran} without the compsoc or compsocconf options).
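
For LaTeX users, a minimal skeleton consistent with these instructions might look as follows (the title, section names, and text are placeholders; the author block is anonymized per the double-blind policy below):

```latex
\documentclass[10pt,conference]{IEEEtran}

\begin{document}

% The IEEEtran class typesets the title at the required 24pt.
\title{Paper Title (Placeholder)}
% Author names omitted for double-blind review.
\author{\IEEEauthorblockN{Anonymous Author(s)}}
\maketitle

\begin{abstract}
Abstract text.
\end{abstract}

\section{Introduction}
Body text in 10pt.

\end{document}
```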

DeepTest 2023 will employ a double-blind review process. Thus, no submission may reveal its authors’ identities. The authors must make every effort to honor the double-blind review process. In particular, the authors’ names must be omitted from the submission, and references to their prior work should be in the third person.

The official publication date of accepted papers is the date the proceedings are made available in the ACM or IEEE Digital Libraries. This date may be up to two weeks prior to the first day of ICSE 2023. The official publication date affects the deadline for any patent filings related to published work. Purchase of additional pages in the proceedings is not allowed.

If you have any questions or wonder whether your submission is in scope, please do not hesitate to contact the organizers.

Important Dates

  • Paper Submission: January 13, 2023 (AoE) (NEW: you can update your submitted paper until January 20 only if the initial submission was made by January 13)
  • Acceptance Notification: February 24, 2023 (AoE)
  • Camera Ready: March 17, 2023 (AoE)
  • Workshop Date: May 15, 2023

Submission System


Special Issue

Authors of DeepTest 2023 papers are encouraged to submit revised, extended versions of their manuscripts for the special issue in the Empirical Software Engineering (EMSE) journal, edited by Springer (details will follow). The call is also open to non-DeepTest 2023 authors.


Organizers

  • Matteo Biagiola, Università della Svizzera italiana, Switzerland
  • Nicolás Cardozo, Universidad de los Andes, Colombia
  • Foutse Khomh, Polytechnique Montréal, Canada
  • Vincenzo Riccio, Università della Svizzera italiana, Switzerland
  • Donghwan Shin, University of Sheffield, United Kingdom
  • Andrea Stocco, Università della Svizzera italiana, Switzerland

DeepTest 2023 will feature two keynotes from the following distinguished speakers.

Keynote 1 - Testing Autonomous Driving Systems


Baishakhi Ray | Associate Professor, Columbia University


Recent years have seen rapid progress in Autonomous Driving Systems (ADSs). To ensure the safety and reliability of these systems, extensive testing is required. However, direct testing on the road is prohibitively expensive, and it is unrealistic to cover all critical scenarios that way. A popular alternative is to evaluate an ADS’s performance in well-designed challenging scenarios, a.k.a. scenario-based testing. Such test cases must possess several desirable properties (e.g., being failure-inducing and realistic) to be useful. However, the search space of such test cases can be huge due to the temporal nature of traffic scenarios. In this talk, I will cover our recent efforts in efficiently generating testing scenarios: 1) AutoFuzz, a grammar-based, learning-guided black-box fuzzing technique that generates failure-inducing scenarios for ADSs; 2) FusED, an evolutionary and causality-based domain-specific grey-box fuzzing framework that generates failure-inducing scenarios for the fusion component of ADSs; and 3) CTG, a Signal Temporal Logic (STL)-guided conditional diffusion model that generates realistic and user-controllable scenarios for ADSs.


Baishakhi Ray is an Associate Professor in the Department of Computer Science at Columbia University, NY, USA. She has received the prestigious IEEE TCSE Rising Star award and the NSF CAREER award. Baishakhi’s research interests lie at the intersection of Software Engineering and Machine Learning. Her research has been recognized with several Distinguished Paper awards, has been published in CACM Research Highlights, and has been widely covered in trade media.

Keynote 2 - Testing Generative Large Language Model: Mission Impossible or Where Lies the Path?


Zhenchang Xing | Associate Professor, Australian National University


OpenAI’s ChatGPT, a generative language model, has attracted widespread attention from industry, academia, and the public for its impressive natural language processing capabilities. Although we know how to train such generative language models, we do not know how these models can solve such a diverse range of open-ended tasks. Every time we “prompt program” a large language model to complete a task, we create a customized version of the language model, which exhibits different abilities and outputs than other customized versions.

Some people believe that the emergent capabilities of large language models are turning AI from engineering into natural science, as it is hard to think of these models as being designed for a specific purpose in the traditional sense. As our focus shifts from ensuring design and construction correctness to trying to explore and understand un-designed AI products and behaviors, we need to consider the methodological challenges posed by this transformation.

For example, will differential testing, metamorphic testing, and adversarial testing, which are effective for testing discriminative models on specific tasks, no longer be the saviors of open-ended task testing for large language models? How can we test and correct ethical issues and hallucinations in generative AI? Due to the emergent capabilities of large language models, which are customized through in-context learning, will we face problems similar to the Schrödinger’s cat problem in quantum physics? If observation and measurement have a fundamental impact on the observed object, can we still fully test the essence of large language models, or can we only test the appearances of a specific customized version? Large language models are changing the way humans interact with AI; what adjustments do we need to make to our existing data- and algorithm-centric MLOps? There may be many unknown problems.

In this talk, I will share my thoughts (or even confusion) on these questions, along with some tentative ideas for action (which may well be wrong), hoping to inspire the community to explore the feasibility and methodology of testing generative large language models.


Dr. Zhenchang Xing is a Senior Principal Research Scientist and the Science Leader of the SE4AI team at CSIRO’s Data61. Dr. Xing’s current research focus is on the interdisciplinary fields of software engineering, human-computer interaction, and artificial intelligence, with a particular emphasis on knowledge graph methods and behavior analysis techniques to improve software development efficiency and software quality, as well as new software engineering methods, technologies, and tools to ensure Responsible AI. Dr. Xing has over 190 peer-reviewed publications in prestigious journals and conferences and has received multiple distinguished paper awards from top software engineering conferences, including the ACM SIGSOFT Distinguished Paper Award, the IEEE TCSE Distinguished Paper Award, and the ACM SIGSOFT Most Influential Paper Award. Dr. Xing frequently serves the academic community on the organizing and program committees of top software engineering conferences.

Questions? Use the DeepTest contact form.