ASE 2025
Sun 16 - Thu 20 November 2025 Seoul, South Korea

This program is tentative and subject to change.

Mon 17 Nov 2025 16:40 - 16:50 at Grand Hall 3 - Autonomous Systems

Time series anomaly detection (TSAD) is essential for ensuring the safety and reliability of aerospace software systems. Although large language models (LLMs) provide a promising training-free alternative to unsupervised approaches, their effectiveness in aerospace settings remains under-examined because of complex telemetry, misaligned evaluation metrics, and the absence of domain knowledge. To address this gap, we introduce ATSADBench, the first benchmark for aerospace TSAD. ATSADBench comprises nine tasks that combine three pattern-wise anomaly types, univariate and multivariate signals, and both in-loop and out-of-loop feedback scenarios, yielding 108,000 data points. Using this benchmark, we systematically evaluate state-of-the-art open-source LLMs under two paradigms: Direct, which labels anomalies within sliding windows, and Prediction-based, which detects anomalies from prediction errors. To reflect operational needs, we reformulate evaluation at the window level and propose three user-oriented metrics: Alarm Accuracy (AA), Alarm Latency (AL), and Alarm Contiguity (AC), which quantify alarm correctness, timeliness, and credibility. We further examine two enhancement strategies, few-shot learning and retrieval-augmented generation (RAG), to inject domain knowledge. The evaluation results show that (1) LLMs perform well on univariate tasks but struggle with multivariate telemetry, (2) their AA and AC on multivariate tasks approach random guessing, (3) few-shot learning provides modest gains whereas RAG offers no significant improvement, and (4) in practice LLMs can detect true anomaly onsets yet sometimes raise false alarms, which few-shot prompting mitigates but RAG exacerbates. These findings offer guidance for future LLM-based TSAD in aerospace software.
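The window-level evaluation described above can be illustrated with a small sketch. The exact definitions of Alarm Accuracy, Alarm Latency, and Alarm Contiguity used in ATSADBench are not reproduced on this page, so every formula, function name, and window parameter below is an assumption for illustration only, not the paper's method.

```python
# Illustrative sketch only: the metric formulas below are assumptions, not the
# definitions used in ATSADBench.
from typing import List, Optional

def to_windows(labels: List[int], size: int, stride: int) -> List[int]:
    """Collapse point labels to window labels: a window is anomalous (1)
    if it contains at least one anomalous point."""
    return [int(any(labels[i:i + size]))
            for i in range(0, len(labels) - size + 1, stride)]

def alarm_accuracy(gt, pred, size=10, stride=10):
    """Assumed AA: fraction of windows whose alarm label matches ground truth."""
    gw, pw = to_windows(gt, size, stride), to_windows(pred, size, stride)
    return sum(g == p for g, p in zip(gw, pw)) / len(gw)

def alarm_latency(gt, pred) -> Optional[int]:
    """Assumed AL: steps between the first true anomaly onset and the first
    alarm at or after it (None if no alarm is ever raised)."""
    onset = next((i for i, v in enumerate(gt) if v), None)
    if onset is None:
        return None
    first = next((i for i in range(onset, len(pred)) if pred[i]), None)
    return None if first is None else first - onset

def alarm_contiguity(gt, pred) -> Optional[float]:
    """Assumed AC: longest unbroken alarm run inside the true anomaly span,
    normalized by the span length."""
    span = [i for i, v in enumerate(gt) if v]
    if not span:
        return None
    best = run = 0
    for i in range(span[0], span[-1] + 1):
        run = run + 1 if pred[i] else 0
        best = max(best, run)
    return best / len(span)

gt   = [0] * 40 + [1] * 20 + [0] * 40   # one true anomaly at t = 40..59
pred = [0] * 43 + [1] * 15 + [0] * 42   # alarm fires 3 steps late, ends early

print(alarm_accuracy(gt, pred))    # 1.0 (all 10-step windows agree)
print(alarm_latency(gt, pred))     # 3
print(alarm_contiguity(gt, pred))  # 0.75
```

Under these assumed definitions, a detector can score perfect window-level accuracy while still alarming late and intermittently, which is why timeliness (AL) and credibility (AC) are measured separately from correctness (AA).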

Mon 17 Nov

Displayed time zone: Seoul

16:00 - 17:00
16:00
10m
Talk
Human-In-The-Loop Oracle Learning for Simulation-Based Testing
NIER Track
Ben-Hau Chia Carnegie Mellon University, Eunsuk Kang Carnegie Mellon University, Christopher Steven Timperley Carnegie Mellon University
16:10
10m
Talk
Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems
NIER Track
Dany Moshkovich IBM Research, Sergey Zeltyn IBM Research
16:20
10m
Talk
Out of Distribution Detection in Self-adaptive Robots with AI-powered Digital Twins
Industry Showcase
Erblin Isaku Simula Research Laboratory, and University of Oslo (UiO), Hassan Sartaj Simula Research Laboratory, Shaukat Ali Simula Research Laboratory and Oslo Metropolitan University, Beatriz Sanguino Norwegian University of Science and Technology, Tongtong Wang Norwegian University of Science and Technology, Guoyuan Li Norwegian University of Science and Technology, Houxiang Zhang Norwegian University of Science and Technology, Thomas Peyrucain PAL Robotics
16:30
10m
Talk
Unseen Data Detection using Routing Entropy in Mixture-of-Experts for Autonomous Vehicles
NIER Track
Sang In Lee Chungnam National University, Donghwan Shin University of Sheffield, Jihun Park Chungnam National University
Pre-print
16:40
10m
Talk
Evaluating Large Language Models for Time Series Anomaly Detection in Aerospace Software
Industry Showcase
Yang Liu Beijing Institute of Control Engineering, Yixing Luo Beijing Institute of Control Engineering, Xiaofeng Li Beijing Institute of Control Engineering, Xiaogang Dong Beijing Institute of Control Engineering, Bin Gu Beijing Institute of Control Engineering, Zhi Jin Peking University
16:50
10m
Talk
Bridging Research and Practice in Simulation-based Testing of Industrial Robot Navigation Systems
Industry Showcase
Sajad Khatiri Università della Svizzera italiana and University of Bern, Francisco Eli Viña Barrientos ANYbotics AG, Maximilian Wulf ANYbotics AG, Paolo Tonella USI Lugano, Sebastiano Panichella University of Bern