Performance Optimization of HPC Workloads in Cloud Using AI-Driven Algorithms (APLAS 2025 - The 23rd Asian Symposium on Programming Languages and Systems)

Who

Aman Iftekhar, Rahul Mishra

Track

APLAS 2025 Research Papers

When

Wed 29 Oct 2025 14:30 - 15:00 (GMT+05:30) at APLAS room - AI and Compiler Optimisation for Performance
Chair(s): Meenakshi D'Souza

Abstract

As High-Performance Computing (HPC) workloads increasingly migrate to cloud infrastructures, the need for intelligent, real-time scheduling becomes critical. Conventional schedulers such as first-come, first-served (FCFS) or shortest-job-first (SJF) often struggle to adapt to the heterogeneous and dynamic nature of cloud-based systems, leading to inefficient resource utilization and increased job wait times. This paper proposes a unified artificial intelligence (AI)-based framework to address these challenges through the integration of three core capabilities: job runtime prediction using supervised learning, anomaly detection via deep autoencoders, and adaptive resource scheduling using reinforcement learning. Leveraging real-world data from the MIT SuperCloud dataset, which contains over 2 TB of CPU and GPU performance traces, our system extracts meaningful patterns from time-series telemetry to support informed scheduling decisions. The job prediction module estimates runtimes from CPU utilization, memory consumption, and I/O statistics. The anomaly detection module flags resource-wasting or abnormal jobs using learned GPU performance norms. The reinforcement learning scheduler dynamically matches jobs to compute nodes based on predicted duration and anomaly status, optimizing for turnaround time and utilization. Experimental evaluations demonstrate a 28% reduction in average turnaround time and an increase of over 10% in resource utilization compared to traditional schedulers. These results establish the viability of AI-driven orchestration strategies in HPC cloud platforms and underscore the importance of integrated learning-based systems in achieving scalable, efficient, and context-aware workload management.
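
To illustrate why predicted runtime and anomaly status are useful scheduling inputs, the following is a minimal sketch and not the authors' implementation. It compares FCFS ordering with a simple greedy ordering on a single node, with all jobs arriving at time zero. The greedy rule stands in for the reinforcement learning scheduler, which the abstract does not specify. The `Job` fields, the function names, and every number are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    predicted_runtime: float  # hypothetical output of a runtime-prediction model
    actual_runtime: float     # time the job really takes once it runs
    anomalous: bool = False   # hypothetical flag from an anomaly detector


def average_turnaround(ordered_jobs):
    """Average completion time on one node when all jobs arrive at t=0."""
    clock = 0.0
    total = 0.0
    for job in ordered_jobs:
        clock += job.actual_runtime  # job finishes at the current clock value
        total += clock
    return total / len(ordered_jobs)


def fcfs_order(jobs):
    """First-come, first-served: keep submission order."""
    return list(jobs)


def prediction_aware_order(jobs):
    """Greedy stand-in for a learned scheduler (an assumption, not the paper's
    method): run non-anomalous jobs first, shortest predicted runtime first."""
    return sorted(jobs, key=lambda j: (j.anomalous, j.predicted_runtime))


# Made-up workload; job "d" is flagged as anomalous and badly under-predicted.
jobs = [
    Job("a", predicted_runtime=9.0, actual_runtime=10.0),
    Job("b", predicted_runtime=2.5, actual_runtime=2.0),
    Job("c", predicted_runtime=1.0, actual_runtime=1.0),
    Job("d", predicted_runtime=3.0, actual_runtime=30.0, anomalous=True),
]

fcfs_avg = average_turnaround(fcfs_order(jobs))
aware_avg = average_turnaround(prediction_aware_order(jobs))
print(f"FCFS average turnaround: {fcfs_avg}")               # 19.5
print(f"Prediction-aware average turnaround: {aware_avg}")  # 15.0
```

In this toy workload, deferring the flagged job and running short predicted jobs first lowers the average turnaround from 19.5 to 15.0 time units. A learned policy would additionally have to handle multiple heterogeneous nodes, staggered arrivals, and prediction error, none of which this sketch models.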

Aman Iftekhar

IIT Patna

Rahul Mishra