ExpertCache: GPU-Efficient MoE Inference through Reinforcement Learning-Guided Expert Selection
Mixture-of-Experts (MoE) architectures have emerged as a promising approach for scaling large language models while maintaining computational efficiency. However, inference in these models remains challenging due to the substantial GPU memory required to store all expert parameters. In this paper, we introduce ExpertCache, a novel two-phase reinforcement learning framework that optimizes both which experts to load into GPU memory and which loaded experts to activate during inference. Our approach consists of a pre-loading controller that selects a task-specific subset of experts to cache in GPU memory and a runtime controller that dynamically activates the most relevant cached experts for each token. Both controllers are optimized through reinforcement learning with carefully designed reward functions that balance model quality, computational efficiency, and expert utilization. We evaluate ExpertCache on Qwen3-235B-A22B using BigCodeBench, demonstrating that our approach reduces GPU memory requirements by up to 85% while achieving performance superior to loading all experts. Our method enables deployment of large MoE models on consumer-grade hardware and significantly improves inference throughput in production environments. ExpertCache outperforms current expert-selection methods in both memory efficiency and computational performance, establishing a new state of the art for efficient MoE inference.
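To make the two-phase design concrete, the following is a minimal sketch of how a pre-loading controller and a runtime controller could interact. Everything here is illustrative rather than taken from the paper: the class name ExpertCacheController, the softmax pre-loading policy, the REINFORCE-style update, and the linear reward weights are assumptions; the actual policies, reward functions, and training procedure in ExpertCache may differ.

```python
# Illustrative sketch only: class name, policy form, reward weights, and the
# REINFORCE-style update are assumptions, not the paper's actual method.
import numpy as np

class ExpertCacheController:
    def __init__(self, num_experts: int, cache_budget: int, top_k: int, seed: int = 0):
        self.num_experts = num_experts    # total experts in the MoE layer
        self.cache_budget = cache_budget  # how many experts fit in GPU memory
        self.top_k = top_k                # experts activated per token
        self.rng = np.random.default_rng(seed)
        # Learnable preference score per expert for the pre-loading policy.
        self.preload_logits = np.zeros(num_experts)

    def preload(self) -> np.ndarray:
        """Phase 1: sample a task-specific subset of experts to cache in GPU memory."""
        probs = np.exp(self.preload_logits - self.preload_logits.max())
        probs /= probs.sum()
        return self.rng.choice(self.num_experts, size=self.cache_budget,
                               replace=False, p=probs)

    def activate(self, router_scores: np.ndarray, cached: np.ndarray) -> np.ndarray:
        """Phase 2: activate the top-k most relevant experts among those cached."""
        masked = np.full(self.num_experts, -np.inf)
        masked[cached] = router_scores[cached]
        return np.argsort(masked)[-self.top_k:]

    def reward(self, quality: float, latency: float, utilization: float,
               w_q: float = 1.0, w_l: float = 0.5, w_u: float = 0.2) -> float:
        """Scalar reward trading off quality, efficiency, and expert utilization
        (the weights are placeholders)."""
        return w_q * quality - w_l * latency + w_u * utilization

    def update(self, cached: np.ndarray, reward: float, lr: float = 0.1) -> None:
        """REINFORCE-style update: nudge preload logits of the cached subset up or
        down in proportion to the reward, with a mean-centering baseline."""
        indicator = np.zeros(self.num_experts)
        indicator[cached] = 1.0
        baseline = self.cache_budget / self.num_experts
        self.preload_logits += lr * reward * (indicator - baseline)

# Usage: cache 16 of 128 experts, then activate 4 per token from the cache.
ctrl = ExpertCacheController(num_experts=128, cache_budget=16, top_k=4)
cached = ctrl.preload()
router_scores = np.random.default_rng(1).standard_normal(128)  # stand-in router output
active = ctrl.activate(router_scores, cached)
ctrl.update(cached, ctrl.reward(quality=0.8, latency=0.3, utilization=0.6))
```

The point the sketch illustrates is the separation of concerns: the pre-loading policy amortizes expert-transfer cost across a whole task, while per-token activation stays cheap because it only ranks experts already resident in GPU memory.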
Wed 10 Sep (times shown in the Auckland/Wellington time zone)
10:30 - 12:00 | Session 2 - Quality Assurance 1 (Tool Demonstration / Research Papers / Industry / NIER / Journal First tracks), Case Room 2 260-057. Chair(s): Coen De Roover (Vrije Universiteit Brussel)
10:30 (15m) | A Jump-Table-Agnostic Switch Recovery on ASTs. Research Papers Track.
10:45 (15m) | Quantization Is Not a Dealbreaker: Empirical Insights from Large Code Models. Research Papers Track. Saima Afrin (William & Mary), Antonio Mastropaolo (William & Mary, USA), Bowen Xu (North Carolina State University). Pre-print available.
11:00 (10m) | AI-Powered Commit Explorer (APCE). Tool Demonstration Track. Yousab Grees (Belmont University), Polina Iaremchuk (Belmont University), Ramtin Ehsani (Drexel University), Esteban Parra Rodriguez (Belmont University), Preetha Chatterjee (Drexel University, USA), Sonia Haiduc (Florida State University). Pre-print available.
11:10 (10m) | JDala - A Simple Capability System for Java. Tool Demonstration Track. Quinten Smit (Victoria University of Wellington), Jens Dietrich (Victoria University of Wellington), Michael Homer (Victoria University of Wellington), Andrew Fawcet (Victoria University of Wellington), James Noble (Independent, Wellington, NZ).
11:20 (10m) | ExpertCache: GPU-Efficient MoE Inference through Reinforcement Learning-Guided Expert Selection. NIER Track. Xunzhu Tang (University of Luxembourg), Tiezhu Sun (University of Luxembourg), Yewei Song (University of Luxembourg), Siyuan Ma, Jacques Klein (University of Luxembourg), Tegawendé F. Bissyandé (University of Luxembourg).
11:30 (15m) | Efficient Detection of Intermittent Job Failures Using Few-Shot Learning. Industry Track. Henri Aïdasso (École de technologie supérieure (ÉTS)), Francis Bordeleau (École de technologie supérieure (ÉTS)), Ali Tizghadam (TELUS). Pre-print available.
11:45 (15m) | LogOW: A Semi-Supervised Log Anomaly Detection Model in Open-World Setting. Journal First Track. Jingwei Ye (Nankai University), Chunbo Liu (Civil Aviation University of China), Zhaojun Gu (Civil Aviation University of China), Zhikai Zhang (Civil Aviation University of China), Xuying Meng (Institute of Computing Technology, Chinese Academy of Sciences), Weiyao Zhang (Institute of Computing Technology, Chinese Academy of Sciences), Yujun Zhang (Institute of Computing Technology, Chinese Academy of Sciences).