F3: Fault Forecasting Framework for Cloud Systems
In recent years, cloud systems (e.g., Microsoft Azure) have grown explosively, and a wide variety of software services have been deployed on them. Because cloud systems must serve customers on a 24/7 basis, high service reliability is essential. To reduce the number of faults in cloud systems, many machine-learning-based fault forecasting methods have been proposed. These methods aim to predict faults in advance so that proactive actions can be taken to avoid negative impact, and each mainly focuses on a specific hardware component (e.g., disk, memory, or node). In cloud systems, many fault forecasting tasks share similar characteristics: 1) they are based on temporal monitoring data, and 2) they usually suffer from similar challenges (e.g., extreme data imbalance). In this work, we present F3, a unified fault forecasting framework for cloud systems. In particular, F3 introduces an end-to-end pipeline for a variety of fault forecasting tasks in cloud systems; the pipeline consists of several critical parts (e.g., data processing, fault forecasting, prediction result interpretation, and action decision). As a result, when a new fault forecasting task arrives, F3 can be adapted to handle it easily and effectively. F3 is also able to overcome challenges including extreme data imbalance, data inconsistency between online and offline environments, and model overfitting. F3 has been successfully applied to Microsoft Azure and has helped significantly reduce the number of virtual machine interruptions.
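To make the four pipeline stages named in the abstract concrete, the following is a minimal, purely illustrative sketch in Python. It is not F3's implementation: the function names, the toy linear scorer, the feature set, the threshold, and the `migrate_vms` action label are all assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class Prediction:
    node_id: str
    score: float      # toy fault likelihood, clipped to [0, 1]
    top_signal: str   # feature with the largest contribution (interpretation)
    action: str       # hypothetical proactive action label


def window_features(series: Sequence[float], window: int) -> Dict[str, float]:
    """Data processing: summarize the last `window` samples of one metric."""
    recent = list(series[-window:])
    return {
        "mean": sum(recent) / len(recent),
        "max": max(recent),
        "trend": recent[-1] - recent[0],
    }


def forecast(features: Dict[str, float],
             weights: Dict[str, float]) -> Tuple[float, str]:
    """Fault forecasting plus interpretation: a toy linear scorer (assumed,
    not F3's model) that also reports the most influential feature."""
    contributions = {k: weights.get(k, 0.0) * v for k, v in features.items()}
    score = min(1.0, max(0.0, sum(contributions.values())))
    top_signal = max(contributions, key=contributions.get)
    return score, top_signal


def decide(score: float, threshold: float) -> str:
    """Action decision: map the score to a hypothetical action label."""
    return "migrate_vms" if score >= threshold else "no_action"


def run_pipeline(node_id: str, series: Sequence[float],
                 weights: Dict[str, float],
                 window: int = 4, threshold: float = 0.5) -> Prediction:
    """Chain the four stages end to end for a single node."""
    features = window_features(series, window)
    score, top_signal = forecast(features, weights)
    return Prediction(node_id, score, top_signal, decide(score, threshold))


if __name__ == "__main__":
    # Hypothetical weights and monitoring samples, for demonstration only.
    weights = {"mean": 0.2, "max": 0.3, "trend": 0.5}
    print(run_pipeline("node-a", [0.1, 0.1, 0.2, 0.4, 0.6, 0.9], weights))
    print(run_pipeline("node-b", [0.1] * 6, weights))
```

Keeping each stage as a separate function mirrors the idea that a new forecasting task could reuse the same pipeline while swapping only the stage that differs, for example the feature extraction for a different kind of monitoring data.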
Sat 29 May (displayed time zone: Amsterdam, Berlin, Bern, Rome, Stockholm, Vienna)
16:30 - 17:20 | Project Showcase Session | CloudIntelligence 2021 at CloudIntelligence Room | Chair(s): Yingnong Dang (Microsoft, USA)
16:30 | 12m | Demonstration | Building a Secured Data Intelligence Platform | Conan Yang (Salesforce)
16:42 | 12m | Demonstration | Infusing ML into VM Provisioning in Cloud | Chuan Luo (Microsoft Research, China), Randolph Yao (Microsoft, USA), Bo Qiao (Microsoft Research, Beijing, China), Qingwei Lin (Microsoft Research, Beijing, China), Tri M. Tran (Microsoft Azure), Gil Shafriri (Microsoft Azure), Yingnong Dang (Microsoft, USA), Raphael Ghelman (Microsoft Azure), Pulak Goyal (Microsoft Azure), Eli Cortez (Microsoft Azure), Daud Howlader (Microsoft Azure), Sushant Rewaskar (Microsoft Azure), Murali Chintalapati (Microsoft Azure), Dongmei Zhang (Microsoft Research)
16:55 | 12m | Demonstration | F3: Fault Forecasting Framework for Cloud Systems | Chuan Luo (Microsoft Research, China), Pu Zhao (Microsoft Research, Beijing, China), Bo Qiao (Microsoft Research, Beijing, China), Youjiang Wu (Microsoft, USA), Yingnong Dang (Microsoft, USA), Murali Chintalapati (Microsoft Azure), Susy Yi (Microsoft 365), Paul Wang (Microsoft 365), Andrew Zhou (Microsoft 365), Saravanakumar Rajmohan (Microsoft Office, United States), Qingwei Lin (Microsoft Research, Beijing, China), Dongmei Zhang (Microsoft Research)
17:07 | 12m | Demonstration | SEAT: statistically sound infra-side deployment and integration testing | Nutcha Temiyasathit (Facebook), Tao Yang (Facebook), Karan Luthra (Facebook), Nick Ruff (Facebook), Petar Zuljevic (Facebook), Ethan Benowitz (Facebook), Boris Baracaldo (Facebook), Oytun Eskiyenenturk (Facebook), Xin Fu (Facebook)