Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning [FULL]
Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, ensuring fault tolerance and self-recovery in such scenarios remains challenging. This paper proposes integrating the Raft consensus protocol into FL to address these challenges. Our approach utilises Raft’s leader election and log replication mechanisms to enable automatic stateful recovery after failures and thus improve fault tolerance. The log replication process efficiently maintains consistency and coherence across distributed FL nodes, supporting an uninterrupted training process and model convergence. Because a consistent system state is replicated across all nodes, the impact of aggregator failures is mitigated and recovery from discrepancies is easier. This enhances the robustness of the overall FL system, especially in dynamic and unreliable cyber-physical conditions. To demonstrate the viability of our approach, we present a proof-of-concept implementation based on the existing FL framework Flower. We also conduct a series of experiments to measure the aggregator election time and the traffic overheads associated with state replication. Although network traffic overheads grow, as expected, with the number of FL nodes, the results demonstrate a resilient self-recovering system capable of withstanding node failures while maintaining model consistency.
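The recovery idea described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' Flower-based implementation: the names `Node`, `Entry`, `replicate`, and `elect` are hypothetical, networking is replaced by in-memory calls, and the randomised election timeouts of real Raft are replaced by a deterministic choice of the live node with the most up-to-date log. Under those assumptions, it shows how a replicated log lets a newly elected aggregator resume from the last committed training round.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Entry:
    term: int             # term of the leader that created the entry
    round_id: int         # FL training round the entry belongs to
    weights: List[float]  # aggregated model state for that round


@dataclass
class Node:
    node_id: int
    alive: bool = True
    term: int = 0
    log: List[Entry] = field(default_factory=list)

    def last_log_key(self) -> Tuple[int, int]:
        # Raft compares (last entry term, log length) to decide
        # which of two logs is more up to date.
        if not self.log:
            return (0, 0)
        return (self.log[-1].term, len(self.log))


def replicate(leader: Node, nodes: List[Node], round_id: int,
              weights: List[float]) -> bool:
    """Append an entry on the leader and copy the log to live followers.

    Returns True if a majority of all nodes now stores the entry.
    Simplification (assumption): followers take a full copy of the
    leader's log instead of receiving incremental AppendEntries calls.
    """
    leader.log.append(Entry(leader.term, round_id, list(weights)))
    acks = 1  # the leader itself
    for n in nodes:
        if n is not leader and n.alive:
            n.log = list(leader.log)
            n.term = leader.term
            acks += 1
    return acks > len(nodes) // 2


def elect(nodes: List[Node]) -> Optional[Node]:
    """Pick a new aggregator among live nodes, or None without a quorum.

    Simplification (assumption): the candidate is chosen deterministically
    (most up-to-date log, lowest id on ties) rather than by timeouts.
    """
    live = [n for n in nodes if n.alive]
    if len(live) <= len(nodes) // 2:
        return None
    candidate = max(live, key=lambda n: (n.last_log_key(), -n.node_id))
    # A voter grants its vote only if the candidate's log is at least
    # as up to date as its own.
    votes = sum(1 for n in live
                if candidate.last_log_key() >= n.last_log_key())
    if votes <= len(nodes) // 2:
        return None
    new_term = max(n.term for n in live) + 1
    for n in live:
        n.term = new_term
    return candidate


# Demo: five FL nodes, node 0 starts as aggregator in term 1.
nodes = [Node(i, term=1) for i in range(5)]
leader = nodes[0]
committed_1 = replicate(leader, nodes, 1, [0.10, 0.20])
nodes[4].alive = False                 # a follower drops out
committed_2 = replicate(leader, nodes, 2, [0.15, 0.25])
leader.alive = False                   # the aggregator fails
new_leader = elect(nodes)
recovered = new_leader.log[-1] if new_leader else None
```

In this sketch, `recovered` holds the round and model state from which the new aggregator would continue training. With three of the five nodes still alive, a majority remains and the election succeeds; if one more node failed, `elect` would return `None`.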
Tue 16 Apr (displayed time zone: Lisbon)
11:00 - 12:30 | Session 6: Self-Recovery & Evaluation | Research Track / Artifact Track | Room: Luis de Freitas Branco | Chair(s): Dalal Alrajeh (Imperial College London)
11:00 | 25m | Talk | Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning [FULL] | Research Track
11:25 | 25m | Talk | Integrating Graceful Degradation and Recovery through Requirement-driven Adaptation [FULL] | Research Track | Simon Chu (Carnegie Mellon University), Justin Koe (The Cooper Union), David Garlan (Carnegie Mellon University), Eunsuk Kang (Carnegie Mellon University)
11:50 | 25m | Talk | Learning Recovery Strategies for Dynamic Self-healing in Reactive Systems [FULL] | Research Track | Mateo Sanabria (Universidad de los Andes), Ivana Dusparic (Trinity College Dublin, Ireland), Nicolás Cardozo (Universidad de los Andes)
12:15 | 15m | Talk | SWITCH: An Exemplar for Evaluating Self-Adaptive ML-Enabled Systems [ARTIFACT] | Artifact Track