SEAMS 2024
Mon 15 - Tue 16 April 2024 Lisbon, Portugal
co-located with ICSE 2024

Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, ensuring fault tolerance and self-recovery in such scenarios remains challenging. This paper proposes integrating the Raft consensus protocol into FL to address these challenges. Our approach utilises Raft’s leader election and log replication mechanisms to enable automatic stateful recovery after failures and thus improve fault tolerance. The log replication process efficiently maintains consistency and coherence across distributed FL nodes, ensuring an uninterrupted training process and model convergence. In this way, a consistent system state is replicated across all nodes, which mitigates the impact of aggregator failures and facilitates recovery from discrepancies. This enhances the robustness of the overall FL system, especially in dynamic and unreliable cyber-physical conditions. To demonstrate the viability of our approach, we present a proof-of-concept implementation based on the existing FL framework Flower. We also conduct a series of experiments to measure the aggregator election time and the traffic overheads associated with state replication. Although the network traffic overheads grow, as expected, with the number of FL nodes, the results demonstrate a resilient, self-recovering system capable of withstanding node failures while maintaining model consistency.
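The recovery idea in the abstract can be illustrated with a minimal sketch. This is not the authors' Flower-based implementation: the `Node` class and the `replicate` and `elect` functions are hypothetical names, the election is a deterministic stand-in for Raft's randomised voting, and the whole log is copied on every update rather than using Raft's incremental AppendEntries exchange. The sketch only shows how replicating the aggregated model state lets a newly elected aggregator resume from the last committed round.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical FL node holding a replicated log of global-model states."""
    node_id: int
    alive: bool = True
    term: int = 0
    log: list = field(default_factory=list)  # entries: (term, fl_round, weights)


def replicate(leader, nodes, fl_round, weights):
    """Append an aggregated model to the leader's log and copy it to followers.

    Returns True if a majority of the cluster stores the entry (committed).
    """
    leader.log.append((leader.term, fl_round, weights))
    acks = 1  # the leader counts itself
    for n in nodes:
        if n is not leader and n.alive:
            n.term = leader.term
            n.log = list(leader.log)  # simplification: ship the whole log
            acks += 1
    return acks > len(nodes) // 2


def elect(nodes):
    """Pick a new aggregator among live nodes, or None without a majority.

    Assumption: instead of randomised timeouts and vote requests, the live
    node with the most up-to-date log (last entry term, then log length)
    wins, and ties go to the lowest node id.
    """
    live = [n for n in nodes if n.alive]
    if len(live) <= len(nodes) // 2:
        return None

    def up_to_date(n):
        last_term = n.log[-1][0] if n.log else 0
        return (last_term, len(n.log), -n.node_id)

    winner = max(live, key=up_to_date)
    new_term = max(n.term for n in live) + 1
    for n in live:
        n.term = new_term  # every live node moves to the new term
    return winner
```

In a five-node cluster, calling `elect`, replicating one round with `replicate`, marking the aggregator as failed, and calling `elect` again yields a new aggregator whose log already contains the last replicated round, so training can continue from that state instead of restarting.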

Tue 16 Apr

Displayed time zone: Lisbon

11:00 - 12:30
Session 6: Self-Recovery & Evaluation (Research Track / Artifact Track) at Luis de Freitas Branco
Chair(s): Dalal Alrajeh Imperial College London
11:00
25m
Talk
Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning [FULL]
Research Track
Rustem Dautov SINTEF, Erik Johannes Husom SINTEF Digital
11:25
25m
Talk
Integrating Graceful Degradation and Recovery through Requirement-driven Adaptation [FULL]
Research Track
Simon Chu Carnegie Mellon University, Justin Koe The Cooper Union, David Garlan Carnegie Mellon University, Eunsuk Kang Carnegie Mellon University
11:50
25m
Talk
Learning Recovery Strategies for Dynamic Self-healing in Reactive Systems [FULL]
Research Track
Mateo Sanabria Universidad de los Andes, Ivana Dusparic Trinity College Dublin, Ireland, Nicolás Cardozo Universidad de los Andes
12:15
15m
Talk
SWITCH: An Exemplar for Evaluating Self-Adaptive ML-Enabled Systems [ARTIFACT]
Artifact Track
Arya Marda IIIT Hyderabad, Shubham Kulkarni IIIT Hyderabad, Karthik Vaidhyanathan IIIT Hyderabad