Raft Protocol for Fault Tolerance and Self-Recovery in Federated Learning (SEAMS 2024 - Research Track)

Who

Rustem Dautov, Erik Johannes Husom

Track

SEAMS 2024 Research Track

When

Tue 16 Apr 2024 11:00 - 11:25 at Luis de Freitas Branco - Session 6: Self-Recovery & Evaluation. Chair(s): Dalal Alrajeh

Abstract

Federated Learning (FL) has emerged as a decentralised machine learning paradigm for distributed systems, particularly in edge and IoT environments. However, ensuring fault tolerance and self-recovery in such scenarios remains challenging. This paper proposes integrating the Raft consensus protocol into FL to address these challenges. Our approach utilises Raft’s leader election and log replication mechanisms to enable automatic stateful recovery after failures and thus improve fault tolerance. The log replication process efficiently maintains consistency and coherence across distributed FL nodes, ensuring an uninterrupted training process and model convergence. In this way, a consistent system state is replicated across all nodes, which mitigates the impact of aggregator failures and facilitates recovery from discrepancies. This enhances the robustness of the overall FL system, especially in dynamic and unreliable cyber-physical conditions. To demonstrate the viability of our approach, we present a proof-of-concept implementation based on the existing FL framework Flower. We also conduct a series of experiments to measure the aggregator election time and the traffic overheads associated with state replication. Although network traffic overheads grow with the number of FL nodes, as expected, the results demonstrate a resilient, self-recovering system capable of withstanding node failures while maintaining model consistency.
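
The idea described in the abstract can be illustrated with a small, single-process sketch. It is not the authors' implementation and does not use Flower's API. All names (`Node`, `LogEntry`, `elect`, `replicate`, `fed_avg`) are hypothetical, and the Raft logic is heavily simplified: there are no timeouts, no networking, and no log-consistency checks on append. It shows only the two mechanisms the abstract names: a majority-vote election of the aggregator, and replication of the aggregated model to every live node so that a newly elected aggregator can resume from the last replicated state.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LogEntry:
    term: int             # election term in which the entry was created
    round_id: int         # FL training round the entry belongs to
    weights: List[float]  # aggregated global model after that round


@dataclass
class Node:
    node_id: int
    alive: bool = True
    current_term: int = 0
    voted_for: Optional[int] = None
    log: List[LogEntry] = field(default_factory=list)

    def last_log_key(self) -> Tuple[int, int]:
        # (term of last entry, log length): used to compare how up to date logs are.
        return (self.log[-1].term, len(self.log)) if self.log else (0, 0)

    def request_vote(self, term: int, candidate_id: int,
                     candidate_key: Tuple[int, int]) -> bool:
        if not self.alive or term < self.current_term:
            return False
        if term > self.current_term:
            # A newer term resets any vote cast in an older term.
            self.current_term, self.voted_for = term, None
        up_to_date = candidate_key >= self.last_log_key()
        if up_to_date and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False


def elect(candidate: Node, cluster: List[Node]) -> bool:
    """Candidate starts a new term and wins with a strict majority of votes."""
    term = candidate.current_term + 1
    key = candidate.last_log_key()
    votes = sum(n.request_vote(term, candidate.node_id, key) for n in cluster)
    return votes > len(cluster) // 2


def fed_avg(client_weights: List[List[float]]) -> List[float]:
    """Unweighted element-wise mean of client model vectors."""
    return [sum(col) / len(col) for col in zip(*client_weights)]


def replicate(leader: Node, cluster: List[Node], round_id: int,
              weights: List[float]) -> bool:
    """Append the new global model to every live node's log.

    Simplification: real Raft verifies the previous log entry before appending.
    Returns True if a majority of the cluster stored the entry (committed).
    """
    entry = LogEntry(leader.current_term, round_id, weights)
    acks = 0
    for n in cluster:
        if n.alive:
            n.log.append(entry)
            acks += 1
    return acks > len(cluster) // 2


# Demo: five nodes, node 0 becomes aggregator, then fails after round 1.
cluster = [Node(i) for i in range(5)]
assert elect(cluster[0], cluster)
replicate(cluster[0], cluster, round_id=1,
          weights=fed_avg([[1.0, 2.0], [3.0, 4.0]]))
cluster[0].alive = False                 # aggregator crashes
assert elect(cluster[1], cluster)        # node 1 takes over with 4 of 5 votes
recovered = cluster[1].log[-1].weights   # training resumes from replicated state
```

In this sketch the new aggregator recovers the global model from its own copy of the log rather than restarting training. A node whose log is behind cannot win the election, which is the Raft property that keeps the recovered state consistent.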

Rustem Dautov

SINTEF

Erik Johannes Husom

SINTEF Digital