"Making Our Life Less Monotonous" or "Just Tick Things Off": An Exploratory Multi-Method Study of Toil
Google introduced Site Reliability Engineering (SRE) as a DevOps approach focused on service reliability. A key concept in SRE is toil, defined as manual, repetitive, automatable, and reactive tasks that arise in production service management, scale with service growth, and provide no lasting value. As SRE adoption expands beyond Google, organisations adapt SRE practices and terminology, leading to inconsistencies in how toil is defined and perceived. Understanding toil is crucial, as it is assumed to directly affect both operational efficiency and the well-being of engineers, ultimately influencing the reliability and scalability of production services. In this study, we examine toil in light of Google’s claim that “less toil is better”. We explore how different developers define toil, how they perceive its impact, and whether they see reducing toil as beneficial. To this end, we analyse grey literature for a broad, practical perspective on toil and its effects, and conduct semi-structured interviews at the Company to triangulate the findings from that analysis. We observe that while Google’s toil definition has been widely adopted, it has been expanded to include additional technical attributes (e.g., routine, time-consuming, persistent) and human aspects (e.g., mundane, not just unwanted work). Practitioners generally view toil negatively, citing its impact on organisations’ growth, teams’ efficiency, and morale. However, while toil elimination is generally praised, we also observe resistance stemming from concerns over cost, complexity, and the possible negative impact that elimination can have on employees who work predominantly on toil.