
Sleep, Interrupted: Niall Richard Murphy on Taking the Emergency Out of On-Call

The Morning Mind-Meld
By Jaime Woo and Emil Stolarsky • Issue #1
In a game of word association, the first word many would associate with the feeling of being on-call is, apparently, “sucks.” The word pops up in presentations, in forum posts, and even as branding for VictorOps. Given how many developers and operators regularly go on-call, coupled with the perceived value of being on-call, you can’t help but wonder: does on-call have to suck?
That question has led to many strategies for making on-call less painful, notably from Charity Majors, Cindy Sridharan, and most recently Molly Struve. Their advice is helpful, and necessary reading for anyone wanting to improve on-call processes: reframe on-call as a learning experience, understand on-call not as inevitable but as a proxy for engineering culture, and think intentionally about fitting the role of on-call to the specific needs of your company. And yet even as on-call improves, one wonders whether the logical endpoint of making on-call not suck is to not have on-call at all. Enter Niall Richard Murphy’s “Against On-Call: A Polemic.”
His polemic begins with an anecdote about Colossus, the code-breaking machines used at Bletchley Park in the British war effort against the Germans. If something went wrong in the middle of the night, cryptologist Irving John Good would be summoned to attend to and debug the problem. B. Jack Copeland’s book Colossus: The Secrets of Bletchley Park’s Codebreaking Computers describes the episode:
His method of search was instructive, being wholly abstract. That is, it was conducted solely within an evolving mental model. Others streamed to and fro, bringing him answers to carefully posed questions and setting out on missions to confirm or refute specific further predictions, as the logical map, invisible of course to the onlooker, was systematically shrunk to a point. This announced and the item duly found, Jack was conveyed back to his interrupted sleep.
Murphy sees a parallel in his experience of on-call. Yes, the technology may have become more sophisticated, but seventy-five years later on-call still means asking someone to do “tree traversal mechanisms in their head and try to pinpoint a solution in their pajamas.” That frustration leads him to ask if there’s a systemic approach to changing on-call.
He finds inspiration in how on-call is practised in hospitals. The on-call that many SREs would recognize parallels what happens in the accident & emergency department (commonly called the ER in North America), notes Murphy, where on-call doctors perform interrupt-driven work, triage a situation (assessing whether or not the issue at hand should be dealt with then and there), and stabilize patients to the point where they can be moved to ward medicine for further treatment.
(For more on this comparison between developers and doctors, check out “What Medicine Can Teach Us About Being On-Call” by Daniel Turner, where he shares his wife’s experiences on-call as a resident, and the utility of checklists.) 
Emergency medicine is expensive, taking a toll on those who perform it, but the life-and-death stakes obviously justify the efforts. The critical nature underlines why it’s exciting for developers to imagine being on-call as akin to being an ER doctor, a career portrayed as meaningful and glamorous in television shows like Grey’s Anatomy. On-call takes on a romanticized, badge-of-honour quality. But is something like emergency medicine, meant as a last resort after some form of catastrophe has struck, the best comparison for SREs on-call? 
Murphy thinks a shift is in order. His vision for an improved on-call experience is that, instead of emergency medicine, on-call should look more like ward medicine, which he describes in his chapter from Seeking SRE as serving “to manage a cure, the treatment of slow-decline, or otherwise non-life-threatening, non-immediate situations.” As alluring as this idea is, there’s almost a counter-intuitive quality to the suggestion: the point of on-call is to tackle urgent, important situations, so how can it be transformed into ward medicine?
The key is figuring out why urgent, important situations are occurring: are they a natural outcome of development work, or are they preventable? For Murphy, many situations are the result of kicking decisions down the line until they gather emergency-like qualities, rather than investing upfront into the work needed to reduce the likelihood of disruptions and outages. He writes:
Think about the set of postmortems you’ve assembled for your service over the years. When you look at the set of root causes and contributing factors over a long enough period, you can ask yourself the questions: what proportion of those outages were genuinely unforeseeable in advance, and what proportion of them would have been remediated if fairly simple protections had been put more consistently in place?
The concept of simple protections evokes Tanya Reilly’s “The History of Fire Escapes,” which draws comparisons to another emergency-based profession: firefighting. Reilly outlines the history of fire escapes, and how they, along with firefighting, are efforts to reduce damage and prevent tragedy. No one would conceive of these two tools as constituting a complete solution to the problem of fires: the creation of widely-distributed and commonly-followed fire codes for buildings was an attempt to intentionally and proactively reduce emergencies. The number and severity of fires decreased as buildings had to meet a consistent standard, and Reilly cautions that the software development industry can be too focused on firefighting rather than (ahem) code-building.
Yet computer systems are neither human beings nor buildings: software is custom-made and changes quickly, and that relentless pace of software development has made understanding our systems more difficult, even in times of relative healthiness. It’s hard to imagine anyone wanting to visit a hospital or live in a building that follows a “move fast and break things” mentality, and as software eats the world the consequences of failing software become more devastating. The lessons drawn from Murphy and Reilly point to how common knowledge and practices, paired with common tools and components, would reduce the complexity of systems and allow for more problems to be foreseeable. Then, perhaps, on-call could move closer to ward medicine instead of emergency medicine, or to fire prevention rather than fire escapes.

Special thanks to Andrew Louis and Jake Pittis for gut-checks.

This is an Incident Labs project, with new issues every two weeks. We’re interested in figuring out the best practices for incident management for software companies. We also produce the Post-Incident Review, a zine focused on outages. If you use PagerDuty and Slack, our software project Ovvy simplifies scheduling and overrides, and is currently in private beta and free to use.

Life is busy, and there are always those conference talks or long-form articles you should read but never seem to find the time.


The Morning Mind-Meld is a chance to build context between the conversations happening around DevOps and SRE, and to hopefully create some inspiration, even—or, especially!—during a hectic week.
