Processes crash or may need to be restarted. Hard drives fail. Natural disasters can take out several data centers in a region. Site Reliability Engineers need to anticipate these sorts of failures and develop strategies to keep systems running in spite of them.
This usually means running systems across multiple sites, which in turn forces you to make tradeoffs between the availability and the consistency of your system state.
This talk explores distributed consensus algorithms, such as Raft and Paxos, in production: how they work, how they perform, what can go wrong, and when to use them and when not to.
Laura Nolan has been a Site Reliability Engineer at Google for four years, working on large data infrastructure projects and, most recently, networking. Her background is in software engineering and computer science. She wrote the ‘Managing Critical State’ chapter in the O'Reilly SRE book and is co-chair of SRECon EMEA 2017.