Hiprup

What is Site Reliability Engineering (SRE)? How does it relate to DevOps?

Site Reliability Engineering (SRE) is Google's discipline for running large-scale systems by applying software engineering practices to operations.

  • Origin — coined by Ben Treynor at Google (2003). Predates 'DevOps' as a term.

  • Core concepts:

    • SLIs / SLOs / Error Budgets — quantitative reliability targets and how much downtime you can 'spend'.

    • Toil reduction — at most 50% of an SRE's time on toil; the rest on engineering automation.

    • Blameless postmortems — learn from incidents, don't punish.

    • Embracing risk — 100% reliability is the wrong goal; aim for the SLO.

SRE vs DevOps:

  • DevOps = cultural philosophy for delivery.

  • SRE = concrete implementation of that philosophy with measurable engineering practices.

  • Often paraphrased: 'class SRE implements DevOps'.

SRE was Google's specific implementation of DevOps before DevOps was a buzzword. Key SRE concepts: SLI/SLO/error budget, toil reduction, blameless postmortems.

Don't say 'SRE = ops' — that misses the engineering rigor.

What is Site Reliability Engineering (SRE)? How does it relate to DevOps? | Hiprup