What is Site Reliability Engineering (SRE)? How does it relate to DevOps?
Site Reliability Engineering (SRE) is Google's discipline for running large-scale systems by applying software engineering practices to operations.
Origin — coined by Ben Treynor at Google (2003). Predates 'DevOps' as a term.
Core concepts:
SLIs / SLOs / Error Budgets — quantitative reliability targets and how much downtime you can 'spend'.
Toil reduction — at most 50% of an SRE's time on toil; the rest on engineering automation.
Blameless postmortems — learn from incidents, don't punish.
Embracing risk — 100% reliability is the wrong goal; aim for the SLO.
SRE vs DevOps:
DevOps = cultural philosophy for delivery.
SRE = concrete implementation of that philosophy with measurable engineering practices.
Often paraphrased: 'class SRE implements DevOps'.
SRE was Google's specific implementation of DevOps before DevOps was a buzzword. Key SRE concepts: SLI/SLO/error budget, toil reduction, blameless postmortems.
Don't say 'SRE = ops' — that misses the engineering rigor.