What is Site Reliability Engineering (SRE)? How does it relate to DevOps?

Question

Accepted Answer

Site Reliability Engineering (SRE) is Google's discipline for running large-scale systems by applying software engineering practices to operations . Origin — coined by Ben Treynor at Google (2003). Predates 'DevOps' as a term. Core concepts: SLIs / SLOs / Error Budgets — quantitative reliability targets and how much downtime you can 'spend'. Toil reduction — at most 50% of an SRE's time on toil; the rest on engineering automation. Blameless postmortems — learn from incidents, don't punish. Embracing risk — 100% reliability is the wrong goal; aim for the SLO. SRE vs DevOps: DevOps = cultural philosophy for delivery. SRE = concrete implementation of that philosophy with measurable engineering practices. Often paraphrased: 'class SRE implements DevOps'. SRE was Google's specific implementation of DevOps before DevOps was a buzzword. Key SRE concepts: SLI/SLO/error budget, toil reduction, blameless postmortems. Don't say 'SRE = ops' — that misses the engineering rigor.