SRE at Google: Our complete list of CRE life lessons

In 2016 we announced a new discipline at Google, Customer Reliability Engineering, an offshoot of Site Reliability Engineering (SRE). Our goal with CRE was (and still is) to create a shared operational fate between Google and our Google Cloud customers, to give you more control over the critical applications you’re entrusting to us. Since then, here on the Google Cloud blog, we’ve published a wealth of resources to help you take the best practices we’ve learned from SRE teams at Google and apply them in your own environments. Below is the complete list of CRE life lessons posts we’ve published in the past five years in one convenient location.

Common pitfalls
- Know thy enemy: How to prioritize and communicate risks
- How to avoid a self-inflicted DDoS Attack
- Using load shedding to survive a success disaster

Service-level metrics
- Available . . . or not? That is the question
- SLOs, SLIs, SLAs, oh my
- Building good SLOs
- Consequences of SLO violations
- An example escalation policy
- Applying the escalation policy
- Defining SLOs for services with dependencies
- Tune up your SLI metrics
- Learning—and teaching—the art of service-level objectives
- Using deemed SLIs to measure customer reliability

Releases
- Reliable releases and rollbacks
- How release canaries can save your bacon

SRE support
- Why should your app get SRE support?
- How SREs find the landmines in a service
- Making the most of an SRE service takeover

Dark launches
- What is a dark launch, and what does it do for me?
- The practicalities of dark launching

Postmortems
- Fearless shared postmortems
- Getting the most out of shared postmortems

Error Budgets
- Good housekeeping for error budgets
- Understanding error budget overspend

Production Incidents
- Shrinking the impact of production incidents using SRE principles
- Shrinking the time to mitigate production incidents

We still have plenty more articles to come, so keep your eye on our DevOps & SRE channel. You can also check out sre.google or read our SRE books online.
Source: Google Cloud Platform
