Site Reliability Engineering

Date read: 2020-12-28
How strongly I recommend it: 8.4/10
(See my list of books, for more.)

Go to the Amazon page for details and reviews.

2 elements of troubleshooting

“an understanding of how to troubleshoot generically (i.e., without any particular system knowledge) and a solid knowledge of the system.”

Focus of SRE

“Common to all SREs is the belief in and aptitude for developing software systems to solve complex problems.”

“By design, it is crucial that SRE teams are focused on engineering. Without constant engineering, operations load increases and teams will need more people just to keep pace with the workload.”

“Google places a 50% cap on the aggregate “ops” work for all SREs—tickets, on-call, manual tasks, etc. This cap ensures that the SRE team has enough time in their schedule to make the service stable and operable”

“Google’s rule of thumb is that an SRE team must spend the remaining 50% of its time actually doing development.”

“Often this means shifting some of the operations burden back to the development team, or adding staff to the team without assigning that team additional operational responsibilities.”
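
A rough sketch of how the cap works in practice. The hours and helper below are my own illustration, not from the book:

```python
# Sketch of the 50% ops cap, with hypothetical numbers. "Toil" stands in
# for tickets, on-call, and other manual work.

OPS_CAP = 0.50   # rule of thumb: at most half of SRE time on ops work

def ops_fraction(toil_hours: float, total_hours: float) -> float:
    """Fraction of team time spent on operational work."""
    return toil_hours / total_hours

team_toil = 480.0    # hypothetical toil hours logged this quarter
team_total = 840.0   # hypothetical total engineering hours available

fraction = ops_fraction(team_toil, team_total)
if fraction > OPS_CAP:
    # The book's remedy: shift some burden back to the dev team, or add
    # staff without adding operational responsibilities.
    print(f"ops load {fraction:.0%} exceeds the {OPS_CAP:.0%} cap")
else:
    print(f"ops load {fraction:.0%} is within the {OPS_CAP:.0%} cap")
```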

Sysadmin alternatives

“Google has chosen to run our systems with a different approach: our Site Reliability Engineering teams focus on hiring software engineers to run our products and to create systems to accomplish the work that would otherwise be performed, often manually, by sysadmins.”

The pros and cons of Sysadmin model

The advantage of the sysadmin model is that the required talent and tooling are widely available.

The direct cost of the sysadmin model is that manual change management and event handling cannot scale with traffic growth as a service becomes more popular.

The hidden cost of the sysadmin model is the conflict between the “dev” team and the “ops” team, which stems from their different backgrounds, skill sets, and incentives.

How quickly software can be released to production is one measure of this hidden cost.

“At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change — a new configuration, a new feature launch, or a new type of user traffic — the two teams’ goals are fundamentally in tension.”

SRE responsibilities

“An SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).”

Separation of rules and monitoring targets

“Borgmon configuration separates the definition of the rules from the targets being monitored. This means the same sets of rules can be applied to many targets at once, instead of writing nearly identical configuration over and over.”
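
Borgmon itself isn’t public, but the pattern is easy to sketch: define a rule once and evaluate it against every target. A minimal Python illustration (rule, threshold, and target names are invented):

```python
# Sketch of separating rule definitions from monitored targets: each rule
# is written once and evaluated against every target.

targets = {
    "frontend-us": {"errors": 12,  "requests": 4000},
    "frontend-eu": {"errors": 310, "requests": 5000},
    "api-backend": {"errors": 3,   "requests": 9000},
}

rules = [
    ("HighErrorRatio", lambda m: m["errors"] / m["requests"] > 0.05),
]

for target_name, metrics in targets.items():
    for rule_name, fires in rules:
        if fires(metrics):
            print(f"{rule_name} firing for {target_name}")
```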

Troubleshooting multiple components

“It’s then possible to look at the connections between components—or, equivalently, at the data flowing between them—to determine whether a given component is working properly. Injecting known test data in order to check that the resulting output is expected (a form of black-box testing) at each step can be especially effective, as can injecting data intended to probe possible causes of errors.”

“In a multilayer system where work happens throughout a stack of components, it’s often best to start systematically from one end of the stack and work toward the other end, examining each component in turn.”
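
A minimal sketch of that end-to-end walk, injecting known test data and checking each layer’s output. The components and expected values here are placeholders, not anything from the book:

```python
# Sketch: walk a multilayer pipeline from one end to the other, injecting
# known test data and checking each component's output.

def load_balancer(payload: str) -> str:
    return payload            # pretend pass-through

def app_server(payload: str) -> str:
    return payload.upper()    # pretend transformation

def storage_layer(payload: str) -> str:
    return payload            # pretend write/read round trip

stack = [
    ("load_balancer", load_balancer, "ping"),   # (name, component, expected output)
    ("app_server",    app_server,    "PING"),
    ("storage_layer", storage_layer, "PING"),
]

data = "ping"   # known test data injected at the top of the stack
for name, component, expected in stack:
    data = component(data)
    status = "ok" if data == expected else f"unexpected output {data!r}"
    print(f"{name}: {status}")
```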

Guide troubleshooting with questions

“A malfunctioning system is often still trying to do something—just not the thing you want it to be doing. Finding out what it’s doing, then asking why it’s doing that and where its resources are being used or where its output is going can help you understand how things have gone wrong.”

Diagnosis and understanding of system design

“A thorough understanding of the system’s design is decidedly helpful for coming up with plausible hypotheses about what’s gone wrong.”

Thinking system in SRE

“Modern research identifies two distinct ways of thinking that an individual may, consciously or subconsciously, choose when faced with challenges:

1. Intuitive, automatic, and rapid action

2. Rational, focused, and deliberate cognitive functions

When one is dealing with the outages related to complex systems, the second of these options is more likely to produce better results and lead to well-planned incident handling.”

On-call resources

“The most important on-call resources are:

1. Clear escalation paths

2. Well-defined incident-management procedures

3. A blameless postmortem culture”

Troubleshooting in theory

“Given a set of observations about a system and a theoretical basis for understanding system behavior, we iteratively hypothesize potential causes for the failure and try to test those hypotheses.”
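
The loop is easy to sketch: walk through candidate causes and check each against what you observe, ruling it out or keeping it for a real test. The hypotheses, tests, and observations below are invented for illustration:

```python
# Sketch of the hypothesize-and-test loop described above.

observations = {"error_rate_up": True, "latency_up": False}

hypotheses = [
    ("overloaded backend", lambda obs: obs["latency_up"]),
    ("bad config push",    lambda obs: obs["error_rate_up"] and not obs["latency_up"]),
]

for cause, consistent_with in hypotheses:
    if consistent_with(observations):
        print(f"consistent with observations: {cause} -- test it next")
        break
    print(f"ruled out: {cause}")
else:
    print("no hypothesis fits; gather more observations")
```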

Demand Forecasting and Capacity Planning

“Capacity planning should take both organic growth (which stems from natural product adoption and usage by customers) and inorganic growth (which results from events like feature launches, marketing campaigns, or other business-driven changes) into account.”
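
A back-of-the-envelope sketch of such a forecast, with hypothetical numbers: organic growth compounds over time, while inorganic growth lands as step changes tied to specific events:

```python
# Demand forecast combining organic growth (compounding adoption) with
# inorganic step changes (launches, campaigns). All numbers are hypothetical.

current_qps = 10_000                       # today's peak demand
organic_growth_per_month = 0.05            # 5% compounding monthly growth
inorganic_events = {3: 2_000, 6: 5_000}    # extra QPS landing in months 3 and 6

def forecast(months: int) -> float:
    demand = current_qps * (1 + organic_growth_per_month) ** months
    demand += sum(qps for month, qps in inorganic_events.items() if month <= months)
    return demand

for m in (3, 6, 12):
    print(f"month {m}: ~{forecast(m):,.0f} QPS")
```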

SRE Monitoring

“A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs. However, this type of email alerting is not an effective solution: a system that requires a human to read an email and decide whether or not some type of action needs to be taken in response is fundamentally flawed. Monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.”
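
A minimal sketch of that principle: the software classifies the signal, and a human is notified only when action is required. The thresholds and categories below are my own illustration:

```python
# "Software interprets, humans act": the check classifies the signal
# itself, and a human is paged only when action is needed.

def classify(error_ratio: float) -> str:
    if error_ratio > 0.10:
        return "page"      # user-visible damage: a human must act now
    if error_ratio > 0.02:
        return "ticket"    # needs attention eventually, not a wake-up call
    return "ok"            # no human should see this at all

for ratio in (0.001, 0.05, 0.20):
    decision = classify(ratio)
    if decision == "page":
        print(f"paging on-call: error ratio {ratio:.1%}")
    elif decision == "ticket":
        print(f"filing ticket: error ratio {ratio:.1%}")
```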

Software system inertia

Like physical objects in real life, software systems have inertia: a working computer system tends to remain in motion until acted upon by an external force, such as a configuration change or a shift in the type of load served. Recent changes to a system can be a productive place to start identifying what’s going wrong.
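
A small illustration of starting the hunt from recent changes (the change log entries are hypothetical): sort by time and examine the newest first:

```python
# When triaging, look at the most recent changes first.

changes = [
    ("2020-12-27 09:14", "config push: raise cache TTL"),
    ("2020-12-28 02:03", "binary release: frontend v2.31"),
    ("2020-12-26 18:40", "traffic shift: +20% from new region"),
]

# The newest change is the most likely "external force" to examine first.
for timestamp, description in sorted(changes, reverse=True):
    print(timestamp, description)
```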

Change Management

“roughly 70% of outages are due to changes in a live system”

“Best practices in this domain use automation to accomplish the following:

Implementing progressive rollouts

Quickly and accurately detecting problems

Rolling back changes safely when problems arise”
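
A minimal sketch of that automation, using placeholder deploy and health-check functions: push the change to progressively larger slices and roll back as soon as a slice looks unhealthy:

```python
# Sketch: progressive rollout with automated detection and rollback.
# All functions and numbers are placeholders for illustration.

def deploy(fraction: float) -> None:
    print(f"deployed new version to {fraction:.0%} of servers")

# Pretend error rates observed after each slice (a real system would
# measure these from monitoring).
observed_error_rate = {0.01: 0.001, 0.10: 0.002, 0.50: 0.040, 1.00: 0.001}

def healthy(fraction: float) -> bool:
    return observed_error_rate[fraction] < 0.01   # detection stand-in

def rollback() -> None:
    print("problem detected: rolling back to the previous version")

for fraction in (0.01, 0.10, 0.50, 1.00):   # progressive rollout stages
    deploy(fraction)
    if not healthy(fraction):
        rollback()
        break
else:
    print("rollout complete")
```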