Available . . . or not? That is the question – CRE life lessons

By AJ Ross, Matt Brown and Adrian Hilton, Customer Reliability Engineers, and Dave Rensin, Director of Customer Reliability Engineering

In our last installment of the CRE life lessons series, we discussed how to survive a “success disaster” with load-shedding techniques. We got a lot of great feedback from that post, including several questions about how to tie measurements to business objectives. So, in this post, we decided to go back to first principles, and investigate what “success” means in the first place, and how to know if your system is “succeeding” at all.

A prerequisite to success is availability. A system that’s unavailable cannot perform its function and will fail by default. But what is “availability”? We must define our terms:

Availability defines whether a system is able to fulfill its intended function at a point in time. In addition to being used as a reporting tool, the historical availability measurement can also describe the probability that your system will perform as expected in the future. Sometimes availability is measured by using a count of requests rather than time directly. In either case, the structure of the formula is the same: successful units / total units. For example, you might measure uptime / (uptime + downtime), or successful requests / (successful requests + failed requests). Regardless of the particular unit used, the result is a percentage like 99.9% or 99.999% —  sometimes referred to as “three nines” or “five nines.”

Achieving high availability is best approached by focusing on the unsuccessful component (e.g., downtime or failed requests). Taking a time-based availability metric as an example: given a fixed period of time (e.g., 30 days, 43200 minutes) and an availability target of 99.9% (three nines), simple arithmetic shows that the system must not be down for more than 43.2 minutes over the 30 days. This 43.2 minute figure provides a very concrete target to plan around, and is often referred to as the error budget. If you exceed 43.2 minutes of downtime over 30 days, you’ll not meet your availability goal.
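
To make that arithmetic concrete, here is a minimal sketch in Go (illustrative values only, not code from the original post) that derives the error budget for a window directly from the availability target:

package main

import (
	"fmt"
	"time"
)

// errorBudget returns the allowed downtime for a given availability
// target (e.g., 0.999 for "three nines") over a measurement window.
func errorBudget(target float64, window time.Duration) time.Duration {
	return time.Duration((1 - target) * float64(window))
}

func main() {
	window := 30 * 24 * time.Hour // a 30-day window (43,200 minutes)
	budget := errorBudget(0.999, window)
	fmt.Printf("allowed downtime: %v\n", budget.Round(time.Second)) // prints 43m12s
}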

Two further concepts are often used to help understand and plan the error budget:

Mean Time Between Failures (MTBF): total uptime / # of failures. This is the average time between failures.

Mean Time to Repair (MTTR): total downtime / # of failures. This is the average time taken to recover from a failure.

These metrics can be computed historically (e.g., over the past 3 months, or year) and combined as (Total Period / MTBF) * MTTR to give an expected downtime value. Continuing with the above example, if the historical MTBF is calculated to be 10 days, and the historical MTTR is calculated to be 20 minutes, then you would expect to see 60 minutes of downtime ((30 days / 10 days) * 20 minutes) — clearly outside the 43.2-minute error budget for a three-nines availability target. To meet the target would require increasing the MTBF (say, to 20 days) or decreasing the MTTR (say, to 10 minutes), or a combination of both.
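
Continuing the same sketch, the expected downtime implied by historical MTBF and MTTR can be computed and compared against the error budget; the values below are the illustrative ones from the example above:

package main

import (
	"fmt"
	"time"
)

// expectedDowntime estimates downtime over a window from historical
// failure data: (window / MTBF) * MTTR.
func expectedDowntime(window, mtbf, mttr time.Duration) time.Duration {
	failures := float64(window) / float64(mtbf)
	return time.Duration(failures * float64(mttr))
}

func main() {
	window := 30 * 24 * time.Hour
	mtbf := 10 * 24 * time.Hour // one failure every 10 days
	mttr := 20 * time.Minute    // 20 minutes to recover

	downtime := expectedDowntime(window, mtbf, mttr)
	budget := time.Duration(0.001 * float64(window)) // three-nines budget for the window

	fmt.Printf("expected downtime %v vs. budget %v\n", downtime, budget)
	// prints roughly: expected downtime 1h0m0s vs. budget 43m12s
}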

Keeping the concepts of error budget, MTBF and MTTR in mind when defining an availability target helps to provide justification for why the target is set where it is. Rather than simply describing the target as a fixed number of nines, it’s possible to relate the numeric target to the user experience in terms of total allowable downtime, frequency and duration of failure.

Next, we’ll look at how to ensure this focus on user experience is maintained when measuring availability.

Measuring availability

How do you know whether a system is available? Consider a fictitious “Shakespeare” service, which allows users to find mentions of a particular word or phrase in Shakespeare’s texts. This is a canonical example, used frequently within Google for training purposes, and mentioned throughout the SRE book.

Let’s try applying the scientific method to determine the availability of the hypothetical Shakespeare system.

Question: how often is the system available?
Observation: when you visit shakespeare.com, you normally get back the “200 OK” status code and an HTML blob. Very rarely, you see a 500 Internal Server error or a connection failure.
Hypothesis: if “availability” is the percentage of requests per day that return 200 OK, the system will be 99.9% available.
Measure: “tail” the response logs of the Shakespeare service’s web servers and dump them into a logs-processing system.
Analyze: take a daily availability measurement as the percentage of 200 OK responses out of the total number of requests (a minimal sketch of this step follows the list below).
Interpret: after seven days, the lowest daily availability observed is 99.7%.
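
As a sketch of the Analyze step, the snippet below counts 200 OK responses against total requests from a day's worth of access-log lines piped to stdin. The assumption that the status code is the second whitespace-separated field is purely for illustration and would need to match your actual log format:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Reads access-log lines on stdin and computes the fraction of requests
// that returned 200 OK, assuming the status code is the second field.
func main() {
	var total, ok int
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 {
			continue
		}
		total++
		if fields[1] == "200" {
			ok++
		}
	}
	if total > 0 {
		fmt.Printf("daily availability: %.3f%% (%d/%d)\n",
			100*float64(ok)/float64(total), ok, total)
	}
}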

Happily, you report these availability numbers to your boss (Dave), and go home. A job well done.

The next day Dave draws your attention to the support forum. Users are complaining that all their searches at shakespeare.com return no results. Dave asks why the availability dashboard shows 99.7% availability for the last day, when there clearly is a problem.

You check the logs and notice that the web server has received just 1,000 requests in the last 24 hours, all of them 200 OKs except for three 500s. Given that you expect at least 100 queries per second, this tiny request volume explains why users are complaining in the forums even though the dashboard looks fine.

You’ve made the classic mistake of basing your definition of availability on a measurement that does not match user expectations or business objectives.

Redefining availability in terms of the user experience with black-box monitoring

After fixing the critical issue (a typo in a configuration file) that prevented the Shakespeare frontend service from reaching the backend, we take a step back to think about what it means for our system to be available.

If the “rate of 200 OK logs for shakespeare.com” is not an appropriate availability measurement, then how should we measure availability?

Dave wants to understand the availability as observed by users. When does the user feel that shakespeare.com is available? After some lively back-and-forth, we agree that the system is available when a user can visit shakespeare.com, enter a query and get a result for that query within five seconds, 100% of the time.

So you write a black-box “prober” (black-box, because it makes no assumptions about the implementation of the Shakespeare service; see the SRE Book, Chapter 6) to emulate a full range of client devices (mobile, desktop). For each type of client, you visit shakespeare.com, enter the query “to be or not to be,” and check that the result contains the expected link to Hamlet. You run the prober for a week, and finally recalculate the minimum daily availability measure: 80% of queries return Hamlet within five seconds, 18% of queries take longer, 1% time out and another 1% return errors. A full 20% of queries fail our definition of availability!
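
A stripped-down prober along these lines might look like the sketch below. The URL path, query parameter and Hamlet check are assumptions made for illustration; this is not Google's actual prober code:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// probe issues one black-box query and reports whether the user-facing
// definition of "available" was met: a result containing "Hamlet"
// returned within five seconds.
func probe(baseURL, query string) error {
	client := &http.Client{Timeout: 5 * time.Second} // the five-second SLO
	u := baseURL + "/search?q=" + url.QueryEscape(query)

	resp, err := client.Get(u)
	if err != nil {
		return fmt.Errorf("request failed or timed out: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if !strings.Contains(string(body), "Hamlet") {
		return fmt.Errorf("response does not contain the expected Hamlet link")
	}
	return nil
}

func main() {
	if err := probe("https://shakespeare.com", "to be or not to be"); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe succeeded")
}

In practice you would run something like this on a schedule, from several client types and locations, and aggregate the pass/fail results into the availability measure described above.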

Choosing an availability target according to business goals

After getting over his shock, Dave asks a simple question: “Why can’t we have 100% returning within 5 seconds?”

You explain all the usual reasons why: power outages, fiber cuts, etc. After an hour or so, Dave is willing to admit that 100% query response in under five seconds is truly impossible.

Which leads Dave to ask, “What availability can we have, then?”

You turn the question around on him: “What availability is required for us to meet our business goals?”

Dave’s eyes light up. The business has set a revenue target of $25 million per year, and we make on average $0.01 per query result. At 100 queries per second * 31,536,000 seconds per year * 80% success rate * $0.01 per query, we’ll earn $25.23 million. In other words, even with a 20% failure rate, we’ll still hit our revenue targets!

Still, a 20% failure rate is pretty ugly. Even if we think we’ll meet our revenue targets, it’s not a good user experience and we might have some attrition as a result. Should we fix it, and if so, what should our availability objective be?

Evaluating cost/benefit tradeoffs, opportunity costs

Suppose the rate of queries returning in greater than five seconds can be reduced to 0.5% if an engineer works on the problem for six months. How should we decide whether or not to do this?

We can start by estimating how much the 20% failure rate is going to cost us in missed revenue (accounting for users who give up on retrying) over the life of the product. We know roughly how much it will cost to fix the problem. Naively, we may decide that since the revenue lost due to the error rate exceeds the cost of fixing the issue, then we should fix it.

But this ignores a crucial factor… the opportunity cost of fixing the problem. What other things could an engineer have done with that time instead?

Hypothetically, there’s a new search algorithm that increases the relevance of Shakespeare search results, and putting it into production might drive a 20% increase in search traffic, even as availability remains constant. This increase in traffic could easily offset any lost revenue due to poor availability.

An oft-heard SRE saying is that you should “design a system to be as available as is required, but not much more.” At Google, when designing a system, we generally target a given availability figure (e.g., 99.9%), rather than particular MTBF or MTTR figures. Once we’ve achieved that availability metric, we optimize our operations for “fast fix,” e.g., MTTR over MTBF, accepting that failure is inevitable, and that “spes consilium non est” (Hope is not a strategy). SREs are often able to mitigate the user visible impact of huge problems in minutes, allowing our engineering teams to achieve high development velocity, while simultaneously earning Google a reputation for great availability.

Ultimately, the tradeoff made between availability and development velocity belongs to the business. Precisely defining availability in product terms allows us to have a principled discussion and to make choices we can be proud of.

N.B. Google Cloud Next ’17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai and other luminaries for three days of keynotes, code labs, certification programs and over 200 technical sessions. And for the first time ever, Next ’17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.
Source: Google Cloud Platform

Our journey on building the Go SDK for Azure

Over the last few months, we've been busy adding new functionality to the Azure Go SDK, and we'll keep doing so as we march towards public preview next year.

If you followed the recent changes on our GitHub repo, you probably noticed a few general improvements we made to the SDK:

Model Flattening

In the last release we added model flattening to many of our APIs (e.g., you can type resource.Sku.Family instead of resource.Properties.Sku.Family), which makes for more readable code.
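
Conceptually, flattening promotes a nested property to the top level of the model. The struct names below are invented purely to show the before/after shape and are not the SDK's real type definitions:

package main

// resourceBefore mimics the pre-flattening shape: callers must reach
// through Properties to get at Sku (types invented for illustration).
type resourceBefore struct {
	Properties struct {
		Sku struct {
			Family string
		}
	}
}

// resourceAfter mimics the flattened shape: Sku is promoted to the top
// level, so callers write resource.Sku.Family directly.
type resourceAfter struct {
	Sku struct {
		Family string
	}
}

func main() {
	var before resourceBefore
	var after resourceAfter
	before.Properties.Sku.Family = "Standard"
	after.Sku.Family = "Standard" // one level shorter after flattening
	_ = before
	_ = after
}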

Better error messages during parameter validation

During parameter validation, we enabled the SDK to return an error with the info needed to fix the JSON before sending the request out – making it easier to identify/correct potential coding mistakes.

For example, consider a scenario where a user wants to create a resource group, an operation that requires the location property, but forgets to include it in the request.

In previous SDK versions, the operation would fail inside Azure and the user would get the following error:

resources.GroupsClient#: Failure responding to request: StatusCode=400 — Original Error: autorest/azure: Service returned an error. Status=400 Code="LocationRequired" Message="The location property is required for this definition."

In the latest SDK version, the user would get:

resources.GroupsClientCreateOrUpdate: Invalid input: autorest/validation: validation failed: parameter=parameters.Location constraint=Null value=(*string)(nil) details: value can not be null; required parameter
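
The underlying pattern is client-side validation of required parameters before any request leaves the machine. The sketch below imitates that behavior with hypothetical types rather than the SDK's real GroupsClient, purely to show the shape of the check:

package main

import (
	"errors"
	"fmt"
)

// group mimics the shape of a resource-group request body; this is not
// the SDK's actual type, just an illustration.
type group struct {
	Location *string
	Name     string
}

// validate performs the kind of pre-flight check described above:
// reject a nil required field locally instead of letting Azure return a 400.
func validate(g group) error {
	if g.Location == nil {
		return errors.New("validation failed: parameter=parameters.Location " +
			"constraint=Null value=(*string)(nil); value can not be null; required parameter")
	}
	return nil
}

func main() {
	g := group{Name: "my-group"} // Location accidentally omitted
	if err := validate(g); err != nil {
		fmt.Println(err) // caught before any network call is made
		return
	}
	// ...otherwise proceed to send the create/update request.
}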


We also improved the coverage and functionality of the data plane of the SDK by adding support for file and directory manipulation, getting/setting ACLs on containers, working with the Storage Emulator, and various other storage blob and queue operations.

Some of the fixes and improvements added to the SDK have been provided by enthusiastic developers outside of our Microsoft team and we would like to extend our sincere gratitude and appreciation to everyone who sent us feedback and/or pull requests. We took note of your requests for better API coverage in the data plane, better documentation, release notes and samples, and we are making progress in incorporating them into our future releases.

Breaking changes

Speaking of future releases: while many API changes are expected to be additive in nature, some of the changes we are introducing will break existing clients. A recent example was issue 1559, which arose when we added parameter validation; in the near future, some methods and parameters may be added or deleted, parameter order may change, and structs can change as we consider model flattening for more APIs. This is part of the reason why we keep the 'beta' label on the Go SDK, and we are carefully examining every proposed change for alternatives that will not break existing functionality.

We’d like to thank in advance all of you who continue to use our Go SDK and send us feedback; we are committed to building the best experience for developers on our platform, and we'd like to make sure the changes have minimal impact on your development cycle as the SDK moves towards the more mature stages of public preview and GA (general availability).

We will use this blog to keep you updated on the progress and potential breaking changes, and we’ll give you a heads-up as we are approaching new milestones.
Have any suggestions for how to make the SDK better? We’d love to hear from you! Send us a PR or file an issue, and let’s talk!
Source: Azure

The White House Denies That Alex Jones Has Been Offered Press Credentials

Alex Jones likely won't be attending any upcoming White House press briefings, according to the Trump administration's press office.

Yesterday, Alex Jones told viewers on his popular YouTube channel that his conspiracy news site, Infowars, had been offered White House press credentials by the new administration to cover the Trump White House. But on Thursday, White House press officials told BuzzFeed News that Jones and Infowars have not been offered a spot in the briefing room. “He is not credentialed for the White House,” a White House Deputy Press Secretary said. “The White House Press office has not offered him credentials.”

Jones, an ardent Trump supporter, has been called “America’s leading conspiracy theorist” and is a prominent 9/11 and Sandy Hook truther. His false suggestion that he's been offered White House press credentials comes on the heels of reports that the Trump administration is planning to open the briefing room to alternative outlets. The far-right-leaning outlet Gateway Pundit, for example, has also suggested it will receive a spot in the briefing room and has already hired a White House correspondent. Currently, though, the administration has only announced it will open up four “virtual press seats” for local press outlets more than 50 miles outside Washington, DC.

The statement contradicts Jones' video. Here's Jones' full statement from his YouTube page on the White House credentials, via Media Matters:

Here's the deal, I know I get White House credentials, we've already been offered them, we're going to get them, but I've just got to spend the money to send somebody there. I want to make sure it's even worth it. I don't want to just sit there up there like “I'm in the media, look our people are there.” People don't understand this paradigm, we're devolving in a good way, power from the federal government back to the people, back from the centralized MSM [mainstream media] to the people, just like Trump said in his speech.

But there is investigative journalism, or people to interview in DC. Might be good to put a few reporters there, it's just all a money issue. That's why it's important for people who are watching us to know, you are our sponsors. You're the reason we're able to do this. You're the reason we're able to have the crew and do what we do and change the world.

Source: BuzzFeed