SLOs, SLIs, SLAs, oh my – CRE life lessons

Last week on CRE life lessons, we discussed how to come up with a precise numerical target for system availability. We term this target the Service Level Objective (SLO) of our system. Any discussion we have in future about whether the system is running sufficiently reliably and what design or architectural changes we should make to it must be framed in terms of our system continuing to meet this SLO.

We also have a direct measurement of SLO conformance: the frequency of successful probes of our system. This is a Service Level Indicator (SLI). When we evaluate whether our system has been running within SLO for the past week, we look at the SLI to get the service availability percentage. If it goes below the specified SLO, we have a problem and may need to make the system more available in some way, such as running a second instance of the service in a different city and load balancing between the two.
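To make this concrete, here is a minimal sketch of that SLI computation in SQL, assuming the probe results land in a hypothetical table called probe_results with one row per probe and a success flag; the table and column names are invented for illustration and are not part of the original system.

    -- Hypothetical table: probe_results(probe_time DATETIME, success BIT)
    -- Availability over the trailing seven days, as the percentage of successful probes.
    SELECT
        100.0 * SUM(CASE WHEN success = 1 THEN 1 ELSE 0 END)
              / NULLIF(COUNT(*), 0) AS availability_pct
    FROM probe_results
    WHERE probe_time >= DATEADD(day, -7, GETUTCDATE());
    -- Compare availability_pct against the SLO target (e.g. 99.9) to decide
    -- whether the service has been running within SLO for the week.

If availability_pct comes back below the SLO, that is the trigger for the availability work described above.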

Why have an SLO at all?
Suppose that we decide that running our aforementioned Shakespeare service against a formally defined SLO is too rigid for our tastes; we decide to throw the SLO out of the window and make the service “as available as is reasonable.” This makes things easier, no? You simply don’t mind if the system goes down for an hour now and then. Indeed, perhaps downtime is normal during a new release and the attendant stop-and-restart.

Unfortunately for you, customers don’t know that. All they see is that Shakespeare searches that were previously succeeding have suddenly started to return errors. They raise a high-priority ticket with support, who confirms that they see the error rate and escalates to you. Your on-call engineer investigates, confirms this is a known issue, and responds to the customer with “this happens now and again, you don’t have to escalate.” Without an SLO, your team has no principled way of saying what level of downtime is acceptable; there’s no way to measure whether or not this is a significant issue with the service, and you cannot terminate the escalation early with “Shakespeare search service is currently operating within SLO.” As our colleague Perry Lorier likes to say, “if you have no SLOs, toil is your job.”

The SLO you run at becomes the SLO everyone expects

A common pattern is to start your system off at a low SLO, because that’s easy to meet: you don’t want to run a 24/7 rotation, your initial customers are OK with a few hours of downtime, so you target at least 99% availability — 1.68 hours downtime per week. But in fact, your system is fairly resilient and for six months operates at 99.99% availability — down for only a few minutes per month.
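The arithmetic behind those figures: a 99% target allows (1 − 0.99) × 7 × 24 ≈ 1.68 hours of downtime per week, while 99.99% allows only (1 − 0.9999) × 30 × 24 × 60 ≈ 4.3 minutes per month.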

But then one week, something breaks in your system and it’s down for a few hours. All hell breaks loose. Customers page your on-call complaining that your system has been returning 500s for hours. These pages go unnoticed, because on-call leaves their pagers on their desks overnight, per your SLO which only specifies support during office hours.

The problem is, customers have become accustomed to your service being always available. They’ve started to build it into their business systems on the assumption that it’s always available. When it’s been continually available for six months and then goes down for a few hours, something is clearly seriously wrong. Your excessive availability has become a problem because now it’s the expectation. Thus the expression, “An SLO is a target from above and below” — don’t make your system very reliable if you don’t intend and commit to it being that reliable.

Within Google, we implement periodic downtime in some services to prevent a service from being overly available. In the SRE Book, our colleague Marc Alvidrez tells a story about our internal lock system, Chubby. Then there’s the set of test front-end servers that internal services use in testing, allowing those services to be reached externally. These front-end servers are convenient, but are explicitly not intended for use by real services; they have a one-business-day support SLA, and so can be down for 48 hours before the support team is even obligated to think about fixing them. Over time, experimental services that used those front-ends started to become critical; when we finally had a few hours of downtime on the front-ends, it caused widespread consternation.

Now we run a quarterly planned-downtime exercise with these front-ends. The front-end owners send out a warning, then block all services on the front-ends except for a small whitelist. They keep this up for several hours, or until the blockage causes a major problem, in which case it can be quickly reversed. At the end of the exercise the front-end owners receive a list of services that use the front-ends inappropriately, and work with the service owners to move them somewhere more suitable. This downtime exercise keeps front-end availability suitably low, and detects inappropriate dependencies in time to get them fixed.

Your SLA is not your SLO

At Google, we distinguish between a Service-Level Agreement (SLA) and a Service-Level Objective (SLO). An SLA normally involves a promise to someone using your service that its availability should meet a certain level over a certain period, and if it fails to do so then some kind of penalty will be paid. This might be a partial refund of the service subscription fee paid by customers for that period, or additional subscription time added for free. The concept is that going out of SLA is going to hurt the service team, so they’ll push hard to keep it within SLA.

Because of this, and because of the principle that availability shouldn’t be much better than the SLO, the SLA is normally a looser objective than the SLO. This might be expressed in availability numbers: for instance, an availability SLA of 99.9% over one month, with an internal availability SLO of 99.95%. Alternatively, the SLA might only specify a subset of the metrics that make up the SLO.

For example, with our Shakespeare search service, we might decide to provide it as an API to paying customers, where a customer pays us $10K per month for the right to send up to one million searches per day. Now that money is involved, we need to specify in the contract how available they can expect the service to be, and what happens if we breach that agreement. We might say that we’ll provide the service at a minimum of 99% availability, following the definition of successful queries given previously. If the service drops below 99% availability in a month, then we’ll refund $2K; if it drops below 80%, then we’ll refund $5K.
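As a rough sketch of how that contract clause might be evaluated each month (the variable name and refund logic below simply restate the illustrative terms above and are not from any real contract):

    -- Hypothetical measured availability for the paying customer this month.
    DECLARE @availability FLOAT = 0.987;

    SELECT CASE
               WHEN @availability < 0.80 THEN 5000   -- below 80%: refund $5K
               WHEN @availability < 0.99 THEN 2000   -- below 99%: refund $2K
               ELSE 0                                -- within SLA: no refund
           END AS refund_usd;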

If you have an SLA that’s different from your SLO, as it almost always is, it’s important for your monitoring to measure SLA compliance explicitly. You want to be able to view your system’s availability over the SLA calendar period, and easily see if it appears to be in danger of going out of SLA. You’ll also need a precise measurement of compliance, usually from logs analysis. Since we have an extra set of obligations (in the form of our SLA) to paying customers, we need to measure queries received from them separately from other queries (we might not mind dropping queries from non-paying users if we have to start load shedding, but we really care about any query from the paying customer that we fail to handle properly). That’s another benefit of establishing an SLA — it’s an unambiguous way to prioritize traffic.

When you define your SLA, you need to be extra-careful about which queries you count as legitimate. For example, suppose that you give each of three major customers (whose traffic dominates your service) a quota of one million queries per day. One of your customers releases a buggy version of their mobile client, and issues two million queries per day for two days before they revert the change. Over a 30-day period you’ve issued approximately 90 million good responses, and two million errors; that gives you a 97.8% success rate. You probably don’t want to give all your customers a refund as a result of this; two customers had all their queries succeed, and the customer for whom two million out of 32 million queries were rejected brought this upon themselves. So perhaps you should exclude all “out of quota” response codes from your SLA accounting.
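Here is a hedged sketch of that SLA accounting over a hypothetical request log; the table, columns, and response codes are invented for illustration. It restricts the calculation to paying customers and excludes “out of quota” rejections from the denominator:

    -- Hypothetical table: request_log(customer_id INT, is_paying BIT,
    --                                 response_code VARCHAR(32), request_time DATETIME)
    SELECT
        customer_id,
        100.0 * SUM(CASE WHEN response_code = 'OK' THEN 1 ELSE 0 END)
              / NULLIF(SUM(CASE WHEN response_code <> 'OUT_OF_QUOTA' THEN 1 ELSE 0 END), 0)
            AS sla_success_pct
    FROM request_log
    WHERE is_paying = 1
      AND request_time >= '2017-01-01' AND request_time < '2017-02-01'
    GROUP BY customer_id;
    -- 'OUT_OF_QUOTA' rejections are excluded from the denominator, so a customer
    -- who exceeds their own quota does not drag down the SLA figure.

As the next paragraph shows, this exclusion is only defensible when the rejections really are the customer’s fault.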

On the other hand, suppose you accidentally push an empty quota specification file to your service before going home for the evening. All customers receive a default quota of 1,000 queries per day. Your three top customers get served constant “out of quota” errors for 12 hours until you notice the problem when you come into work in the morning and revert the change. You’re now showing 1.5 million rejected queries out of 90 million for the month, a 98.3% success rate. This is all your fault: excluding those rejections and counting the remaining 88.5 million queries as 100% success misses the point, and amounts to a failure of honest SLA measurement.

Conclusion

SLIs, SLOs and SLAs aren’t just useful abstractions. Without them you cannot know if your system is reliable, available, or even useful. If they don’t tie explicitly back to your business objectives then you have no idea if the choices you make are helping or hurting your business. You also can’t make honest promises to your customers.

If you’re building a system from scratch, make sure that SLIs, SLOs and SLAs are part of your system requirements. If you already have a production system but don’t have them clearly defined then that’s your highest priority work.

To summarize:

If you want to have a reliable service, you must first define “reliability.” In most cases that actually translates to availability.
If you want to know how reliable your service is, you must be able to measure the rates of successful and unsuccessful queries; these will form the basis of your SLIs.
The more reliable the service, the more it costs to operate. Define the lowest level of reliability that you can get away with, and state that as your Service Level Objective (SLO).
Without an SLO your team and your stakeholders cannot make principled judgements about whether your service needs to be made more reliable (increasing cost and slowing development) or less reliable (allowing greater velocity of development).
If you’re charging your customers money you will probably need an SLA and it should be a little bit looser than your SLO.

As an SRE (or DevOps professional), it is your responsibility to understand how your systems serve the business in meeting those objectives, and, as much as possible, control for risks that threaten the high-level objective. Any measure of system availability which ignores business objectives is worse than worthless because it obfuscates the actual availability, leading to all sorts of dangerous scenarios, false senses of security, and failure.

For those of you who wrote us thoughtful comments and questions on our last article, we hope this post has been helpful. Keep the feedback coming!

N.B. Google Cloud Next ’17 is fewer than seven weeks away. Register now to join Google Cloud SVP Diane Greene, Google CEO Sundar Pichai, and other luminaries for three days of keynotes, code labs, certification programs, and over 200 technical sessions. And for the first time ever, Next ’17 will have a dedicated space for attendees to interact with Google experts in Site Reliability Engineering and Developer Operations.

Source: Google Cloud Platform

SQL Server 2016 innovations power Azure SQL Data Warehouse to deliver faster insights

Azure SQL Data Warehouse (SQL DW) is a SQL-based, petabyte-scale, massively parallel cloud solution for data warehousing. It is fully managed and highly elastic, enabling you to provision and scale capacity in minutes. You can scale compute and storage independently, allowing you to range from burst to archival scenarios, and pay based on what you’re using instead of being locked into a cluster configuration.
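For instance, scaling compute is a single T-SQL statement; the warehouse name and DWU level below are placeholders:

    -- Run against the master database of the logical server hosting the warehouse.
    -- Scales the hypothetical warehouse MyWarehouse to 1000 DWU; storage is unaffected.
    ALTER DATABASE MyWarehouse MODIFY (SERVICE_OBJECTIVE = 'DW1000');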

The engine underneath Azure SQL Data Warehouse that runs the queries on each individual node is the industry-leading SQL Server database from Microsoft. With general availability in 2016, Azure SQL DW was upgraded to SQL Server 2016, which transparently provided a 40% performance increase to user workloads consisting of analytic queries.

The two performance pillars of SQL DW are its column store and the batch mode execution engine, also known as vectorized query execution. In this blog, we highlight the improvements in SQL Server 2016 that took SQL Data Warehouse performance to a new level. These are all in addition to existing features such as columnar compression and segment elimination. We already had batch mode execution, which can process multiple rows at a time instead of one value at a time and take advantage of SIMD hardware innovations. SQL Server 2016 further extends batch mode execution to more operators and scenarios.
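The short sketches under the headings below run against a small hypothetical fact table stored as a clustered columnstore index; the table, its columns, and its data are invented purely for illustration.

    -- Hypothetical fact table used by the examples below. The clustered columnstore
    -- index is what enables columnar compression, segment elimination and batch mode.
    -- Azure SQL Data Warehouse syntax; on standalone SQL Server 2016 you would omit
    -- the DISTRIBUTION option and create the clustered columnstore index on the table instead.
    CREATE TABLE dbo.FactSales
    (
        SaleId      BIGINT       NOT NULL,
        ProductName VARCHAR(100) NOT NULL,
        RegionId    INT          NOT NULL,
        SaleDate    DATE         NOT NULL,
        Quantity    INT          NOT NULL,
        Amount      DECIMAL(9,2) NOT NULL
    )
    WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);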

The following are the key SQL Server 2016 performance innovations for columnstore and batch mode. Each is covered in a detailed blog post providing examples and the observed performance gain.

Aggregate Pushdown

Aggregates are very common in analytic queries. With columnstore tables, SQL Server processes aggregates in batch mode, delivering an order of magnitude better performance. SQL Server 2016 further dials up aggregate computation performance by pushing the aggregation down to the SCAN node. This allows the aggregate to be computed on the compressed data during the scan itself.
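A minimal sketch of the kind of query that benefits, using the hypothetical FactSales table defined earlier; the aggregates are computed on the compressed segments inside the columnstore scan rather than in a separate operator:

    -- Eligible aggregates (COUNT, SUM, AVG, MIN, MAX on small fixed-size types)
    -- can be evaluated during the scan of the compressed column segments.
    SELECT RegionId,
           SUM(Amount) AS TotalAmount,
           COUNT(*)    AS SaleCount
    FROM dbo.FactSales
    GROUP BY RegionId;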

String Predicate Pushdown

Columnstore in SQL Server 2016 allows string predicates to be pushed down to the SCAN node, resulting in a significant improvement in query performance. String predicate pushdown leverages dictionaries to minimize the number of string comparisons.
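For example (again against the hypothetical FactSales table; the product name is made up), the string filter below can be evaluated against the columnstore dictionary during the scan, so most rows are eliminated before they ever reach the filter operator:

    SELECT COUNT(*) AS MatchingSales
    FROM dbo.FactSales
    WHERE ProductName = 'Hamlet Folio';   -- string predicate pushed down to the scan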

Multiple Aggregates

SQL Server 2016 now processes multiple aggregates over a table scan more efficiently, in a single batch mode aggregation operator. Previously, multiple aggregation paths and operators would be instantiated, resulting in slower performance.
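A sketch of the pattern this helps, on the hypothetical FactSales table: several aggregates computed over a single scan now share one batch mode aggregation operator.

    SELECT MIN(Amount)         AS MinAmount,
           MAX(Amount)         AS MaxAmount,
           SUM(Amount)         AS TotalAmount,
           AVG(Quantity * 1.0) AS AvgQuantity   -- multiply by 1.0 to avoid integer averaging
    FROM dbo.FactSales;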

Batch Mode Window Aggregates

SQL Server 2016 introduces batch mode execution for window aggregates. Batch mode has the potential to speed up certain queries by as much as 300 times, as measured in some of our internal tests.
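A sketch of a windowed running total on the hypothetical FactSales table; in SQL Server 2016 this kind of query can use the new batch mode Window Aggregate operator:

    SELECT SaleId,
           RegionId,
           Amount,
           SUM(Amount) OVER (PARTITION BY RegionId
                             ORDER BY SaleDate
                             ROWS UNBOUNDED PRECEDING) AS RunningRegionTotal
    FROM dbo.FactSales;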

Batch Mode in Serial Execution

High concurrent activity and/or a low number of cores can force queries to run serially. Previously, serial queries were forced to run in row mode, taking a double hit from the lack of parallelism and the lack of batch mode. SQL Server 2016 can run batch mode even when the degree of parallelism (DOP) for a query is 1 (DOP 1 means the query runs serially). SQL Data Warehouse at lower SLOs (less than DWU1000) runs each distribution query serially, as there is less than one core per distribution. With this improvement, these queries now run in batch mode.
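As a rough illustration on a standalone SQL Server 2016 instance (the MAXDOP hint is only there to force serial execution for the demonstration; on SQL Data Warehouse at low DWU the serial execution happens on its own):

    SELECT RegionId, COUNT(*) AS SaleCount
    FROM dbo.FactSales
    GROUP BY RegionId
    OPTION (MAXDOP 1);   -- even at DOP 1, the scan and aggregate can run in batch mode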

The above is quite an extensive list of performance boosts that SQL Data Warehouse now benefits from. Best of all, no change to SQL Data Warehouse user queries is required to get these benefits – it all happens automatically under the hood!

Next steps

In this blog we described how SQL Server 2016 innovations in columnstore and batch mode technologies give a huge performance boost to Azure SQL Data Warehouse queries. We encourage you to try it out by moving your on-premises data warehouse into the cloud.

Learn more

Check out the many resources to learn more about SQL Data Warehouse.

What is Azure SQL Data Warehouse?

SQL Data Warehouse best practices

Video library

MSDN forum

Stack Overflow forum
Source: Azure