Mai 2017 - Seite 45 von 189 - Cloud Computing Köln

We recently ran a contributor survey in the RDO community, and while the participation was fairly small (21 respondants), there’s a lot of important insight we can glean from it.

First, and unsurprisingly:

Of the 20 people who answered the “corporate affiliation” question, 18 were Red Hat employees. While we are already aware that this is a place where we need to improve, it’s good to know just how much room for improvement there is. What’s useful here will be figuring out why people outside of Red Hat are not participating more. This is touched on in later questions.

Next, we have the frequency of contributions:

Here we see that while 14% of our contributors are pretty much working on RDO all the time, the majority of contributors only touch it a few times per release – probably updating a single package, or addressing a single bug, for that particular cycle.

This, too, is mostly in line with what we expected. With most of the RDO pipeline being automated, there’s little that most participants would need to do beyond a handful of updates each release. Meanwhile, a core team works on the infrastructure and the tools every week to keep it all moving.

We asked contributors where they participate:

Most of the contributors – 75% – indicate that they are involved in packaging. (Respondants could choose more than one area in which they participate.) Test day participation was a distant second place (35%), followed by documentation (25%) and end user support (25%)

I’ve personally seen way more people than that participate in end user support, on the IRC channel, mailing list, and ask.openstack.org. Possibly these people don’t think of what they’re doing as support, but it is still a very important way that we grow our user community.

The rest of the survey delves into deeper details about the contribution process.

When asked about the ease of contribution, 80% said that it was ok, with just 10% saying that the contribution process was too hard.

When asked about difficulties encountered in the contribution process:

Answers were split fairly evenly between “Confusing or outdated documentation”, “Complexity of process”, and “Lack of documentation”. Encouragingly, “lack of community support” placed far behind these other responses.

It sounds like we have a need to update the documentation, and greatly clarify it. Having a first-time contributor’s view of the documentation, and what unwarranted assumptions it makes, would be very beneficial in this area.

When asked how these difficulties were overcome, 60% responded that they got help on IRC, 15% indicated that they just kept at it until they figured it out, and another 15% indicated that they gave up and focused their attentions elsewhere.

Asked for general comments about the contribution process, almost all comments focused on the documentation – it’s confusing, outdated, and lacks useful examples. A number of people complained about the way that the process seems to change almost every time they come to contribute. Remember: Most of our contributors only touch RDO once or twice a release, and they feel that they have to learn the process from scratch every time. Finally, two people complained that the review process for changes is too slow, perhaps due to the small number of reviewers.

I’ll be sharing the full responses on the RDO-List mailing list later today.

Thank you to everyone who participated in the survey. Your insight is greatly appreciated, and we hope to use it to improve the contributor process in the future.
Quelle: RDO

24. Mai 2017

da Agency

Know thy enemy: how to prioritize and communicate risks – CRE life lessons

Editor’s note: We’ve spent a lot of time in CRE Life Lessons talking about how to identify and mitigate risks in your system. In this post, we’re going to talk about how to effectively communicate and stack-rank those risks.

When a Google Cloud customer engages with Customer Reliability Engineering (CRE), one of the first things we do is an Application Reliability Review (ARR). First, we try to understand your application’s goals: what it provides to users and the associated service level objectives (SLOs) (or we help you create SLOs if you do not have any!). Second, we evaluate your application and operations to identify risks that threaten your ability to reach your SLOs. For each identified risk, we provide a recommendation on how to eliminate or mitigate it based on our experiences at Google.

The number of risks identified for each application varies greatly depending on the maturity of your application and team and target level for reliability or performance. But whether we identify five risks or 50, two fundamental facts remain true: Some risks are worse than others, and you have a finite amount of engineering time to address them. You need a process to communicate the relative importance of the risks and to provide guidance on which risks should be addressed first. This appears easy, but beware! The human brain is notoriously unreliable at comparing and evaluating risks.

This post explains how we developed a method for analyzing risks during an ARR, allowing us to present our customers with a clear, ranked list of recommendations, explain why one risk is ranked above another, and describe the impact a risk may have on the application’s SLO target. By the end of this post, you’ll understand how to apply this to your own application, even without going through a CRE engagement.

Take one: the risk matrix
Each risk has many properties that can be used to evaluate its relative importance. In discussions internally and with customers, two properties in particular stand out as most relevant:

The likelihood of the risk occurring in a given time period.
The impact that would be felt if the risk materializes.

We began by defining three levels for each property, which are represented in the following 3×3 table.

Example table with representative risks for each category: The row headers represent likelihood and column headers represent impact.

Catastrophic

Damaging

Minimal

Frequent

Overload results in slow or dropped requests during the peak hour each day.

The wrong server is turned off and requests are dropped.

Restarts for weekly upgrades drop in-progress requests (i.e., no lame ducking).

Common

A bad release takes the entire service down. Rollback is not tested.

Users report an outage before monitoring and alerting notifies the operator.

A daylight savings bug drops requests.

Rare

There is a physical failure in the hosting location that requires complete restoration from a backup or disaster recovery plan.

Overload results in a cascading failure. Manual intervention is required to halt or fix the issue.

A leap year bug causes all servers to restart and drop requests.

We tested this approach with a couple of customers by bucketing the risks we had identified into the table. This is not a novel approach. We very quickly realized that our terminology and format are the same as that used in a risk matrix, a commonly used management tool in the risk assessment field. This realization seemed to confirm that we were on the right track, and had created something that customers and their management could easily understand.

We were right: Our customers told us that the table of risks was a good overview and was easy to grasp. However, we struggled to explain the relative importance of entries in the list based on the cells in the table:

The distribution of risks across the cells was extremely uneven. Most risks ended up in the “common, damaging” cell, which doesn’t help to explain relative importance of the items within each cell.
Assigning a risk to a cell (and its subsequent position in the list of risks) is subjective and depends on the reliability target of the application. For example, the “frequent, catastrophic” example of dropping traffic for a few minutes during a release is catastrophic at four nines, but less so at two nines.
Ordering the cells into a ranking is not straightforward. Is it more important to handle a “rare, catastrophic” risk, or a “frequent, minimal” risk? The answer is not clear from the names or definitions of the categories alone. Further, the desired order can change from matrix to matrix depending on the number of items in each cell.

Risk expressed as expected losses
As we showed in the previous section, the traditional risk matrix does a poor job of explaining the relative importance of each risk. However, the risk assessment field offers another useful model: using impact and likelihood to calculate the expected loss from a risk. Expressed as a numeric quantity, this expected loss value is great way to explain the relative importance of our list of risks.

How do we convert qualitative concepts of impact and likelihood to quantified values that we can use to calculate expected loss? Consider our earlier posts on availability and SLOs, specifically, the concepts of Mean Time Between Failure (MTBF), Mean Time To Recover (MTTR), and error budget. The MTBF of a risk provides a measure of likelihood (i.e., how long it takes for the risk to cause a failure), the MTTR provides a measure of impact (i.e., how long we expect the failure to last before recovering), and the error budget is the expected number of downtime minutes per year that you’re willing to allow (a.k.a. accepted loss).

Now with this system, when we work through an ARR and catalog risks, we use our experience and judgement to estimate each risk’s MTBF (counted in days) and the subsequent MTTR (counted in minutes out of SLO). Using these two values, we estimate the expected loss in minutes for each risk over a fixed period of time, and generate the desired ranking.

We found that calculating expected losses over a year is a useful timeframe for risk-ranking, and developed a three-colour traffic light system to provide high-level guidance and quick visual feedback on the magnitude of each risk vs. the error budget:

Red: This risk is unacceptable, as it falls above the acceptable error budget for a single risk (we typically use 25%), and therefore, can have a major impact on your reliability in a single event.
Amber: This risk should not be acceptable, as it’s a major consumer of your error budget and therefore, needs to be addressed. You may be able to accept some amber risks by addressing some less urgent (green) risks to buy back budget.
Green: This is an acceptable risk. It’s not a major consumer of your error budget, and in aggregate, does not cause your application to exceed the error budget. You don’t have to address green risks, but may wish to do so to give yourself more budget to cover unexpected risks, or to accept amber risks that are hard to mitigate or eliminate.

Based on the three-colour traffic light system, the following table demonstrates how we rank and colour the risks given a 3-nines availability target. The risks are a combination of those in the original matrix and some additional examples to help illustrate the amber category. You can refer to the spreadsheet linked at the end of this post to see the precise MTTR and MTBF numbers that underlie this table, along with additional examples of amber risks.

Risk

Bad minutes/year

Overload results in slow or dropped requests during the peak hour each day.

3559

A bad release takes the entire service down. Rollback is not tested.

507

Users report an outage before monitoring and alerting notifies the operator.

395

There is a physical failure in the hosting location that requires complete restoration from a backup or disaster recovery plan.

242

The wrong server is turned off and requests are dropped.

213

Overload results in a cascading failure. Manual intervention is required to halt or fix the issue.

150

Operator accidentally deletes database; restore from backup is required

129

Unnoticed growth in usage triggers overload; service collapses.

125

A configuration mishap reduces capacity; causing overload and dropped requests

122

A new release breaks a small set of requests; not detected for a day.

119

Operator is slow to debug and root cause bug due to noisy alerting

A daylight savings bug drops requests.

Restarts for weekly upgrades drop in-progress requests (i.e., no lame ducking).

A leap year bug causes all servers to restart and drop requests.

Other Considerations
The ranked list of risks is extremely useful for communicating the findings of an ARR and conveying the relative magnitude of the risks compared to each other. We recommend that you use the list only for this purpose. Do not prioritize your engineering work directly based on the list. Instead, use the expected loss values as inputs to your overall business planning process, taking into consideration remediation and opportunity costs to prioritize work.

Also, don’t be tricked into thinking that because you have concrete numbers for the expected loss, that they are precise! They’re only as good as the estimates derived from MTBF and MTTR values. In the best case, MTBF and MTTR are averages from observed data; more commonly, they will be estimates based purely on intuition and experience. To minimize introducing errors into the final ranking, we recommend estimating MTBF and MTTR values likely to be within an order of magnitude of correct, rather than use specific, potentially inaccurate values.

Somewhat in contrast to the advice just mentioned, we find it useful to introduce additional granularity into the calculation of MTBF and MTTR values, for more accurate estimates. First, we split MTTR into two components:

Mean Time To Detect (MTTD): The time between when the risk first manifests and when the issue is brought to the attention of someone (or something) capable of remediating it.
Mean Time To Repair (MTTR): Redefined to mean the time between when the issue is brought to the attention of someone capable of remediating it and when it is actually remediated.

This granularity is driven by the realization that, often, the time to notice an issue and the time to fix it differ significantly. It’s easier to assess and ensure estimates are consistent across risks with these figures separately specified.

Second, in addition to considering MTTD, we also factor in what proportion of the users are affected by a risk (e.g., in a sharded system, shards can fail at a given rate and incur downtime before a successful failover succeeds, but each failure only impacts a proportion of the users). Taking these two optimizations into account, our overall formula for calculating the expected annual loss from a risk is:

(MTTD + MTTR) * (365.25 / MTBF) * percent of affected users

To implement this method for your own application, here is a spreadsheet template that you can copy and populate with your own data: https://goo.gl/bnsPj7

Summary
When analyzing the reliability of an application, it is easy to generate a large list of potential risks that must be prioritized for remediation. We have demonstrated how the MTBF and MTTR values of each risk can be used to develop a prioritized list of risks based on the expected impact on the annual error budget.

We here in CRE have found this method to be extremely helpful. In addition, customers can use the expected loss figure as an input to more comprehensive risk assessments, or cost/benefit calculations of future engineering work. We hope you find it helpful too!
Quelle: Google Cloud Platform

Now you can make a private chat with just friends on a public Live video.

The next feature lets you do a joint live broadcast with a friend in a different location. It’s basically like letting other people watch your Skype or Facetime with your BFF.