Fv2 VMs are now available, the fastest VMs on Azure

Today, I’m excited to announce the general availability of our new Fv2 VM family. Azure now offers the fastest Intel® Xeon® Scalable processor, code-named Skylake, in the public cloud. In Azure, we have seen growing demand for massive, large-scale computation from customers doing financial modeling, scientific analysis, genomics, geothermal visualization, and deep learning. Our drive to continuously innovate in Azure allows us to offer cost-effective, best-in-class hardware for these world-changing workloads. With our recent announcements of ND, offering the first Tesla P40s in the public cloud, and NCv2, offering Tesla P100s, along with InfiniBand connectivity available in no other cloud, Azure enables amazing scale-out, GPU-powered calculations. Now, with Fv2, Azure offers the fastest CPU-powered calculations on the Intel Skylake processor.

These VM sizes are hyper-threaded and run on the Intel® Xeon® Platinum 8168 processor, featuring a base core frequency of 2.7 GHz and a maximum single-core turbo frequency of 3.7 GHz. Intel® AVX-512 instructions, new to the Intel® Xeon® Scalable processor family, provide up to a 2X performance boost for vector-processing workloads on both single- and double-precision floating-point operations. In other words, they are really fast for any computational workload. The closest cloud competitor offering Skylake currently offers only 2.0 GHz. This makes Azure the best place for computationally intensive workloads, with the newest and best tools for the job.

The Fv2 VMs are available in seven sizes, with the largest featuring 72 vCPUs and 144 GiB of RAM. These sizes support Azure Premium Storage disks by default and also support accelerated networking for the highest throughput of any cloud and ultra-low VM-to-VM latencies. With the best performance-to-price ratio on Azure, Fv2 VMs are a perfect fit for your compute-intensive workloads.

Here are the details on these new Fv2 VM sizes:

| Size | vCPUs | Memory (GiB) | Local SSD (GiB) | Max cached and local disk IOPS (cache size in GiB) | Max data disks (1023 GB each) | Max NICs |
| --- | --- | --- | --- | --- | --- | --- |
| Standard_F2s_v2 | 2 | 4 | 16 | 4000 (32) | 4 | 2 |
| Standard_F4s_v2 | 4 | 8 | 32 | 8000 (64) | 8 | 2 |
| Standard_F8s_v2 | 8 | 16 | 64 | 16000 (128) | 16 | 4 |
| Standard_F16s_v2 | 16 | 32 | 128 | 32000 (256) | 32 | 8 |
| Standard_F32s_v2 | 32 | 64 | 256 | 64000 (512) | 32 | 8 |
| Standard_F64s_v2 | 64 | 128 | 512 | 128000 (1024) | 32 | 8 |
| Standard_F72s_v2 | 72 | 144 | 576 | 144000 (2048) | 32 | 8 |

Starting today, these sizes are available in West US 2, West Europe, and East US, with Southeast Asia coming soon. I hope you enjoy these new sizes, and I am excited to see what you will do with them!

See ya around, 
Corey
Source: Azure

Building good SLOs – CRE life lessons

By Robert van Gent and Stephen Thorne, Customer Reliability Engineers and Cody Smith, Site Reliability Engineer

In a previous episode of CRE Life Lessons, we discussed how choosing good service level indicators (SLIs) and service level objectives (SLOs) is critical for defining and measuring the reliability of your service. There’s also a whole chapter in the SRE book about this topic. In this episode, we’re going to get meta and go into more detail about some best practices we use at Google to formulate good SLOs for our SLIs.

SLO musings

SLOs are objectives that your business aspires to meet and intends to take action to defend; just remember, your SLOs are not your SLAs (service level agreements)! You should pick SLOs that represent the most critical aspects of the user experience. If you meet an SLO, your users and your business should be happy. Conversely, if the system does not meet the SLO, that implies there are users who are being made unhappy!

Your business needs to be able to defend an endangered SLO by reducing the frequency of outages, or by reducing the impact of outages when they occur. Some ways to do this include slowing down the rate at which you release new versions, or implementing reliability improvements instead of features. All parts of your business need to acknowledge that these SLOs are valuable and should be defended through trade-offs.

Here are some important things to keep in mind when designing your SLOs:

An SLO can be a useful tool for resolving meaningful uncertainty about what a team should be doing. The objective is a line in the sand between “we definitely need to work on this issue” and “we might not need to work on this issue.” Therefore, don’t pick SLO targets that are higher than what you actually need, even if you happen to be meeting them now, as that reduces your flexibility to change things in the future, including making trade-offs against reliability in favor of things like development velocity.

Group queries into SLOs by user experience, rather than by specific product elements or internal implementation details. For example, direct responses to user action should be grouped into a different SLO than background or ancillary responses (e.g., thumbnails). Similarly, “read” operations (e.g., view product) should be grouped into a different SLO than lower volume but more important “write” ones (e.g., check out). Each SLO will likely have different availability and latency targets.

Be explicit about the scope of your SLOs and what they cover (which queries, which data objects) and under what conditions they are offered. Be sure to consider questions like whether or not to count invalid user requests as errors, or what happens when a single client spams you with lots of requests.

Finally, though somewhat in tension with the above, keep your SLOs simple and specific. It’s better not to cover non-critical operations with an SLO than to dilute what you really care about. Gain experience with a small set of SLOs; launch and iterate!

Example SLOs

Availability

Here we’re trying to answer the question “Was the service available to our user?” Our approach is to count the failures and known missed requests, and report the measurement as a percentage. Record errors from the first point that is in your control (e.g., data from your Load Balancer, not from the browser’s HTTP requests). For requests between microservices, record data from the client side, not the server side.

That leaves us with an SLO of the form:

Availability: <service> will <respond successfully> for <a customer scope> for at least <percentage> of requests in the <SLO period>

For example . . .

Availability: Node.js will respond with a non-503 within 30 seconds for browser pageviews for at least 99.95% of requests in the month.

. . . and . . .

Availability: Node.js will respond with a non-503 within 60 seconds for mobile API calls for at least 99.9% of requests in the month.

For requests that took longer than 30 seconds (60 seconds for mobile), the service might as well have been down, so they count against our availability SLO.
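As a rough illustration of measuring this, here is a minimal Python sketch (with hypothetical field names, and deadlines matching the example SLOs above) that computes the availability SLI from per-request records collected at the Load Balancer:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int   # HTTP status returned by the service
    latency_ms: float  # response time measured at the load balancer
    scope: str         # customer scope, e.g. "browser" or "mobile"

# Hypothetical per-scope deadlines, matching the example SLOs above.
DEADLINE_MS = {"browser": 30_000, "mobile": 60_000}

def availability(requests: list[Request], scope: str) -> float:
    """Fraction of in-scope requests that got a non-503 response within
    the deadline; compare the result against the SLO target."""
    in_scope = [r for r in requests if r.scope == scope]
    if not in_scope:
        return 1.0  # no traffic, nothing to violate
    good = sum(
        1 for r in in_scope
        if r.status_code != 503 and r.latency_ms <= DEADLINE_MS[scope]
    )
    return good / len(in_scope)

# e.g. the browser SLO is met if availability(monthly_requests, "browser") >= 0.9995
```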

Latency

Latency is a measure of how well a service performed for our users. We count the number of queries that are slower than a threshold, and report them as a percentage of total queries. The best measurements are done as close to the client as possible, so measure latency at the Load Balancer for incoming web requests, and from the client, not the server, for requests between microservices.

Latency: <service> will respond within <time limit> for at least <percentage> of requests in the <SLO period>.

For example . . .

Latency: Node.js will respond within 250ms for at least 50% of requests in the month, and within 3000ms for at least 99% of requests in the month.

Percentages are your friend . . .
Note that we expressed our latency SLI as a percentage: “percentage of requests with latency < 3000ms” with a target of 99%, not “99th percentile latency in ms” with a target of “< 3000ms”. This keeps SLOs consistent and easy to understand, because they all have the same unit and the same range. Also, accurately computing percentiles across large data sets is hard, while counting the number of requests below a threshold is easy. You’ll likely want to monitor multiple thresholds (e.g., percentage of requests < 50ms, < 250ms, . . .), but having SLO targets of 99% for one threshold, and possibly 50% for another, is generally sufficient.
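As an illustration, counting requests under each threshold is a single pass over the data; here is a small sketch (variable names hypothetical):

```python
def latency_sli(latencies_ms: list[float], thresholds_ms: list[float]) -> dict[float, float]:
    """For each threshold, return the fraction of requests that responded
    within it. Counting below a threshold is a cheap, streamable operation,
    unlike computing exact percentiles over a large dataset."""
    total = len(latencies_ms)
    if total == 0:
        return {t: 1.0 for t in thresholds_ms}
    return {t: sum(1 for l in latencies_ms if l <= t) / total for t in thresholds_ms}

# Checking "within 250ms for >= 50% and within 3000ms for >= 99% of requests":
# sli = latency_sli(monthly_latencies_ms, [250, 3000])
# met = sli[250] >= 0.50 and sli[3000] >= 0.99
```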

Avoid targeting average (mean) latency — it’s almost never what you want. Averages can hide outliers, and sufficiently small values are indistinguishable to users: they will not notice a difference between 50ms and 250ms for a full page response time, so the two should be treated as comparably good. There’s a big difference between an average of 250ms because all requests are taking 250ms, and an average of 250ms because 95% of requests are taking 1ms and 5% of requests are taking 5s (0.95 × 1ms + 0.05 × 5,000ms ≈ 250ms).

. . . except 100%
A target of 100% is impossible over any meaningful length of time. It’s also likely not necessary. SREs use SLOs to embrace risk; the inverse of your SLO target is your error budget, and if your SLO target is 100% that means you have no error budget! In addition, SLOs are a tool for establishing team priorities — dividing top-priority work from work that’s prioritized on a case-by-case basis. SLOs tend to lose their credibility if every individual failure is treated as a top priority.
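To make the error-budget arithmetic concrete, here is a small sketch using made-up traffic numbers; the point is simply that the budget is the complement of the target, and a 100% target leaves a budget of zero:

```python
def error_budget(slo_target: float, expected_requests: int) -> int:
    """Number of requests allowed to fail during the SLO period."""
    return round((1 - slo_target) * expected_requests)

# With a hypothetical 30M requests per month:
print(error_budget(0.999, 30_000_000))  # 99.9% target -> budget of 30000 bad requests
print(error_budget(1.0, 30_000_000))    # 100% target  -> budget of 0
```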

Regardless of the SLO target that you eventually choose, the discussion is likely to be very interesting; be sure to capture the rationale for your chosen target for posterity.

Reporting
Report on your SLOs quarterly, and use quarterly aggregates to guide policies, particularly pager thresholds. Using shorter periods tends to shift focus to smaller, day-to-day issues, and away from the larger, infrequent issues that are more damaging. Any live reporting should use the same sliding window as the quarterly report, to avoid confusion; the published quarterly report is merely a snapshot of the live report.

Example quarterly SLO summary

This is how you might present the historical performance of your service against SLO, e.g., for a semi-annual service report, where the SLO period is one quarter:

| SLO | Target | Q2 | Q3 |
| --- | --- | --- | --- |
| Web Availability | 99.95% | 99.92% | 99.96% |
| Mobile Availability | 99.9% | 99.91% | 99.97% |
| Latency ≤ 250ms | 50% | 74% | 70% |
| Latency ≤ 3000ms | 99% | 99.4% | 98.9% |

For SLO-dependent policies such as paging alerts or freezing of releases when you’ve spent the error budget, use a sliding window shorter than a quarter. For example, you might trigger a page if you spent ≥1% of the quarterly error budget over the last four hours, or you might freeze releases if you spent ≥ ⅓ of the quarterly budget in the last 30 days.
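A minimal sketch of such a policy check, assuming your monitoring system can report the number of SLO-violating (“bad”) requests over an arbitrary window and you have an estimate of expected quarterly traffic (both hypothetical here):

```python
def budget_spent_fraction(bad_requests: int, slo_target: float,
                          expected_quarterly_requests: int) -> float:
    """Fraction of the quarterly error budget consumed by `bad_requests`."""
    budget = (1 - slo_target) * expected_quarterly_requests
    return bad_requests / budget if budget > 0 else float("inf")

def should_page(bad_last_4h: int, slo_target: float,
                expected_quarterly_requests: int) -> bool:
    # Page if >= 1% of the quarterly error budget was spent in the last four hours.
    return budget_spent_fraction(bad_last_4h, slo_target,
                                 expected_quarterly_requests) >= 0.01

def should_freeze_releases(bad_last_30d: int, slo_target: float,
                           expected_quarterly_requests: int) -> bool:
    # Freeze releases if >= 1/3 of the quarterly budget was spent in the last 30 days.
    return budget_spent_fraction(bad_last_30d, slo_target,
                                 expected_quarterly_requests) >= 1 / 3
```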

Breakdowns of SLI performance (by region, by zone, by customer, by specific RPC, etc.) are useful for debugging and possibly for alerting, but aren’t usually necessary in the SLO definition or quarterly summary.

Finally, be mindful about with whom you share your SLOs, especially early on. They can be a very useful tool for communicating expectations about your service, but the more broadly they are exposed the harder it is to change them.

Conclusion

SLOs are a deep topic, but we’re often asked about handy rules of thumb people can use to start reasoning about them. The SRE book has more on the topic, but if you start with these basic guidelines, you’ll be well on your way to avoiding the most common mistakes people make when starting with SLOs. Thanks for reading; we hope this post has been helpful. And as we say here at Google, may the queries flow, your SLOs be met, and the pager stay silent!
Source: Google Cloud Platform