Advancing no-impact and low-impact maintenance technologies

“This post continues our reliability series kicked off by my July blog post highlighting several initiatives underway to keep improving platform availability, as part of our commitment to provide a trusted set of cloud services. Today I wanted to double-click on the investments we’ve made in no-impact and low-impact update technologies including hot patching, memory-preserving maintenance, and live migration. We’ve deployed dozens of security and reliability patches to host infrastructure in the past year, many of which were implemented with no customer impact or downtime. The post that follows was written by John Slack from our core operating systems team, who is the Program Manager for several of the update technologies discussed below.” – Mark Russinovich, CTO, Azure

This post was co-authored by Apurva Thanky, Cristina del Amo Casado, and Shantanu Srivastava from the engineering teams responsible for these technologies.

 

We regularly update Azure host infrastructure to improve the reliability, performance, and security of the platform. While the purposes of these ‘maintenance’ updates vary, they typically involve updating software components in the hosting environment or decommissioning hardware. If we go back five years, the only way to apply some of these updates was by fully rebooting the entire host. This approach took customer virtual machines (VMs) down for minutes at a time. Since then, we have invested in a variety of technologies to minimize customer impact when updating the fleet. Today, the vast majority of updates to the host operating system are deployed in place with absolute transparency and zero customer impact using hot patching. In infrequent cases in which the update cannot be hot patched, we typically utilize low-impact memory preserving update technologies to roll out the update.

Even with these technologies, there are still other rare cases in which we need to do more impactful maintenance (including evacuating faulty hardware or decommissioning old hardware). In such cases, we use a combination of live migration, in-VM notifications, and planned maintenance providing customer controls.

Thanks to continued investments in this space, we are at a point where the vast majority of host maintenance activities do not impact the VMs hosted on the affected infrastructure. We’re writing this post to be transparent about the different techniques that we use to ensure that Azure updates are minimally impactful.

Plan A: Hot patching

Function-level “hot” patching provides the ability to make targeted changes to running code without incurring any downtime for customer VMs. It does this by redirecting all new invocations of a function on the host to an updated version of that function, so it is considered a ‘no impact’ update technology. Wherever possible we use hot patching to apply host updates completely avoiding any impact to the VMs running on that host. We have been using hot patching in Azure since 2017. Since then, we have worked to broaden the scope of what we can hot patch. As an example, we updated the host operating system to allow the hypervisor to be hot patched in 2018. Looking forward, we are exploring firmware hot patches. This is a place where the industry typically hasn't focused. Firmware has always been viewed as ‘if you need to update it, reboot the server,’ but we know that makes for a terrible customer experience. We've been working with hardware manufacturers to consider our own firmware to make them hot patchable and incrementally updatable.

Some large host updates contain changes that cannot be applied using function-level hot patching. For those updates, we endeavor to use memory-preserving maintenance.

Plan B: Memory-preserving maintenance

Memory-preserving maintenance involves ‘pausing’ the guest VMs (while preserving their memory in RAM), updating the host server, then resuming the VMs and automatically synchronizing their clocks. We first used memory-preserving maintenance for Azure in 2018. Since then we have improved the technology in three important ways. First, we have developed less impactful variants of memory-preserving maintenance targeted for host components that can be serviced without a host reboot. Second, we have reduced the duration of the customer experienced pause. Third, we have expanded the number of VM types that can be updated with memory preserving maintenance. While we continue to work in this space, some variants of memory-preserving maintenance are still incompatible with some specialized VM offerings like M, N, or H series VMs for a variety of technical reasons.

In the rare case we need to make more impactful maintenance (including host reboots, VM redeployment), customers are notified in advance and given the opportunity to perform the maintenance at a time suitable for their workload(s).

Plan C: Self-service maintenance

Self-service maintenance involves providing customers and partners a window of time, within which they can choose when to initiate impactful maintenance on their VM(s). This initial self-service phase typically lasts around a month and empowers organizations to perform the maintenance on their own schedules so it has no or minimal disruption to users. At the end of this self-service window, a scheduled maintenance phase begins—this is where Azure will perform the maintenance automatically. Throughout both phases, customers get full visibility of which VMs have or have not been updated—in Azure Service Health or by querying in PowerShell/CLI. Azure first offered self-service maintenance in 2018. We generally see that administrators take advantage of the self-service phase rather than wait for Azure to perform maintenance on their VMs automatically.

In addition to this, when the customer owns the full host machine, either using Azure Dedicated Hosts or Isolated virtual machines, we recently started to offer maintenance control over all non-zero impact platform updates. This includes rebootless updates which only cause a few seconds pause. It is useful for VMs running ultra-sensitive workloads which can’t sustain any interruption even if it lasts just for a few seconds. Customers can choose when to apply these non-zero impact updates in a 35-day rolling window. This feature is in public preview, and more information can be found in this dedicated blog post.

Sometimes in-place update technologies aren’t viable, like when a host shows signs of hardware degradation. In such cases, the best option is to initiate a move of the VM to another host, either through customer control via planned maintenance or through live migration.

Plan D: Live migration

Live migration involves moving a running customer VM from one “source” host to another “destination” host. Live migration starts by moving the VM’s local state (including RAM and local storage) from the source to the destination while the virtual machine is still running. Once most of the local state is moved, the guest VM experiences a short pause usually lasting five seconds or less. After that pause, the VM resumes running on the destination host. Azure first started using live migration for maintenance in 2018. Today, when Azure Machine Learning algorithms predict an impending hardware failure, live migration can be used to move guest VMs onto different hosts preemptively.

Amongst other topics, planned maintenance and AI Operations were covered in Igal Figlin’s recent Ignite 2019 session “Building resilient applications in Azure.” Watch the recording here for additional context on these, and to learn more about how to take advantage of the various resilient services Azure provides to help you build applications that are inherently resilient.

The future of Azure maintenance 

In summary, the way in which Azure performs maintenance varies significantly depending on the type of updates being applied. Regardless of the specifics, Azure always approaches maintenance with a view towards ensuring the smallest possible impact to customer workloads. This post has outlined several of the technologies that we use to achieve this, and we are working diligently to continue improving the customer experience. As we look toward the future, we are investing heavily in machine learning-based insights and automation to maintain availability and reliability. Eventually, this “AI Operations” model will carry out preventative maintenance, initiate automated mitigations, and identify contributing factors and dependencies during incidents more effectively than our human engineers can. We look forward to sharing more on these topics as we continue to learn and evolve.
Quelle: Azure

Azure Lighthouse: The managed service provider perspective

This blog post was co-authored by Nikhil Jethava, Senior Program Manager, Azure Lighthouse.

Azure Lighthouse became generally available in July this year and we have seen a tremendous response from Azure managed service provider communities who are excited about the scale and precision of management that the Azure platform now enables with cross tenant management. Similarly, customers are empowered with architecting precise and just enough access levels to service providers for their Azure environments. Both customers and partners can decide on the precise scope of the projection.

Azure Lighthouse enables partners to manage multiple customer tenants from within a single control plane, which is their environment. This enables consistent application of management and automation across hundreds of customers and monitoring and analytics to a degree that was unavailable before. The capability works across Azure services (that are Azure Resource Manager enabled) and across licensing motion. Context switching is a thing of the past now.

In this article, we will answer some of the most commonly asked questions:

How can MSPs perform daily administration tasks across different customers in their Azure tenant from a single control plane?
How can MSPs secure their intellectual property in the form of code?

Let us deep dive into a few scenarios from the perspective of a managed service provider.

Azure Automation

Your intellectual property is only yours. Service providers, using Azure delegated resource management, are no longer required to create Microsoft Azure Automation runbooks under customers’ subscription or keep their IP in the form of runbooks in someone else’s subscription. Using this functionality, Automation runbooks can now be stored in a service provider's subscription while the effect of the runbooks will be reflected on the customer's subscription. All you need to do is ensure the Automation account's service principal has the required delegated built-in role-based access control (RBAC) role to perform the Automation tasks. Service providers can create Azure Monitor action groups in customer's subscriptions that trigger Azure Automation runbooks residing in a service provider's subscription.
   

Azure Monitor alerts

Azure Lighthouse allows you to monitor the alerts across different tenants under the same roof. Why go through the hassle of storing the logs ingested by different customer's resources in a centralized log analytics workspace? This helps your customers stay compliant by allowing them to keep their application logs under their own subscription while empowering you to have a helicopter view of all customers.

Azure Resource Graph Explorer

With Azure delegated resource management, you can query Azure resources from Azure Resource Graph Explorer across tenants. Imagine a scenario where your boss has asked you for a CSV file that would list the existing Azure Virtual Machines across all the customers’ tenants. The results of the Azure Resource Graph Explorer query now include the tenant ID, which makes it easier for you to identify which Virtual Machine belongs to which customer.

 

 
 

Azure Security Center

Azure Lighthouse provides you with cross-tenant visibility of your current security state. You can now monitor compliance to security policies, take actions on security recommendations, monitor the secure score, detect threats, execute file integrity monitoring (FIM), and more, across the tenants.

   

Azure Virtual Machines

Service providers can perform post-deployment tasks on different Azure Virtual Machines from different customer's tenants using Azure Virtual Machine extensions, Azure Virtual Machine Serial Console, run PowerShell commands using Run command option, and more in the Azure Portal. Most administrative tasks on Azure Virtual Machines across the tenants can now be performed quickly since the dependency on taking remote desktop protocol (RDP) access to the Virtual Machines lessens. This also solves a big challenge since admins now do not require to log on to different Azure Subscriptions in multiple browser tabs just to get to the Virtual Machine’s resource menu.

Managing user access

Using Azure delegated resource management, MSPs no longer need to create administrator accounts (including contributor, security administrator, backup administrator, and more) in their customer tenants. This allows them to manage the lifecycle of delegated administrators right within their own Microsoft Azure Active Directory (AD) tenant. Moreover, MSPs can add user accounts to the user group in their Azure Active Directory (AD) tenant, while customers make sure those groups have the required access to manage their resources. To revoke access when an employee leaves the MSP’s organization, it can simply be removed from the specific group the access has been delegated to.

Added advantages for Cloud Solution Providers

Cloud Solution Providers (CSPs) can now save on administration time. Once you’ve set up the Azure delegated resource management for your users, there is absolutely no need for them to log in to the Partner Center (found by accessing Customers, Contoso, and finally All Resources) to administer customers’ Azure resources.

Also, Azure delegated resource management happens outside the boundaries of the Partner Center portal. Instead, the delegated user access is managed directly under Azure Active Directory. This means subscription and resource administrators in Cloud Solution Providers are no longer required to have the 'admin agent' role in the Partner Center. Therefore, Cloud Solutions Providers can now decide which users in their Azure Active Directory tenant will have access to which customer and to what extent.

More information

This is not all. There is a full feature list available for supported services and scenarios in Azure Lighthouse documentation. Check out Azure Chief Technology Officer Mark Russinovich’s blog for a deep under-the-hood view.

So, what are you waiting for? Get started with Azure Lighthouse today.
Quelle: Azure

Connecting Microsoft Azure and Oracle Cloud in the UK and Canada

In June 2019, Microsoft announced a cloud interoperability collaboration with Oracle that will enable our customers to migrate and run enterprise workloads across Microsoft Azure and Oracle Cloud.

At Oracle OpenWorld in September, the cross-cloud collaboration was a big part of the conversation. Since then, we have fielded interest from mutual customers who want to accelerate their cloud adoption across both Microsoft Azure and Oracle Cloud. Customers are interested in running their Oracle database and enterprise applications on Azure and in the scenarios enabled by the industry’s first cross-cloud interconnect implementation between Azure and Oracle Cloud Infrastructure. Many are also excited about our announcement to integrate Microsoft Teams with Oracle Cloud Applications. We have already enabled the integration of Azure Active Directory with Oracle Cloud Applications and continue to break new ground while engaging with customers and partners.

Interest from the partner community

Partners like Accenture are very supportive of the collaboration between Microsoft Azure and Oracle Cloud. Accenture recently published a white paper, articulating their own perspective and hands-on experiences while configuring the connectivity between Microsoft Azure and Oracle Cloud Infrastructure.

Another Microsoft and Oracle partner who expressed interest early on is SYSCO. SYSCO is a European IT-company specializing in solutions for the utilities sector. They offer unique industry expertise combined with highly skilled technology experts within AI and analytics, cloud, infrastructure, and applications. SYSCO is a Microsoft Gold Cloud Platform partner and a Platinum Oracle partner.

In August 2019, we introduced the ability to interconnect Microsoft Azure (UK South) and Oracle Cloud Infrastructure in London, UK providing our joint customers access to a direct, low-latency, and highly reliable network connection between Azure and Oracle Cloud Infrastructure. Prior to that, for partners like SYSCO, the ability to leverage this new collaboration between Microsoft Azure and Oracle Cloud was out of reach.

“The Microsoft Azure and Oracle Cloud Interconnect announcement is one of the best announcements in years for our customers! A direct link provides the Microsoft / Oracle cloud interconnect with a new option for all customers using proprietary business applications. With our expertise across both Microsoft and Oracle, we are thrilled to be one of the first partners to pilot this together with our customers in the utilities industry in Norway.”–Frank Vikingstad VP International – SYSCO

Azure and Oracle Cloud Infrastructure interconnect in Toronto, Canada

Today we are announcing that we have extended the Microsoft Azure and Oracle Cloud Infrastructure interconnect to include the Azure Canada Central region and Oracle Cloud Infrastructure region in Toronto, Canada.

“This unique Azure and Oracle Cloud Infrastructure solution delivers the performance, easy integration, rigorous service level agreements, and collaborative enterprise support that enterprise IT departments need to simplify their operations. We’ve been pleased by the demand for the interconnected cloud solution by our mutual customers around the world and are thrilled to extend these capabilities to our Canadian customers.” –Clive D’Souza, Sr. Director & Head of Product Management, Oracle Cloud Infrastructure

What this means for you

In addition to being able to run certified Oracle databases and applications on Azure, you now have access to new migration and deployment scenarios enabled by the interconnect. For example, you can rely on tested, validated, and supported deployments of Oracle applications on Azure with Oracle databases, Real Application Clusters (RAC) and Exadata, deployed in Oracle Cloud Infrastructure. You can also run custom applications on Azure backed by Oracle’s Autonomous Database on Oracle Cloud Infrastructure.

To learn more about the collaboration between Oracle and Microsoft and how you can run Oracle applications on Azure please refer to our website.
Quelle: Azure

Tips for learning Azure in the new year

As 2020 is upon us, it's natural to take time and reflect back on the current year’s achievements (and challenges) and begin planning for the next year. One of our New Year’s resolutions was to continue live streaming software development topics to folks all over the world. In our broadcasts in late November and December, the Azure community saw some of our 2020 plans. While sharing, many others typed in the chat from across the world that they’d set a New Year’s resolution to learn Azure and would love any pointers.

When we shared our experiences learning Azure in the “early days,” we talked about the number of great resources (available at no cost) users can take advantage of right now, and carry their learnings into the new year and beyond. 

Here are a few tips for our developer community to help them keep their resolutions to learn Azure:

Create a free account: The first thing that you’ll need is to create a free account. You can sign up with a Microsoft or GitHub account and get access to 12 months of popular free services, a 30-day Azure free trial with $200 to spend during that period and over 25 services that are free forever. Once your 30-day trial is over, we’ll notify you so you can decide if you want to upgrade to pay-as-you-go pricing and remove the spending limit. In other words, no surprises here folks!
Stay current with the Azure Application Developer and languages page: This home page is a single, unified destination for developers and architects that covers Azure application development along with all of our language pages such as .NET, Node.js, Python, and more. It is refreshed monthly and your go-to-source for our SDKs, hands-on tutorials, docs, blogs, events, and other Azure resources. Check out our recent Python for Beginners series to jump right in.
Free Developer’s Guide to Azure eBook: This free eBook includes all the updates from Microsoft’s first-party conferences, along with new services and features announced since then. In addition to these important services, we drill into practical examples that you can use in the real world and included a table and reference architecture that show you “what to use when” for databases, containers, serverless scenarios, and more. There is also a key focus on security to help you stop potential threats to your business before they happen. You’ll also see brand new sections on IoT, DevOps, and AI/ML that you can take advantage of today. In the more than 20 pages of demos, you’ll be diving into topics that include creating and deploying .NET Core web apps and SQL Server to Azure from scratch, building on to the application to perform analysis of the data with Cognitive Services. After the app is created, we’ll make it more robust and easier to update by incorporating CI/CD using API Management to control our APIs and generate documentation automatically.
Azure Tips and Tricks (weekly tips and videos): Azure Tips and Tricks helps developers learn something new within a couple of minutes. Since inception in 2017, the collection has grown to over 230 tips and more than 80 videos, conference talks, and several eBooks spanning the entire universe of the Azure platform. Featuring a new weekly tip and video it is designed to help you boost your productivity with Azure, and all tips are based on practical real-world scenarios. The series spans the entire universe of the Azure platform from Azure App Services, to containers, and more. Swing by weekly for a tip or stay for hours watching our Azure YouTube playlist.
Rock, Paper, Scissors, Lizard, Spock sample application: Rock, Paper, Scissors, Lizard, Spock is the geek version of the classic Rock, Paper, Scissors game. Rock, Paper, Scissors, Lizard, Spock is created by Sam Kass and Karen Bryla.
The sample application running in Azure was presented at Microsoft Ignite 2019 by Scott Hanselman and friends. It’s a multilanguage application built with Visual Studio and Visual Studio Code, deployed with GitHub Actions, and running on Azure Kubernetes Service (AKS). The sample application also uses Azure Machine Learning and Azure Cognitive Services (custom vision API). Languages used in this application include .NET, Node.js, Python, Java, and PHP.
Microsoft.Source Newsletter: Get the latest articles, documentation, and events from our curated monthly developer community newsletter. Learn about new technologies and find opportunities to connect with other developers online and locally. Each edition, you’ll have the opportunity to share your feedback and shape the newsletter as it grows and evolves.

Additional resources

Here are some bonus tips to help you keep up with Azure as it changes:

Azure documentation is the most comprehensive and current resource you’ll find for all of our Azure services.
See how Microsoft does DevOps: Customers are looking for guidance and insights about companies that have undergone a transformation through DevOps. To that end, we are sharing the stories of four Microsoft teams that have experienced DevOps transformation, with guidance on lessons learned and ways to drive organizational change through Azure technologies and internal culture. The stories are aimed at providing practical information about DevOps adoption to developers, IT professionals, and decision-makers.
Azure Friday is a video series that releases up to three new episodes per week to keep up with the latest in Azure with hosts such as Scott Hanselman.

Quelle: Azure

New features in Azure Monitor Metrics Explorer based on your feedback

A few months ago, we posted a survey to gather feedback on your experience with metrics in Azure Portal. Thank you for participation and for providing valuable suggestions!

We want to share some of the insights we gained from the survey and highlight some of the features that we delivered based on your feedback. These features include:

Resource picker that supports multi-resource scoping.
Splitting by dimension allows limiting the number of time series and specifying sort order.
Charts can show a large number of datapoints.
Improved chart legends.

Resource picker with multi-resource scoping

One of the key pieces of feedback we heard was about the resource picker panel. You said that being able to select only one resource at a time when choosing a scope is too limiting. Now you can select multiple resources across resource groups in a subscription.

Ability to limit the number of timeseries and change sort order when splitting by dimension

Many of you asked for the ability to configure the sort order based on dimension values, and for control over the maximum number of timeseries shown on the chart. Those who asked explained that for some metrics, including available memory and remaining disk space, they want to see the timeseries with smallest values, while for other metrics, including CPU utilization or count of failures, showing the timeseries with highest values make more sense. To address your feedback, we expanded the dimension splitter selector with Sort order and Limit count inputs.
 

Charts that show a large number of datapoints

Charts with multiple timeseries over the long period, especially with short time grain are based on queries that return lots of datapoints. Unfortunately, processing too many datapoints may slow down chart interactions. To ensure the best performance, we used to apply a hard limit on the number of datapoints per chart, prompting users to lower the time range or to increase the time grain when the query returns too much data.

Some of you found the old experience frustrating. You said that occasionally you might want to plot charts with lots of datapoints, regardless of performance. Based on your suggestions, we changed the way we handle the limit. Instead of blocking chart rendering, we now display a message that suggests that the metrics query will return a lot of data, but will let you proceed anyways (with a friendly reminder that you might need to wait longer for the chart to display).
   
High-density charts from lots of datapoints can be useful to visualize the outliers, as shown in this example:
  

Improved chart legend

A small but useful improvement was made based on your feedback that the chart legends often wouldn’t fit on the chart, making it hard to interpret the data. This was almost always happening with the charts pinned to dashboards and rendered in the tight space of dashboard tiles, or on screens that have a smaller resolution. To solve the problem, we now let you scroll the legend until you find the data you need:
  

Feedback

Let us know how we're doing and what more you'd like to see. Please stay tuned for more information on these and other new features in the coming months. We are continuously addressing pain points and making improvements based on your input.

If you have any questions or comments before our next survey, please use the feedback button on the Metrics blade. Don’t feel shy about giving us a shout out if you like a new feature or are excited about the direction we’re headed. Smiles are just as important in influencing our plans as frowns.

Quelle: Azure

Advancing Azure Active Directory availability

“Continuing our Azure reliability series to be as transparent as possible about key initiatives underway to keep improving availability, today we turn our attention to Azure Active Directory. Microsoft Azure Active Directory (Azure AD) is a cloud identity service that provides secure access to over 250 million monthly active users, connecting over 1.4 million unique applications and processing over 30 billion daily authentication requests. This makes Azure AD not only the largest enterprise Identity and Access Management solution, but easily one of the world’s largest services. The post that follows was written by Nadim Abdo, Partner Director of Engineering, who is leading these efforts.” – Mark Russinovich, CTO, Azure

 

Our customers trust Azure AD to manage secure access to all their applications and services. For us, this means that every authentication request is a mission critical operation. Given the critical nature and the scale of the service, our identity team’s top priority is the reliability and security of the service. Azure AD is engineered for availability and security using a truly cloud-native, hyper-scale, multi-tenant architecture and our team has a continual program of raising the bar on reliability and security.

Azure AD: Core availability principles

Engineering a service of this scale, complexity, and mission criticality to be highly available in a world where everything we build on can and does fail is a complex task.

Our resilience investments are organized around the set of reliability principles below:

Our availability work adopts a layered defense approach to reduce the possibility of customer visible failure as much as possible; if a failure does occur, scope down the impact of that failure as much as possible, and finally, reduce the time it takes to recover and mitigate a failure as much as possible.

Over the coming weeks and months, we dive deeper into how each of the principles is designed and verified in practice, as well as provide examples of how they work for our customers.

Highly redundant

Azure AD is a global service with multiple levels of internal redundancy and automatic recoverability. Azure AD is deployed in over 30 datacenters around the world leveraging Azure Availability Zones where present. This number is growing rapidly as additional Azure Regions are deployed.

For durability, any piece of data written to Azure AD is replicated to at least 4 and up to 13 datacenters depending on your tenant configuration. Within each data center, data is again replicated at least 9 times for durability but also to scale out capacity to serve authentication load. To illustrate—this means that at any point in time, there are at least 36 copies of your directory data available within our service in our smallest region. For durability, writes to Azure AD are not completed until a successful commit to an out of region datacenter.

This approach gives us both durability of the data and massive redundancy—multiple network paths and datacenters can serve any given authorization request, and the system automatically and intelligently retries and routes around failures both inside a datacenter and across datacenters.

To validate this, we regularly exercise fault injection and validate the system’s resiliency to failure of the system components Azure AD is built on. This extends all the way to taking out entire datacenters on a regular basis to confirm the system can tolerate the loss of a datacenter with zero customer impact.

No single points of failure (SPOF)

As mentioned, Azure AD itself is architected with multiple levels of internal resilience, but our principle extends even further to have resilience in all our external dependencies. This is expressed in our no single point of failure (SPOF) principle.

Given the criticality of our services we don’t accept SPOFs in critical external systems like Distributed Name Service (DNS), content delivery networks (CDN), or Telco providers that transport our multi-factor authentication (MFA), including SMS and Voice. For each of these systems, we use multiple redundant systems configured in a full active-active configuration.

Much of that work on this principle has come to completion over the last calendar year, and to illustrate, when a large DNS provider recently had an outage, Azure AD was entirely unaffected because we had an active/active path to an alternate provider.

Elastically scales

Azure AD is already a massive system running on over 300,000 CPU Cores and able to rely on the massive scalability of the Azure Cloud to dynamically and rapidly scale up to meet any demand. This can include both natural increases in traffic, such as a 9AM peak in authentications in a given region, but also huge surges in new traffic served by our Azure AD B2C which powers some of the world’s largest events and frequently sees rushes of millions of new users.

As an added level of resilience, Azure AD over-provisions its capacity and a design point is that the failover of an entire datacenter does not require any additional provisioning of capacity to handle the redistributed load. This gives us the flexibility to know that in an emergency we already have all the capacity we need on hand.

Safe deployment

Safe deployment ensures that changes (code or configuration) progress gradually from internal automation to internal to Microsoft self-hosting rings to production. Within production we adopt a very graduated and slow ramp up of the percentage of users exposed to a change with automated health checks gating progression from one ring of deployment to the next. This entire process takes over a week to fully rollout a change across production and can at any time rapidly rollback to the last well-known healthy state.

This system regularly catches potential failures in what we call our ‘early rings’ that are entirely internal to Microsoft and prevents their rollout to rings that would impact customer/production traffic.

Modern verification

To support the health checks that gate safe deployment and give our engineering team insight into the health of the systems, Azure AD emits a massive amount of internal telemetry, metrics, and signals used to monitor the health of our systems. At our scale, this is over 11 PetaBytes a week of signals that feed our automated health monitoring systems. Those systems in turn trigger alerting to automation as well as our team of 24/7/365 engineers that respond to any potential degradation in availability or Quality of Service (QoS).

Our journey here is expanding that telemetry to provide optics of not just the health of the services, but metrics that truly represent the end-to-end health of a given scenario for a given tenant. Our team is already alerting on these metrics internally and we’re evaluating how to expose this per-tenant health data directly to customers in the Azure Portal.

Partitioning and fine-grained fault domains

A good analogy to better understand Azure AD are the compartments in a submarine, designed to be able to flood without affecting either other compartments or the integrity of the entire vessel.

The equivalent for Azure AD is a fault domain, the scale units that serve a set of tenants in a fault domain are architected to be completely isolated from other fault domain’s scale units. These fault domains provide hard isolation of many classes of failures such that the ‘blast radius’ of a fault is contained in a given fault domain.

Azure AD up to now has consisted of five separate fault domains. Over the last year, and completed by next summer, this number will increase to 50 fault domains, and many services, including Azure Multi-Factor Authentication (MFA), are moving to become fully isolated in those same fault domains.

This hard-partitioning work is designed to be a final catch all that scopes any outage or failure to no more than 1/50 or ~2% of our users. Our objective is to increase this even further to hundreds of fault domains in the following year.

A preview of what’s to come

The principles above aim to harden the core Azure AD service. Given the critical nature of Azure AD, we’re not stopping there—future posts will cover new investments we’re making including rolling out in production a second and completely fault-decorrelated identity service that can provide seamless fallback authentication support in the event of a failure in the primary Azure AD service.

Think of this as the equivalent to a backup generator or uninterruptible power supply (UPS) system that can provide coverage and protection in the event the primary power grid is impacted. This system is completely transparent and seamless to end users and is now in production protecting a portion of our critical authentication flows for a set of M365 workloads. We’ll be rapidly expanding its applicability to cover more scenarios and workloads.

We look forward to sharing more on our Azure Active Directory Identity Blog, hearing your questions and topics of interest for future posts.
Quelle: Azure

New enhancements for Azure IoT Edge automatic deployments

Since releasing Microsoft Azure IoT Edge, we have seen many customers using IoT Edge automatic deployments to deploy workloads to the edge at scale. IoT Edge automatic deployments handle the heavy lifting of deploying modules to the relevant Azure IoT Edge devices and allow operators to keep a close eye on status to quickly address any problems. Customers love the benefits and have given us feedback on how to make automatic deployments even better through greater flexibility and seamless experiences. Today, we are sharing a set of enhancements to IoT Edge automatic deployments that are a direct result of this feedback. These enhancements include layered deployments, deploying marketplace modules from the Azure portal and other UI updates, and module support for automatic device configurations.

Layered deployments

Layered deployments are a new type of IoT Edge automatic deployments that allow developers and operators to independently deploy subsets of modules. This avoids the need to create an automatic deployment for every combination of modules that may exist across your device fleet. Microsoft Azure IoT Hub evaluates all applicable layered deployments to determine the final set of modules for a given IoT Edge device. Layered deployments have the same basic components as any automatic deployment. They target devices based on tags in the device twins and provide the same functionality around labels, metrics, and status reporting. Layered deployments also have priorities assigned to them, but instead of using the priority to determine which deployment is applied to a device, the priority determines how multiple deployments are ranked on a device. For example, if two layered deployments have a module or a route with the same name, the layered deployment with the higher priority will be applied while the lower priority is overwritten.

This first illustration shows how all modules need to be included in each regular deployment, requiring a separate deployment for each target group.

This second illustration shows how layered deployments allow modules to be deployed independently to each target group, with a lower overall number of deployments.

Revamped UI for IoT Edge automatic deployments

There are updates throughout the IoT Edge automatic deployments UI in the Azure portal. For example, you can now select modules from Microsoft Azure Marketplace from directly within the create deployment experience. The Azure Marketplace features many Azure IoT Edge modules built by Microsoft and partners.

Automatic configuration for module twins

Automatic device management in Azure IoT Hub automates many of the repetitive and complex tasks of managing large device fleets by using automatic device configurations to update and report status on device twin properties. We have heard from many of you that you would like the equivalent functionality for configuring module twins, and are happy to share that this functionality is now available.

Next steps

Learn about layered deployments for Azure IoT Edge
Learn about automatic device management support for module twins

Quelle: Azure

Better performance with bursting enhancement on Azure Disks

We introduced the preview of bursting support on Azure Premium SSD Disks, and new disk sizes 4/8/16 GiB on both Premium & Standard SSDs at Microsoft Ignite in November. We would like to share more details about it. With bursting, eligible Premium SSD disks can now achieve up to 30x of the provisioned performance target, better handling for spiky workloads. If you have workloads running on-premises with less predictable disk traffic, you can migrate to Azure and improve your overall performance taking advantage of bursting support.

Disk bursting is enforced on a credit based system, where you will accumulate credits when traffic is below provisioned target and consume credit when it exceeds provisioned. You can best leverage the capability in these scenarios below:

OS disks to accelerate virtual machine (VM) boot: You can expect to experience a boost as part of VM boot where reads to the OS disk may be issued at a higher rate. If you are hosting cloud workstations on Azure, your applications launch time can potentially be reduced taking advantage of additional disk throughput.
Data disks to accommodate spiky traffic: Some production operations trigger spikes of disk input/output (IO) by design. For example, if you conduct a database checkpoint, there will be a sudden increase of writes against the data disk, and a similar increase in reads for backup operations. Disk bursting provides you better flexibility to handle any excepted or unexpected change of disk traffic pattern.

With this preview release, we lower the entry cost of cloud adoption with smaller disk sizes and make our disk offerings more performant leveraging burst support. Start leveraging these new disk capabilities to build your most performant, robust and cost-efficient solution on Azure today!

Getting Started

Create new managed disks on the burst applicable sizes using the Azure portal, Powershell, or command-line interface (CLI) now! You can find the specifications of burst eligible and new disk sizes in the table below. The preview regions that support bursting and new disk sizes are listed in our Azure Disks frequently asked questions article. We are actively extending the preview support to more regions.

Premium SSD managed disks

Bursting capability is supported on Premium SSD managed disks only. It will be enabled by default for all new deployments in the supported regions. For existing disks of the applicable sizes, you can enable bursting with either of the two options: detach and re-attach the disk or stop and restart the attached VM. To learn more details on how bursting works, please refer to this "What disk types are available in Azure?" article.

Burst Capable Disks

Disk Size

Provisioned IOPS per disk

Provisioned Bandwidth per disk

Max Burst IOPS per disk

Max Burst Bandwidth per disk

Max Burst Duration at Peak Burst Rate

P1 – New

4 GiB

120

25 MiB/sec

3,500

170 MiB/sec

30 mins

P2 – New

8 GiB

120

25 MiB/sec

3,500

170 MiB/sec

30 mins

P3 – New

16 GiB

120

25 MiB/sec

3,500

170 MiB/sec

30 mins

P4

32 GiB

120

25 MiB/sec

3,500

170 MiB/sec

30 mins

P6

64 GiB

240

50 MiB/sec

3,500

170 MiB/sec

30 mins

P10

128 GiB

500

100 MiB/sec

3,500

170 MiB/sec

30 mins

P15

256 GiB

1,100

125 MiB/sec

3,500

170 MiB/sec

30 mins

P20

512 GiB

2,300

150 MiB/sec

3,500

170 MiB/sec

30 mins

Standard SSD Managed Disks

Here are the new disk sizes introduced on Standard SSD Disks. The performance targets define the max IOPS and bandwidth you can achieve on these sizes. Compared to Premium SSD Disks above, the disk IOPS and bandwidth offered are not provisioned. For your performance sensitive workloads or single instance deployment, we recommend you leverage Premium SSDs.

 

Disk Size

Max IOPS per disk

Max Bandwidth per disk

E1 – New

4 GiB

120

25 MB/sec

E2 – New

8 GiB

120

25 MB/sec

E3 – New

16 GiB

120

25 MB/sec

Visit our service website to explore the Azure Disk Storage portfolio. To learn about pricing, you can visit the Azure Managed Disks pricing page.

General feedback

We look forward to hearing your feedback on the new disk sizes. Please email us at AzureDisks@microsoft.com.
Quelle: Azure

New features in Azure Monitor metrics explorer based on your feedback

A few months ago, we posted a survey to gather feedback on your experience with metrics in Azure Portal. Thank you for participation and providing valuable suggestions! We appreciate your input, whether you are working on a hobby project, in a governmental organization, or any size company—small to huge. We want to share some of the insights we gained from the survey and highlight some of the features that we delivered based on your feedback. These features include:Resource picker that supports multi-resource scoping.Splitting by dimension allows limiting the number of time series and specifying sort order.Charts can show large number of datapoints.Improved chart legends.Resource picker with multi-resource scopingOne of the key pieces of feedback we heard was about the resource picker panel. You said that being able to select only one resource at a time when choosing a scope is too limiting. Now you can select multiple resources across resources groups in a subscription.  Ability to limit the number of timeseries and change sort order when splitting by dimensionMany of you asked for ability to configure the sort order based on dimension values, and for control over the maximum number of timeseries shown on the chart. Those who asked, explained that for some metrics, such as “Available memory” and “Remaining disk space,” they want to see the timeseries with smallest values, while for other metrics, including “CPU Utilization” or “Count of Failures,” showing the timeseries with highest values make more sense. To make it possible, we expanded the dimension splitter selector with Sort order and Limit count inputs.  Charts that show large number of datapointsCharts with multiple timeseries over the long period, especially with short time grain are based on queries that return lots of datapoints. Unfortunately, processing too many datapoints may slow down chart interactions. To ensure the best performance, we used to apply a hard limit on the number of datapoints per chart, prompting users to lower the time range or to increase the time grain when the query returns too much data. Some of you found the old experience frustrating. You said that that occasionally you might want to plot charts with lots of datapoints, regardless of performance. Based on your suggestions, we changed the way we handle the limit. Instead of blocking chart rendering, we now display a message that suggests that the metrics query will return a lot of data, but letting your proceed anyways (with a friendly reminder that you might need to wait longer for the chart to display).   High-density charts from lots of datapoints can be useful to visualize the outliers, as shown in this example:   Improved chart legendA small but useful improvement was made based on your feedback that the chart legends often wouldn’t fit on the chart, making it hard to interpret the data. This was almost always happening with the charts pinned to dashboards and rendered in the tight space of dashboard tiles, or on screens that have smaller resolution. To solve the problem, we now let you scroll the legend until you find the data you need:  FeedbackLet us know how we’re doing and what more you’d like to see. Please stay tuned for more information on these and other new features in the coming months. We are continuously addressing pain points and making improvements based on your input.If you have any questions or comments before our next survey, please use the feedback button on the Metrics blade. Don’t feel shy about giving us a shout out if you like a new feature or are excited about the direction we’re headed. Smiles are just as important in influencing our plans as frowns!
Quelle: Azure