The Immortal Mind

A Manifesto for Organizational Intelligence That SurvivesA Different QuestionFor centuries, humanity has asked a single question:How do we preserve knowledge?We built libraries. We built archives. We built databases. We built clouds.And yet knowledge continues to disappear.Engineers retire. Teams reorganize. Projects are abandoned. Companies are acquired. Entire decades of hard-won experience vanish with the people who created it.The problem is not storage.The problem is continuity.Infrastructure Can Already Heal ItselfWe have been building self-healing digital infrastructure with OpenKubes.Git stores the desired state. Kubernetes reconciles reality back to that state. Servers fail. Clusters disappear. Entire environments get rebuilt from scratch — automatically, without human intervention.Infrastructure survives because it remembers what it should be.We call this the Immortal Platform.Git is the contract.Kubernetes is the enforcer.Target recovery time: under ten minutes. No runbook. No 3am call. No tribal knowledge required.When we built this, we thought we were solving an infrastructure problem.We were actually solving the beginning of a much larger problem.What About Intelligence?Organizations face a different kind of failure — one that no monitoring system detects, no alert fires for, and no on-call engineer gets paged about.The slow disappearance of organizational intelligence.What happens when the engineer who designed the system retires?Who remembers why that architectural decision was made in 2019?Who remembers the three failed approaches before the solution that worked?Who remembers the lessons from the production incident that took down the factory floor for six hours?Most organizations have no answer.Their infrastructure is documented. Their intelligence is not.We have spent ten years building and operating Kubernetes platforms across automotive plants, financial institutions, industrial facilities, and government agencies. We have seen the same pattern repeat itself dozens of times: a new team inherits a platform built by people who are no longer there. They spend months — sometimes years — reverse-engineering decisions that took the original team weeks to make. They repeat mistakes that were already made and documented somewhere no one can find. They abandon patterns that worked because nobody explained why they were there.The cost is not just time. It is confidence. Every inherited system that lacks its original context becomes a system nobody fully trusts, nobody fully understands, and nobody wants to touch.This is not a technology problem. This is a memory problem.The Immortal MindOpenKubes AI begins with a simple idea:Knowledge should be as durable as infrastructure.The founding principle of OpenKubes is that the platform owns contracts, not components. Infrastructure components come and go — the contracts they fulfill persist. The Immortal Mind extends that same principle to knowledge: decisions, context, and lessons are contracts too. Individual people, teams, and documents come and go — the organizational intelligence they carry must persist.If we can build infrastructure that heals itself — that reads a desired state from Git, reconciles toward it, and recovers from failure without human intervention — then we can build intelligence systems that do the same.Not storing documents in a folder that nobody reads.Not writing runbooks that become outdated before the ink dries.But genuinely living knowledge — continuously updated, continuously reconciled, continuously connected to the systems and decisions it describes.Just as Kubernetes reconciles infrastructure to its desired state, future AI systems can continuously reconcile organizational knowledge to its current reality.Git is the contract.Kubernetes is the enforcer.AI is the memory.To be precise about what we mean: the memory itself does not live inside a model. It lives in Git, in architecture decision records, in the knowledge graph, and in the context that evolves with the platform. AI is what connects, understands, and makes that memory accessible. Models will come and go. The organizational memory must endure.The Architecture of Organizational ImmortalityOpenKubes AI envisions four foundational layers — not as a product announcement, but as an architectural direction:Layer 1: Knowledge GraphA structured, living representation of organizational knowledge.People. Systems. Projects. Decisions. Failures. Lessons. Relationships. Dependencies. Evolution.Not a static diagram. A continuously updated graph that reflects the current state of the organization and its history — connected to the actual infrastructure it describes.When a cluster is deployed, the knowledge graph knows why. When an architectural decision is made, it is captured — not in a document folder nobody will find, but in a structured, queryable, AI-accessible form.Not a mockup: the OpenKubes knowledge graph, extracted from Git by a 200-line script — every decision, component, and commit, connected.Explore the interactive version: https://kubernauts.de/en/openkubes/openkubes_knowledge_graph_force_layout.htmlLayer 2: Context StoreGitOps for knowledge.Architecture decision records. Runbooks. Postmortem analyses. Design rationale. Lessons learned. Every significant decision — versioned, auditable, and connected to the code and infrastructure it influenced.Not documentation as an afterthought. Documentation as infrastructure — with the same discipline, the same tooling, the same lifecycle.When you git blame a Kubernetes manifest, you can trace it back to the incident that caused it. When you ask why a system is designed the way it is, the answer is a git log away.This is not theory. It is how OpenKubes is already built: every architectural decision lives as a versioned ADR in the platform repository, and every deviation from upstream in our deployment guides is recorded with its reason and its operational impact. The Context Store simply makes that discipline queryable — and permanent.Layer 3: Model RuntimeOpen AI runtimes deployed anywhere — on the same infrastructure that runs your workloads.Cloud. Edge. Air-gapped factory floors. Sovereign government infrastructure.The intelligence follows the workload. Not locked in a vendor’s cloud. Not dependent on an external API that may change, disappear, or become unavailable in an air-gapped facility.The same platform engineering principles that make OpenKubes infrastructure sovereign make OpenKubes AI intelligence sovereign.Layer 4: Immortal Platform IntegrationThe platform heals itself. The intelligence remembers itself. The system continuously rebuilds both.When a cluster fails and is reprovisioned on fresh bare metal, the AI layer knows the history of that cluster — every deployment, every incident, every change. The infrastructure is new. The memory is intact.Beyond AutomationIt is important to say clearly what this is not.This is not a vision of autonomous machines replacing human engineers.This is not digital immortality for individuals.This is not artificial general intelligence.This is preservation of organizational intelligence.A future where critical knowledge no longer disappears when individuals leave.A future where the organization remembers — not just what it built, but why it built it.The engineer retires. The knowledge stays.The team disbands. The context remains.The company is acquired. The intelligence survives.Why This Matters for Industrial SystemsIn a factory, a hospital, a power grid, or a government agency — the stakes of lost knowledge are not measured in developer productivity.They are measured in production downtime, patient safety, grid stability, and national security.We have seen what happens when a factory floor loses the engineer who understood the control system. We have seen what happens when a hospital’s IT team inherits infrastructure nobody documented. We have seen what happens when a critical system needs to be rebuilt and nobody remembers the original architecture rationale.The infrastructure survived. The intelligence did not. The consequences were real.This is why OpenKubes AI is not a feature.It is a responsibility.The Complete VisionWhen we look at where OpenKubes is going, we see a platform designed not merely for uptime — but for continuity:OpenKubes IMP → Infrastructure survivesOpenKubes AI → Knowledge survivesOpenKubes Robotics → Actions surviveThe Robotics layer is not hypothetical. Open-RMF — the open robotics middleware framework for fleet management, traffic coordination, task dispatching, and simulation — runs today as a reference robotics workload on the OpenKubes platform, deployed through the same GitOps-based operational model as everything else, consistently across local, edge, bare-metal, and public cloud environments.Together these layers form something that has never existed before:A platform where systems, knowledge, and actions persist — regardless of hardware failures, software updates, team changes, or the passage of time.Infrastructure that heals itself. Intelligence that remembers itself. Systems that evolve themselves.An InvitationThis manifesto is not a product announcement.It is a direction.OpenKubes AI does not exist yet as a shipping product. But the architectural foundation does — in the Git repositories, the Crossplane compositions, the Cluster API providers, the running robotics reference workload, and the knowledge accumulated across ten years of building and operating critical Kubernetes infrastructure.The Immortal Mind is where that foundation leads.If you are building systems that cannot afford to forget — factory automation platforms, critical infrastructure, sovereign AI systems, industrial knowledge management — we want to build this with you.Not for you. With you.Because the most important knowledge to preserve is the knowledge we build together.Git is the contract. Kubernetes is the enforcer. AI is the memory.Together, they create systems that do not merely survive failure. They learn from it.???? github.com/openkubes/openkubes ???? OpenKubes Platform Presentation ???? blog.kubernauts.io ???? OpenKubes Roadmap: OK-30 Immortal PlatformArash Kaffamanesh is the founder of Clouds Sky GmbH & Kubernauts GmbH and has been building and operating Kubernetes platforms for over ten years across automotive, industrial, financial, and healthcare environments. He is the creator of OpenKubes — the open platform for self-healing sovereign Kubernetes infrastructure.The Immortal Mind was originally published in Kubernauts on Medium, where people are continuing the conversation by highlighting and responding to this story.
Quelle: blog.kubernauts.io

The 2026 Agent Confidence Index: Where 300 builders see real momentum

A couple of months ago, I sat across from my nine-year-old daughter’s teachers at a parent-teacher conference. They were kind but concerned. She takes her time on assignments, they said—often deep in thought. How would she do on timed tests next year? I told them I wasn’t worried. What they described as a problem is, to me, one of the most important things she can learn: the ability to take a hard problem and reason through it from beginning to end. In a world optimized for efficiency, qualities like patience, perseverance, and attention to detail are not deficiencies. They are the foundation of sound judgment, which will become the skills we need most.

The more time I spend working with AI, the more convinced I become: the question that matters for her future isn’t how quickly she can answer. It’s whether she has the judgment to know when an answer can be trusted.

I’ve spent decades at Microsoft watching this tension play out: first building tools for other developers, then working across AI as models moved from research curiosities to systems deployed at scale. Now we’re building Microsoft IQ, where we’re exploring how an organization’s collective intelligence can become its greatest advantage. Through every one of those chapters, one thing has remained true: it’s never enough for a system to be powerful, it must also be trustworthy.

Trust is what turns assistance into delegation. When we can trust an agent to do what we intend, within the limits we set, we can hand off the work we never wanted to spend our lives on: the repetitive tasks that drain attention, the mundane work that fills a day without moving anything meaningful forward, the dangerous work humans should not have to do, the work too vast for any individual or team. Agents should take on that toil, extend our reach, and give us back our time for the work that calls for something only humans bring.

My daughter doesn’t know any of this yet. But by the time she’s grown, most of the work that rewards speed and repetition will be work we delegate. What will matter then is exactly what gave her teachers pause: the patience to stay with a hard problem, reason through it, and decide when she’s reached a conclusion she can trust. The very thing they feared might hold her back could be exactly what the next era prizes most.

So no, I’m not worried about the timed test. I hope she grows up in a world where software carries the toil and people are freed for the work that is unmistakably ours—to think, to judge, to create, to care for one another. That is the future I want agents to make real.

But my hope is not evidence it will happen. The future I just described turns on a single question: can we trust agents to do the work? Trust is earned one task at a time. So, I went looking for evidence of where it’s been earned, and where it hasn’t.

For the past year, the conversation around AI agents has circled the same promise: eliminate toil so people can focus on what matters. But I keep coming back to sharper questions. What, exactly, is toilsome? Where does toil actually live in people’s work? What are the technical leaders closest to this shift willing to hand off—and what gives them the confidence to do it? To find out, we partnered with MIT Technology Review Insights on new research that draws directly from the people building this frontier. Not the people talking about it, the people doing it. We surveyed 300 technical experts across AI, data, and cloud domains, spanning 12 industries and 4 regions of the world, asking them to rank their confidence across 101 of the top tasks. What we got back is the 2026 Agent Confidence Index, an honest map of where agents are delivering real value, so our community can see what’s working and move forward together with conviction.

Explore the 2026 Agent Confidence Index report

Learn from where confidence is highest

Across the 101 tasks measured, average confidence already lands at 64 out of 100, and thirty tasks clear 70. The highest scores cluster on work that is both predictable and draining: the late nights, the interruptions, the low-value repetition. Automated report generation leads at 83.5. Boilerplate code generation for new features sits at 82.5, the hours a developer no longer spends rewriting the same patterns, freed for the work that challenges them. Certificate expiration monitoring and renewal, at 81.5, ends the scramble that pulls engineers off high-stakes problems for something entirely routine. Real-time data stream monitoring follows at 80.5, and release note generation from commit history at 79.5, the manual end-of-sprint commit review, gone. This is where frontier teams are already delegating to agents, regularly.

The pattern holds across every discipline. In developer and AI workflows it extends to API client maintenance and code identification; in cloud operations, to ticket routing and cost optimization; in data, to anomaly detection. Wherever it sits in the stack, this is work technical teams now trust agents to own.

What matters most here isn’t what the data says about the tasks, it’s what it says about the people delegating them. When technical experts believe in something deeply enough to hand it real work, that belief ripples outward. It becomes the recommendation they make to their leadership, the solution they build for their customers, and the culture they create for their teams.

Even the toughest agent tasks are gaining traction

Here’s what strikes me most: the tasks ranked lower on the index are still high in absolute terms. Service mesh configuration and troubleshooting sits at 37.5, database schema migration scripting at 46.5, memory leak detection at 48.5. These sit at the very frontier, the interconnected, high-stakes work where investment and innovation are concentrated right now.

Consider what they demand. Service mesh configuration touches many systems at once. Database migration carries real stakes, requiring precision across data, application, and infrastructure layers at the same time. Memory leak detection means diving deep into a system’s behavior under load, accounting for conditions that shift from one deployment to the next. These are the challenges that have separated great engineers from exceptional ones—and even here, experts see agents helping. Not carrying the work alone, but contributing where it used to be unthinkable. That confidence is still climbing, and that’s telling.

We’re shipping new capabilities constantly to support this momentum. Database migration tooling in GitHub Copilot now covers not just scripts but the full application and infrastructure migration story. The Azure Site Reliability Engineering (SRE) Agent brings decades of experience operating Azure at scale and deep profiling capabilities directly into memory analysis and performance diagnosis.

Why human judgment remains paramount

When we asked technical experts how they’re navigating agent adoption, 59% named “keeping humans in the loop” as their top priority—ahead of better observability, ahead of governance documentation, and ahead of everything else. That’s a mark of maturity. Teams moving forward with clarity treat agent oversight as non-negotiable, regardless of how capabilities evolve.

The boundary itself is straightforward. Agents excel at well-specified, high-volume, reversible work: they synthesize data, automate known workflows, and surface anomalies at a speed and scale no human team could match. The moment a decision becomes high-stakes, context-dependent, or hard to undo, a human signs off. That isn’t a limitation of the technology, it’s the architecture of a trustworthy system.

What’s changing, and what remains underappreciated, is the skill it takes to draw that boundary well: the discipline of full-lifecycle evaluations and guardrails. Success means measuring agent output against intent and keeping behavior inside your business strategy. It’s new territory for most engineering teams, and it’s becoming table stakes for modern software faster than most organizations realize. The good news: the same tools generating the work can help you build the harness. Ask GitHub Copilot to write the evals and it will. Frontier teams are already doing this, and it’s why they’re pulling ahead.

Agents are opening career doors for engineering

Across system reliability and site operations, evaluations and quality assurance, and data pipeline management, 80% or more of respondents see meaningful career opportunity ahead. We believe this is one of the most significant moments in the history of building software, not because agents replace what technical people do, but because what’s left when they take on the toil is the work that defines a career: the judgment calls, the architectural vision, the reasoning to navigate complexity under pressure. That fluency will define the next generation of technical leadership.

We’re living this shift at Microsoft, right alongside our customers. Junior developers are using agents to explore codebases on their own and arriving at mentoring conversations with sharper, more sophisticated questions. Senior engineers are covering more ground because the repetitive work that used to fill their days is now delegated, and the work that’s left is harder, more interesting, and more consequential. Both are growing into more capable versions of themselves. For me, that’s the outcome I’ve always believed technology could deliver.

An integrated approach to intelligence and trust

Designing more sophisticated agent systems has made one thing clear: agents thrive in well-integrated environments, working best when your whole stack draws on a single source of truth. The high-confidence tasks are the ones we’ve already figured out; the meaningful frontier is the harder, interconnected work, and that’s exactly where observability, governance, security, and unified intelligence have to operate as one.

Microsoft IQ brings your enterprise context into a single, continuous intelligence layer. Within it, Work IQ builds semantic understanding of how your business operates across email, calendar, meetings, chats, files, people, and collaboration patterns. Such depth of knowledge is the reason technical teams choose us and it’s what drives my focus and passion in learning how people actually work so their agents get them. My colleague Kim Manis, CVP of Product for Microsoft Fabric, has written specifically about what this means for data professionals, and the integral role of Fabric IQ.

It’s all part of the Microsoft Agent Platform, which is becoming the operating system for enterprise AI at scale. From building in GitHub and contextualizing with Microsoft IQ, to running in Microsoft Foundry and governing in Microsoft Agent 365, Microsoft is uniquely positioned to help customers bring together data, models, agents, and human judgment into a continuously improving and secure system.

Frontier transformation is being led by builders like you.

Next steps:

Download The 2026 Agent Confidence Index from our partners at MIT Technology Review Insights.It is a free, ungated deep dive into all 101 tasks, broken out by role and workflow, with the patterns and reasoning behind where confidence is strongest and the frontier is expanding.

Join us at the AI Engineering World’s Fair (June 29-July 2) where our very own Pablo Castro will keynote, and our teams will offer 16 breakout sessions and 4 labs. Swing by the Microsoft booth as well to explore an interactive 3D visualization of the Index data. We want to hear what’s working for you right now.

Learn more about Microsoft IQ and how it connects across Work IQ, Fabirc IQ, Foundry IQ, and the newly announced Web IQ. You can catch up on all the developer innovation from Microsoft Build through Satya Nadella’s keynote, Kyle Daigle’s blog post, and the Microsoft Build CLI.

What’s Working in Agentic AI

The 2026 Agent Confidence Index report reveals where agents are trusted, the challenges they face, and what leaders should do next

Download the 2026 Agent Confidence Index report

The post The 2026 Agent Confidence Index: Where 300 builders see real momentum appeared first on Microsoft Azure Blog.
Quelle: Azure

Claude in Microsoft Foundry is now generally available

Claude in Microsoft Foundry is the production path enterprises have been asking for: true frontier model choice, Azure-native controls, simplified procurement, and faster time to value.

Most enterprise AI projects do not stall because of model quality. They stall because of everything around the model: procurement, governance, networking, and data. Claude in Microsoft Foundry is now generally available, hosted on Azure, giving teams a faster path from agent experimentation to production.

Enterprises can build with Claude through their existing Azure account, using the authentication, billing, networking, governance, and data controls their teams already trust. Instead of solving for infrastructure, teams can focus on building agentic applications that run their work with Claude, in the environment where they already operate.

This is a real step forward for customers building agentic applications and want to move from AI experimentation to production. Claude brings leading capabilities for coding, agentic workflows, and complex reasoning. Microsoft Foundry brings the enterprise harness to build, evaluate, deploy, and scale those agents on Azure. Together, they give teams a trusted path to production AI with frontier model quality and the Azure controls they already trust.

Today’s announcement builds on the strategic partnership Microsoft, NVIDIA, and Anthropic announced in November 2025 to expand enterprise access to Claude on NVIDIA accelerated computing. Claude runs on NVIDIA Blackwell Ultra systems, connected by InfiniBand networking, bringing the rack-scale AI infrastructure designed for inference performance and efficiency.

Build with Claude through your Azure account

Developers can access Claude through the Messages API and use core capabilities including prompt caching, extended thinking, and tool streaming. For teams building agents, Foundry Agent Service uses Claude as the reasoning core to orchestrate multi-step planning, tool use, and task execution across enterprise systems.

Inference is processed in Azure, and customers can choose between Global and US data zones, for teams with data residency requirements. Anthropic operates the inference and is the data processor and SLA provider. Because Claude is available natively through Foundry, teams can work inside the Azure environment they already use. They can authenticate with Microsoft Entra ID, apply Azure role-based access controls, manage access through existing governance policies, and track usage through familiar Azure management experiences.

For high-sensitivity workloads, zero data retention is also available, so prompts and completions are not retained by Anthropic after the API call completes. For commercial teams, it also simplifies how Claude is purchased and consumed. Claude usage is billed in Claude Consumption Units (CCU), a single, consolidated line on your Azure bill, with MACC drawdown and per-model detail in Foundry unchanged.

For many enterprises, that matters as much as model capability. The barrier to production isn’t only whether a model is powerful enough, it’s whether teams can procure it, govern it, secure it, and operate it at scale inside their existing cloud. With Claude in Foundry, they get frontier capabilities in an Azure environment that aligns with enterprise requirements for security, compliance posture, governance, and data residency.

Running Anthropic’s models on Azure has given us the sustained throughput and reliability our enterprise customers expect. The combination of frontier model quality and enterprise-grade infrastructure is what makes Bolt viable for the Fortune 500.
—Gary Ballabio, Vice President, Partnerships, Bolt

Customers are already building with Claude in Foundry

Enterprises aren’t just running isolated pilots; they’re building production systems and agents that need throughput, reliability, governance, security, and scale.

At NVIDIA, we use autonomous AI agents every day to help our teams move faster and think bigger. Anthropic’s Claude models bring strong reasoning, coding and enterprise capabilities that are valuable for complex technical work. With Claude now available in Microsoft Foundry running on NVIDIA GB300 GPUs, more organizations can run advanced, specialized AI agents with the performance, scale and security needed for production.
—Justin Boitano, Vice President and GM of Enterprise Computing, NVIDIA

Our customers describe their tests in plain English, and Momentic runs through the interface to verify everything works before a release ships. We found Claude’s Opus models especially suited to this, and running them on Microsoft Foundry we now serve millions of tokens per minute with the reliability our customers depend on.
—Jeff An, Co-Founder and CEO, Momentic

const currentTheme =
localStorage.getItem(‘blogInABoxCurrentTheme’) ||
(window.matchMedia(‘(prefers-color-scheme: dark)’).matches ? ‘dark’ : ‘light’);

// Modify player theme based on localStorage value.
let options = {“autoplay”:false,”hideControls”:null,”language”:”en-us”,”loop”:false,”partnerName”:”cloud-blogs”,”poster”:”https://azure.microsoft.com/en-us/blog/wp-content/uploads/2026/06/Replit_Thumb-1-scaled.jpg”,”title”:””,”sources”:[{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4240300-ReplitOnMicrosoftFoundryWithClaude-0x1080-6439k”,”type”:”video/mp4″,”quality”:”HQ”},{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4240300-ReplitOnMicrosoftFoundryWithClaude-0x720-3266k”,”type”:”video/mp4″,”quality”:”HD”},{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4240300-ReplitOnMicrosoftFoundryWithClaude-0x540-2160k”,”type”:”video/mp4″,”quality”:”SD”},{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4240300-ReplitOnMicrosoftFoundryWithClaude-0x360-958k”,”type”:”video/mp4″,”quality”:”LO”}],”ccFiles”:[{“url”:”https://azure.microsoft.com/en-us/blog/wp-json/bloginabox/v1/get-captions?url=https%3A%2F%2Fwww.microsoft.com%2Fcontent%2Fdam%2Fmicrosoft%2Fbade%2Fvideos%2Fproducts-and-services%2Fen-us%2Fazure%2F4240300-replitonmicrosoftfoundrywithclaude%2F4240300-ReplitOnMicrosoftFoundryWithClaude_cc_en-us.ttml”,”locale”:”en-us”,”ccType”:”TTML”}],”downloadableFiles”:[{“url”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4240300-ReplitOnMicrosoftFoundryWithClaude_transcript_en-us”,”locale”:”en-us”,”mediaType”:”transcript”},{“url”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4240300-ReplitOnMicrosoftFoundryWithClaude_audio_en-us”,”locale”:”en-us”,”mediaType”:”audio”}]};

if (currentTheme) {
options.playButtonTheme = currentTheme;
}

document.addEventListener(‘DOMContentLoaded’, () => {
ump(“ump-6a47a7aeadec9″, options);
});

Built for coding, agents, and complex reasoning

Claude models are especially well-suited to some of the fastest-growing enterprise AI workloads. For software teams, Claude supports code generation, refactoring, debugging, test creation, and large-scale development workflows. For teams building agents, it powers multi-step reasoning, tool use, planning, and task execution. For business teams, it supports document-heavy analysis, research synthesis, and complex decision support.

In Microsoft Foundry, these capabilities connect to the broader Azure ecosystem. With Foundry Agent Service, teams orchestrate multi-step, goal-driven agents that use Claude as their reasoning core, planning, calling tools, and executing tasks across enterprise systems. Features like model router enable customers to automatically route queries to the most appropriate Claude model, saving up to 50% while improving user satisfaction. All this governed and monitored by Foundry Control Plane which continuously runs evaluations to ensure agent responses match customer expectations, even blocking responses that violate rules before they reach users.

Between Anthropic and Azure, we get the best capabilities in the world and we get the best security in the world. And that’s exactly what nuclear needs. It’s how we compressed a safety analysis that would have taken 200 human days into a single day.
—Matt Huang, Founding Product Lead, Everstar

And with Microsoft IQ, agents have access to live enterprise context which radically improves value per token, and helps Foundry amplify the impact customers can have: tools like agent optimizer in Foundry Agent Service tune the prompts which define agents so they perform better regardless of what model is under the hood.

const currentTheme =
localStorage.getItem(‘blogInABoxCurrentTheme’) ||
(window.matchMedia(‘(prefers-color-scheme: dark)’).matches ? ‘dark’ : ‘light’);

// Modify player theme based on localStorage value.
let options = {“autoplay”:false,”hideControls”:null,”language”:”en-us”,”loop”:false,”partnerName”:”cloud-blogs”,”poster”:”https://cdn-dynmedia-1.microsoft.com/is/image/microsoftcorp/4234700-AI-Skills-Studio_tbmnl_en-us?wid=1280″,”title”:””,”sources”:[{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4234700-AI-Skills-Studio-0x1080-6439k”,”type”:”video/mp4″,”quality”:”HQ”},{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4234700-AI-Skills-Studio-0x720-3266k”,”type”:”video/mp4″,”quality”:”HD”},{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4234700-AI-Skills-Studio-0x540-2160k”,”type”:”video/mp4″,”quality”:”SD”},{“src”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4234700-AI-Skills-Studio-0x360-958k”,”type”:”video/mp4″,”quality”:”LO”}],”ccFiles”:[{“url”:”https://azure.microsoft.com/en-us/blog/wp-json/bloginabox/v1/get-captions?url=https%3A%2F%2Fwww.microsoft.com%2Fcontent%2Fdam%2Fmicrosoft%2Fbade%2Fvideos%2Fproducts-and-services%2Fen-us%2Fazure%2F4234700-ai-skills-studio%2F4234700-AI-Skills-Studio_cc_en-us.ttml”,”locale”:”en-us”,”ccType”:”TTML”}],”downloadableFiles”:[{“url”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4234700-AI-Skills-Studio_transcript_en-us”,”locale”:”en-us”,”mediaType”:”transcript”},{“url”:”https://cdn-dynmedia-1.microsoft.com/is/content/microsoftcorp/4234700-AI-Skills-Studio_audio_en-us”,”locale”:”en-us”,”mediaType”:”audio”}]};

if (currentTheme) {
options.playButtonTheme = currentTheme;
}

document.addEventListener(‘DOMContentLoaded’, () => {
ump(“ump-6a47a7aeae198″, options);
});

A stronger foundation for enterprise AI on Azure

The next phase of enterprise AI will be defined by production systems: coding agents, business process agents, research assistants, customer-facing applications, and domain-specific workflows that operate reliably at scale. That takes more than access to a model. It takes a platform.

With Claude now generally available in Microsoft Foundry and hosted on Azure, customers can build with Anthropic’s leading models, orchestrating them as agents with Foundry Agent Service and grounding them in enterprise knowledge with Microsoft IQ, while using the Azure controls, commitments, and infrastructure they already trust.

Try Claude in Foundry today

Build with Anthropic’s leading models in the Azure ecosystem you know and trust.

Get started

The post Claude in Microsoft Foundry is now generally available appeared first on Microsoft Azure Blog.
Quelle: Azure

Azure IaaS: How to design, build, and optimize cloud infrastructure for long-term cost efficiency

In this article

Compute: Matching resources to workload requirementsStorage: Balancing performance and lifecycle managementNetworking: Improving efficiency without compromising resiliencyContinuous optimization is where long-term savings happenContinue your Azure IaaS optimization journeyCreate a resilient infrastructure with Azure

This blog post is the third part of a blog series called Azure IaaS which will share best practices and guidance to help you build a trusted infrastructure platform—from performance, resiliency, and security to scalability and cost efficiency.

As organizations modernize infrastructure, migrate mission-critical workloads, build cloud-native applications, and scale AI—cost efficiency remains a foundational principle of cloud architectures.

Discover the Benefits of Azure IaaS

Yet cloud costs are rarely driven by a single decision. More often, across Azure Infrastructure-as-a-Service (IaaS) environments, they are the result of many compounded architectural choices across compute, storage, and networking.

Common examples include overprovisioning infrastructure when selecting a larger virtual machine than a workload requires or keeping infrequently accessed data on premium storage, building resilient architectures that introduce unnecessary overhead, or collecting and retaining more operational data than is needed. Individually, these decisions may seem minor, but over time they can significantly impact both cost and operational efficiency.

These challenges become even more important as organizations expand AI initiatives, modernize applications, and support growing performance and resiliency requirements.

The opportunity lies in addressing these inefficiencies before they become entrenched. By making informed infrastructure decisions during planning, deployment, and ongoing operations, organizations can improve resource utilization, reduce total cost of ownership (TCO), and create a more scalable foundation for future growth.

In this blog, we’ll explore some of the most common infrastructure cost challenges organizations face today and examine how Azure IaaS capabilities across compute, storage, and networking can help improve efficiency, reduce TCO, and highlight resources available in the Azure IaaS Resource Center to help you make more informed decisions.

Many of the most impactful optimization opportunities originate long before a workload enters production. To better understand where these opportunities exist, let’s examine common efficiency challenges (and solutions) across compute, storage, and networking.

Compute: Matching resources to workload requirements

Compute inefficiencies are often the easiest to identify because they directly affect both performance and infrastructure spend.

The goal is not simply to select the lowest-cost compute option, but rather to align infrastructure resources with workload requirements while preserving flexibility for future growth.

Azure provides a broad portfolio of virtual machine options, enabling organizations to select the architecture, processor type, performance profile, and scale characteristics that best match workload needs; allowing organizations to align infrastructure investments with workload needs rather than paying for unused capacity.

Equally important is taking advantage of Azure’s flexible pricing options. Depending on workload characteristics, organizations can combine Pay-As-You-Go pricing, Azure savings plans, Azure Reservations, and Azure Spot Virtual Machines to better align costs with actual usage patterns.

For highly scalable environments, services such as Azure Virtual Machine Scale Sets automatically balance compute demand with available capacity by scaling resources up or down in real time, ensuring the environment is right-sized while optimizing cost efficiency. Azure Compute Fleet help organizations intelligently balance capacity, availability, and price-performance across large deployments; reducing the operational complexity associated with managing infrastructure at scale.

The result is a compute environment that is not only cost-efficient, but also better aligned to application requirements and business outcomes.

Storage: Balancing performance and lifecycle management

Storage inefficiencies often develop gradually, at times making them difficult to identify until environments reach significant scale. The key is to ensure that performance, capacity, and data access requirements remain aligned.

Choose the right storage service for the workload

Storage performance requirements vary dramatically across workloads. Some applications demand consistent low-latency block storage, while others prioritize storage capacity, durability, or long-term retention. Selecting the appropriate storage service and performance tier is critical to maximizing both efficiency and value.

For example:

Business applications may benefit from Premium SSD v2 offerings.

Business-critical transactional databases may require Ultra Disk to meet stringent low-latency performance requirements.

Large-scale block storage environments can benefit from consolidated architectures using Azure Elastic Storage Area Network (SAN).

Linux/Windows file shares, home directories, and shared storage scenarios may benefit from Azure Files or Azure NetApp Files.

Object storage workloads often benefit from the alignment between Azure Blob Storage tiers and data access patterns.

Automate data lifecycle management

Equally important is ensuring data remains on the appropriate storage tier throughout its lifecycle. In many environments, data access patterns change significantly over time, yet storage configurations remain static. This disconnect often results in organizations paying for performance they no longer need.

Azure Blob Storage provides capabilities that help organizations automatically align storage costs with data access patterns. Automated tiering and lifecycle policies maintain low-latency access for frequently used data while optimizing costs by transitioning infrequently accessed data to lower-cost tiers.

The result is a storage strategy that continuously adapts as usage patterns evolve, without requiring ongoing manual intervention.

Improve visibility across your storage estate

Optimization starts with understanding where costs are being generated.

Tools such as Azure Storage Discovery and Azure Storage Actions can help organizations gain visibility into their storage environments, uncover optimization opportunities, and automate actions across large-scale deployments.

Rather than managing storage account by account, teams can identify patterns and implement cost-saving actions consistently across their entire data estate.

Together, these capabilities help organizations move beyond storage provisioning and toward ongoing storage optimization.

Networking: Improving efficiency without compromising resiliency

Networking presents a unique optimization challenge because organizations must balance connectivity, performance, resiliency, and operational visibility.

Achieve resiliency more efficiently

Historically, improving resiliency often requires duplicating infrastructure components, creating additional cost and management overhead. Today, organizations increasingly seek architectures that deliver resiliency while minimizing complexity and excess infrastructure.

Azure networking capabilities help organizations evaluate these tradeoffs more effectively. Services such as ExpressRoute Metro, Zone Redundant NAT Gateway, and scalable networking architectures provide opportunities to improve resiliency and scalability while maintaining operational efficiency.

Reduce operational and logging expenses

Operational visibility is another important consideration. Network and firewall logs are essential for troubleshooting, security, and governance, but collecting every possible data point can create significant storage and operational costs over time.

Modern filtering and analytics capabilities help teams focus on the most relevant network data, reducing both storage consumption and investigation complexity.

This gives organizations the information they need while avoiding excessive log growth and long-term retention costs.

By implementing filtering, automation, and intelligent logging strategies, organizations can focus on the data that provides actionable insights while reducing unnecessary information collection and retention.

Continuous optimization is where long-term savings happen

Infrastructure efficiency is not achieved through a single migration, architecture review, or pricing decision.

As workloads evolve, usage patterns shift, and new platform capabilities become available, opportunities for optimization continuously emerge.

The organizations that realize the greatest value from cloud investments are often those that treat optimization as an ongoing operational discipline. They regularly evaluate infrastructure utilization, revisit architectural assumptions, automate lifecycle management processes, and adopt new capabilities that improve efficiency across their environments.

While individual improvements may appear incremental, the cumulative impact can be substantial. A right-sized virtual machine (VM), a more appropriate storage tier, an automated lifecycle policy, or a more efficient networking architecture may each deliver modest savings independently. Together, they create a more efficient, scalable, and resilient infrastructure foundation.

Azure continues to deliver important capabilities such as Azure Copilot to help customers optimize cloud costs by combining real-time insights, AI-driven recommendations, and automated optimization actions, empowering teams to quickly identify waste, right-size resources, and forecast spend with minimal effort.

Continue your Azure IaaS optimization journey

Whether you’re supporting AI workloads, modernizing existing applications, migrating existing workloads, or planning future growth, building efficiency into cloud architectures has never been more important.

The Azure IaaS Resource Center provides guidance, best practices, technical resources, and optimization strategies across compute, storage, and networking to help you design, build, and optimize Azure environments with confidence.

Visit the Azure IaaS Resource Center to explore cost optimization guidance, architectural best practices, product resources, and tools that can help you maximize value from your Azure infrastructure investments.

To go deeper, explore the Azure IaaS Resource Center for tutorials, best practices, and guidance across compute, storage, and networking to help you design and operate resilient infrastructure with greater confidence.

Create a resilient infrastructure with Azure

Visit the Azure IaaS Resource Center to start building a stronger, more efficient infrastructure today.

Get started with Azure

Did you miss these posts in the Azure IaaS series?

Explore new resources for building a stronger, more efficient infrastructure

Keep critical applications running with built-in resiliency at scale

Defense in depth built on secure-by-design principles 

Deploy high-performance workloads with a system-level approach 

The post Azure IaaS: How to design, build, and optimize cloud infrastructure for long-term cost efficiency appeared first on Microsoft Azure Blog.
Quelle: Azure

Proving application resilience on Azure with Chaos Studio

Takeaway: Azure Chaos Studio helps organizations validate application resilience by simulating outages, failovers, network disruptions, and infrastructure failures before they impact production.

You don’t know with certainty that your application is resilient until that resilience is tested. Better to learn it isn’t by deliberately breaking it in a test environment and watching how it reacts, than by a failure in production. Azure Chaos Studio is our managed service for doing exactly that, safely and on purpose.

Today, Azure Chaos Studio Workspaces is in public preview: a scenario-focused approach that lets you test the failure modes Azure customers actually see in production. We’ve been hard at work making Workspaces easy to use, with broad fault support and named scenarios that mirror real outages, instead of isolated faults.

Explore Azure Chaos Studio Workspaces

Why designing for resilience isn’t enough

Azure customers have invested in resilient design: multi-zone deployments, geo-redundant storage, automatic database failover, retry logic, load-balanced front ends. However, the real question is when an incident begins: when the failure arrives, do those mechanisms recover your application in the time you assumed they would?

Real outages don’t read the architecture diagram. A zone-redundant deployment can fail because a health probe was misconfigured years ago. A database with automatic failover can leave the application dead because a connection string is hard coded to a single region. Geo-redundant storage can briefly produce stale reads the application code never expected. These mistakes are common, and they only show up when the failure happens.

Reliability and resiliency on Azure are a shared responsibility. Microsoft is responsible for the platform and the resilience built into Azure services. Customers are responsible for configuring that resilience and the code that uses it. No layer makes up for a gap in another. The only way to know whether your architecture, configuration, and application logic will hold up in production is to prove they hold under failure before an outage tests them for you.

How Chaos Studio Workspaces changes resilience testing

Chaos Studio is Azure’s managed chaos engineering service for validating how applications behave under failure. By simulating controlled disruptions across infrastructure, networking, databases, and application dependencies, it helps teams uncover resilience gaps before customers experience them. Chaos Studio Workspaces focuses on scenarios that match what happens in production, so you start from a real outage pattern instead of assembling individual faults. You begin with a named scenario like Zone Down, DNS Outage, or SQL failover, already sequenced against the resources in a Workspace.

Most outages exercise two layers at once. There’s the platform layer: did the service come back, did failover complete within your Recovery Time Objective, did traffic reroute. And there’s the application layer: did your code maintain data integrity, pick up in-flight transactions, retry the right things, degrade gracefully. A chaos test that only stops a Virtual Machine (VM) tells you about the platform layer. The scenarios in Chaos Studio Workspaces are designed to validate the entire stack.

Workspaces reduce the burden of getting started. The most common reason resilience testing stalls is that teams don’t know where to start. The Workspace is the new top-level resource: you point it at a subscription or resource group, and its managed identity discovers what’s in scope and recommends the scenarios that apply. Those scenarios show up inside the Workspace, ready to configure and run, and a refresh, updates the recommendations whenever your infrastructure changes.

A library of real outage scenarios. Chaos Studio Workspaces ships with curated scenarios informed by patterns observed in real Azure incidents, so the patterns you test against are the patterns customers actually experience. Think of these as resilience templates, a fast path to the failure modes most teams need to test, and when you need something different, design your own from the same fault library.

Available today:

Availability Zone Down: Virtual Machine Scale Sets (VMSS) shutdown with per-zone targeting to validate cross-zone routing and recovery.

Availability Zone Down and Database failover: Compute Zone Down composed with Azure Database for PostgreSQL (Flexible Server) failover, to observe failover behavior against your configured recovery objectives and application-side connection handling.

DNS Outage: a full DNS resolution outage via NSG rules that block resolver traffic, to validate how your application behaves when name resolution fails.

Microsoft Entra ID Outage: identity-provider failure that exercises authentication retry, token caching, and fallback paths.

Cache Stampede: Redis flush combined with database restart and an App Service process crash, to validate behavior under a cache-miss storm and the resulting database surge. The App Service process-crash variant currently supports Windows App Service plans.

Event-Driven Messaging Disruption: Azure Service Bus and Event Hubs disable, to validate dead-letter handling and backpressure.

Behind every scenario are granular API-level actions built for Workspaces:

Zonal VMSS shutdown

App Service process kill

Force-failover for Azure Database for PostgreSQL (Flexible Server)

Azure Managed Redis flush

Network Security Group (NSG)-based network controls

Each scenario composes the right faults automatically. And when a curated scenario doesn’t match your workload, you can build your own. The new Scenario Designer is a drag-and-drop experience in the Azure portal for composing any of these faults into a custom scenario arranging steps, branches, and faults with the same flexibility as classic Chaos Studio experiments, now available directly inside Workspaces. Start with a curated template, or design from scratch using the full fault library.

VM agent faults such as Central Processing Unit (CPU) and memory pressure also run in Workspaces. Each scenario sequences the right combination of faults automatically, so running Zone Down + Database Failover doesn’t mean thinking in terms of “shut down VMSS instances in zone 1, then force-failover the database primary.” The library will keep growing through public preview and into GA, with plans to explore additional scenarios over time, such as:

Storage account failover

Microsoft Azure SQL Managed Instance failover

Microsoft Azure Front Door and Microsoft Azure Application Gateway

Partial zone degradation

Microsoft Azure Kubernetes Service (AKS)-native pod chaos

Customer-observed region down

That same foundation is also relevant for AI applications moving into production. Copilots, agents, retrieval-augmented generation pipelines, and inference endpoints may introduce new AI-specific failure modes, but they still rely on the same Azure building blocks as other distributed applications: compute, databases, caches, search indexes, identity, networking, messaging, and storage. Chaos Studio Workspaces can validate that foundation today through scenarios like Zone Down, Database Failover, DNS Outage, Cache Stampede, and Event-Driven Messaging Disruption, while the catalog continues to evolve toward AI-specific behaviors such as retrieval drift, token throttling, and model behavior shifts under load as more insights are gathered fromworking closely with customers building AI on Azure.

Scenario reports. When a run finishes, Chaos Studio Workspaces generates a structured drill report. It lays out what the scenario injected, which resources it affected, how the recovery timeline played out, which signals were attributable to the drill versus the normal baseline, and where the workload behaved differently than expected. The report reads like an internal post-incident review, which makes it useful both for the team that ran the drill and for the leaders who want to see resilience being validated regularly. Teams can export it and attach it to change tickets, audit evidence, or service health reviews.

Bringing resilience testing into AI-powered operations

Alongside the product, we’re shipping two ways to drive Chaos Studio from the tools engineers already work in. The first is the Chaos Studio Skill for GitHub Copilot: it walks you through the whole loop in a conversation. Point a Workspace at a subscription, see the scenarios it recommends, run a drill, and get back a report of what actually happened, correlated against your Azure Monitor signals.

The second is an Model Context Protocol (MCP) server that exposes the same Chaos Studio operations as typed tools, so other assistants and autonomous agents: Claude, Cursor, Codex, or your own, can provision a Workspace, run a scenario, and query the signals around it without a person in the loop. Both run against the same Chaos Studio APIs and your own Azure sign-in, and you can try them today.

We’re shipping this on day one for one reason: When a customer asks an AI assistant about Chaos Studio, the experience should be shaped by us, not improvised by a large language model (LLM) reading our REST API. In our experience, one of the hardest parts of resilience testing is often deciding to run the drill in the first place, and that decision increasingly lives in the chat tools engineers already use, so this needs to live there too.

Where this is headed: The Skill becoming a step inside automated operations flows on Microsoft Foundry, and one of the ways an Azure SRE agent validates its own assumptions about how a workload fails. Try it and tell us what’s missing; we’ll close the gaps through public preview.

Get started

Azure Chaos Studio Workspaces is in public preview today. General availability is currently targeted for late 2026, subject to change.

To start:

Create a Workspace scoped to a subscription or resource group you want to test.

Let discovery populate the recommended scenarios for the resources it finds. Prefer to build your own? Open the Scenario Designer and compose a custom scenario from the fault library, no scripting required.

Run your first drill. If you’ve never run a chaos experiment, run Zone Down. A full availability-zone failure surfaces how compute placement, database failover, DNS resolution, and application-layer retry logic respond under stress. If your workload recovers within an acceptable time, you’ve gained evidence about how it responds to one of the most common causes of extended cloud downtime. If it doesn’t, you’ve found the gap on your terms instead of your customers’.

Resilience isn’t something a single feature, a single redundancy mechanism, or a single architecture decision will give you. It’s an engineering discipline, and the discipline requires verification. Azure Chaos Studio Workspaces is how we’re making that verification the default for Azure workloads, including the AI workloads more of our customers are putting into production.

Related resources

Azure Chaos Studio

Microsoft Azure Well-Architected Framework—Reliability

Recommendations for using availability zones and regions

Business continuity and disaster recovery

Reliability guides for Azure services

Chaos Studio Workspaces documentation

Scenario catalog

Quickstart: create a Workspace and run your first drill

Announcing Azure Infrastructure Resiliency Manager Public Preview

Run your first resilience testing today

With Azure Chaos Studio Workspaces, you can simulate failures across your stack and gain practical insight into recovery behavior

Try now

The post Proving application resilience on Azure with Chaos Studio appeared first on Microsoft Azure Blog.
Quelle: Azure

Meet Brain: The AI system behind Azure reliability

In this article

How Azure's AI-powered reliability intelligence system worksWhy Brain is neededWhat is Brain? Azure’s centralized AIOps for cloud reliabilityFoundations of Azure’s digital twin for cloud healthWhat it means to operate against a cloud intelligence systemThe future of agentic AI and cloud operationsWhat's next for Azure reliability and BrainAzure reliability

Takeaway: Brain is Azure’s AI-powered cloud reliability intelligence system: an AIOps system that sits as an intelligent layer on top of Azure Resource Graph and fuses platform telemetry, AI/ML models, service dependencies, and customer impact into a single, continuously updated view of how every service, region, and workload is performing. It already powers customer Azure resource health notifications, deployment safeguards, and outage declaration, and it is the foundation for agentic AI now reshaping how Azure operates. This post starts a multi-part series on what Brain is, how we built it, what we’ve learned operating it at scale, and where it goes next.

How Azure’s AI-powered reliability intelligence system works

Azure runs on a digital twin of its own health. Brain is an AIOps-powered cloud health intelligence system that operates as an intelligent layer on top of Azure Resource Graph (ARG); together, they form this digital twin. It integrates platform telemetry, AI/ML models, and data engineering to continuously maintain and enrich a real-time view of how services, regions, and customer workloads are performing across Azure. Over time, that shared picture is becoming the foundation for a more automated reliability surface: one that can turn insight into action.

Today, Brain already powers important reliability workflows across Azure, such as health notifications for customer’s resources, deployment safeguards, and outage declaration. If you run on Azure, Brain is already changing three things you can notice:

How fast we tell you when something is wrong.

How accurately we scope it to your resources.

How quickly the right engineer gets on it.

This post is about how and what it lets you do differently.

We’re starting a multi-post series with this one to take you through what Brain is, how we built it, what it has learned operating Azure at scale, and where it goes next. Today, the foundation.

Learn more about Azure reliability

Why Brain is needed

Azure runs hundreds of services across more than 80 Azure regions, over 500 datacenters, and over 800,000 kilometers of fiber and subsea cable, representing one of the world’s largest global cloud footprints. And yet with the massive amount of activity these Azure services create, manage, and process worldwide, on a quietly degrading day, we will sometimes still learn about an issue from a customer before our own systems do. For customers, that gap is the worst kind of incident; the one where they are debugging their own application before they learn the fault was ours.

That gap between what we measure and what we know is the limiting factor on cloud reliability today. It is not a tooling problem. We have plenty of tools. It is a comprehension problem. The amount of signal a hyperscale cloud produces has outgrown the human ability to read it, and the conventional answer: more dashboards, more alerts, more on-call rotations. It’s a treadmill, not an answer. Every additional dashboard gives an operator another window to look through; what’s missing is something that tells them what they’re looking at, in time to act.

Closing that gap meant building something we hadn’t built yet: not better dashboards, not smarter alerts, but a continuously updated model of the platform’s health that reasons across every signal in real time, and acts on those conclusions automatically at the scale the platform demands.

What is Brain? Azure’s centralized AIOps for cloud reliability

Brain is Azure’s centralized AIOps-powered cloud health intelligence system that uses AI/ML, including agentic AI and data engineering, to continuously model Azure’s health and to automatically take reliability actions based on it. It has been utilized in Azure production generating resource health determinations across the platform. 

At its core, Brain is shaped by three things: what goes in, what comes out, and what those outputs drive.

Brain at a glance.

Brain ingests signals from three classes of source:

Standardized service-level indicators: the SLIs Azure customers and operators already know from their reliability dashboards.

Domain-specific monitors that individual service teams have built and registered with Brain, and the broader telemetry stream including deployments, support volume, and cross-service dependency signals.

Third-party indicators that surround every Azure operation.

Each path serves a different purpose; together, they give Brain coverage that no single path could.

Regardless of the input, Brain evaluates every subject (service, region, deployment unit, or customer resource) and returns four outputs: health state, severity, impact, and the reason for its conclusion. Standard outputs in standard vocabulary mean every downstream system speaks the same language; no more disconnect in what “impacted” means across teams.

The insights generated by Brain power a growing set of automated reliability actions, including:

Outage declarations based on blast radius.

Customer notifications targeted to affected subscriptions and regions.

Incident routing to the appropriate service team.

Deployment gates that pause harmful rollouts.

Linking related incidents.

Diagnostic tools that help engineers investigate issues.

Foundations of Azure’s digital twin for cloud health

To understand what makes “the intelligence system” different from “a dashboard,” it helps to look at what’s actually in its foundation. Brain’s representation of Azure carries, at minimum:

Topology: every service, region, availability zone, deployment unit, and dependency graph enabled by Azure Resource Graph is represented as a live model that updates as services scale, dependencies change, and new components come online. This transparency into Azure service health and downstream impact helps Azure customers understand and diagnose application issues more quickly and improves the reliability of applications built on Azure.

Service catalog: what each service does, who owns it, what its tier is, what its expected behavior looks like, and what its service-level objectives are.

Runtime state: live indicators of how every component is currently behaving, including error rates, latency, throughput, resource utilization, and error distributions across customers.

Intent: what’s supposed to be happening right now, which deployments are in flight, which planned operations are underway, and which capacity changes are scheduled.

History: prior incidents, what caused them, what mitigated them, and which signals preceded them. The system’s working memory of how Azure has gotten unhealthy before, and what worked to fix it.

The customer’s view: what each tenant is currently experiencing. Not just what the platform is emitting, but what’s actually arriving at the customer’s application. Errors customers see, latency customers feel, and regions where their traffic is succeeding or failing.

None of these are novel on their own: every cloud platform has versions of each. Brain brings them together into a single, unified, AI-driven representation instead of scattering them across twelve separate dashboards in twelve separate tools that an operator has to mentally connect under time pressure.

When Brain says a service is degrading, that statement is not a threshold being crossed. It is a determination made by reasoning across topology, runtime state, current intent, historical patterns, and customer-side evidence simultaneously. It is the intelligence system speaking, not a metric firing. And it is the speed of that determination measured in seconds, not in the minutes a human would take to assemble the same picture from separate tools that translates directly into customer experience: shorter incidents, sharper notifications, and faster routing.

What it means to operate against a cloud intelligence system

This is the move that changes everything for an Azure customer, and it’s the one most easily missed if you read “digital twin” as a metaphor rather than as a system.

Consider how a deployment-driven degradation typically resolves in two different worlds.

In a world without a shared intelligence system, the work is reconstruction. A rollout is in flight. A region’s error rate begins to drift.

The team that owns the service sees the drift in their dashboard.

The team that owns the upstream dependency sees a different metric drift in their dashboard.

The team that owns the deployment system sees the rollout proceeding normally from their dashboard.

None of those three teams initially have the picture; they get on a bridge and assemble it from fragments. While they assemble, the customer impact spreads. By the time the connection between the rollout, the dependency, and the customer-visible errors is made, by humans, under pressure, mid-incident, the rollout has reached more regions, the customer ticket queue has grown, and the resolution is now harder than it had to be.

In a world with the intelligence system, the work is consumption. The rollout is in the intelligence system, Brain knows it’s in flight: what it’s changing, what regions it’s reaching, what it’s supposed to do. The error-rate drift is in the system: Brain sees it correlated to the rollout, weighted against the dependency graph, evaluated against historical patterns of what “small wobble” looks like versus what “real degradation” looks like.

The affected customers are in the system, their tenants map to platform resources affected by the upstream dependency, which is itself affected by the rollout. Brain produces a single determination: the rollout is causing customer-visible impact in this region; expected resolution requires the rollout to pause.

That determination then flows, at the same moment, to every system that needs to act on it. The deployment system pauses the rollout while the determination is true, so the next set of customers Brain would have impacted aren’t impacted at all.

The incident management system creates a single incident with the upstream dependency identified, not three duplicate incidents from three confused teams so the right engineer reaches the right problem first. The customer communication system drafts a notification with the right tenant scope and the right plain-English description, so the customers who are affected receive updates from Microsoft sooner, with information they can actually use. For Azure customers, none of that coordination is visible. What’s visible is a shorter incident, an accurate alert that hit their automation instead of a human, and diagnosis that was already named when their on call opened the bridge. On services where Brain’s resource-health evaluation is in production, detection precision for service-impacting issues has improved significantly, and coverage of in-scope incidents continues to expand.

In the past year, a substantial majority of Brain-integrated outages were auto-communicated to affected customers, and on those, time-to-notification improved materially compared to manually issued notifications.

None of those downstream systems are doing their own investigation. They all consume the same determination from the intelligence system, in the same vocabulary, with the same supporting evidence. That is what “operating against an intelligence system” means and it is the first thing we found we had to build before any of the agentic AI work that people associate with Azure today became viable.

This not only helps to improve Azure’s reliability, but also benefits Azure customers who built their applications on top of Azure by providing transparency of service health and timely communications.

The future of agentic AI and cloud operations

There is a larger conversation happening across the cloud industry this year about agentic AI and about AI systems that act, not just observe. Microsoft is part of that conversation. But the conversation has a quiet asymmetry that gets less attention than it deserves.

Agents need something to be agentic about:

A triage agent that doesn’t know the dependency graph cannot triage anything.

A diagnosis agent that cannot reach prior incident history cannot reason about root cause.

A communication agent that doesn’t know which customers are actually affected cannot write to them.

None of these systems are meaningfully autonomous; none of them deserve your trust if every one of them has to do their own investigation of what reality is, every time, from raw signals.

That is what made the health intelligence system “the digital twin”: the prerequisite, not the consequence, of agentic operations at this scale. Build the agents first, on top of fragmented data, and you get a federation of confident systems that disagree with each other in production. Build the model first, and the agents become composable: they reason from the same picture, and the picture is one you can audit.

This is the throughline of the series we’re starting today. Brain is the cloud health intelligence system the next generation of cloud agents will need. If your organization is exploring agentic AI for any operations function: your cloud, your applications, or your infrastructure, the architectural pattern Brain represents is one to look at carefully. The agents are the headline; the intelligence system underneath is the work.

What’s next for Azure reliability and Brain

We have the system. The system has determination. A service in a region is degrading.

However, degrading compared to what? Healthy by whose definition? When two teams disagree about whether their service is healthy, which one is right? When the platform is degrading but no individual customer is impacted yet, what state are we actually in?

Those are not philosophical questions. They are the next engineering questions we have to answer, because a system cannot make determinations until the people building it agree on what determinations actually are. Most of the industry, until recently, has been quietly getting this wrong.

In the next post in this series, we’ll show you exactly how, and what we built to replace the broken vocabulary of cloud health that the industry has been operating on for the last decade. To follow the series as new posts are published, see the Advancing reliability blog tag.

Azure reliability

Query, explore, and analyze your cloud resources at scale.

Learn more

Acknowledgments

This work reflects the contributions of many engineers and researchers across the Brain AIOps team, MSR (Microsoft Research), and Azure service teams.
The post Meet Brain: The AI system behind Azure reliability appeared first on Microsoft Azure Blog.
Quelle: Azure