A Developer’s Guide to Managing Models, Cost and Quality in Microsoft Foundry

The hardest part of building AI systems today is no longer getting access to a capable model. It is knowing how to choose, validate, optimize, and operate the right model across the full lifecycle of a real application.

Take a retrieval-augmented generation (RAG)-based customer support copilot or a tool-calling agent that helps employees complete business workflows. In a prototype, it may be enough to pick a strong model, connect a few data sources, and get a useful response. In production, the system needs to retrieve the right context, call the right tools, meet quality and safety thresholds, stay within latency targets, and run at a cost the business can sustain.

Models evolve, costs shift, and production requirements often arrive after the first version is already working. Success depends less on choosing the most powerful model and more on building a disciplined operating approach around the application.

That is where Microsoft Foundry comes in: a unified platform to select, evaluate, optimize, operate, and continuously improve AI applications at production scale.

What’s new

Microsoft Foundry continues to expand the model ecosystem and operating surface for developers building production AI systems.

Fireworks AI on Microsoft Foundry is now generally available, giving developers access to production-grade open model inference through a single Azure endpoint, with enterprise service-level agreements (SLAs) and zero-setup onboarding.

Foundry is also adding new model families and capabilities across modalities, including Microsoft AI models, partner models, open-source models, custom models, and post-trained variants. Together, these updates give developers more choice while keeping selection, evaluation, deployment, and operations in one consistent workflow.

The challenge is no longer access. It is operations.

In a prototype, the questions are simple: Can the model answer the prompt? Can it connect to my data? Can it complete the happy path?

In production, the questions change. Which model fits each task? How do I validate it on my own data? What latency budget does this experience require? How much throughput do I need at peak? What happens when quota is constrained, costs spike, or a newer model becomes available? How do I monitor quality, detect eval drift, roll back safely, and prove the system is governed?

Agentic systems often fail when the model is mismatched, evaluation is incomplete, costs run unchecked, or governance arrives too late. Teams that rely on a single provider face another risk: lock-in, with no escape hatch when a model degrades, pricing changes, or capacity becomes constrained.

Foundry is built on the opposite philosophy. It is a model-agnostic platform spanning Microsoft, open-source, and independent software vendor (ISV) partner models, all on the same operating surface.

The answer is to treat model selection and optimization as a continuous operating discipline: 

1. Select the right model for the task

Model selection is about workload fit, not leaderboard rank. Before choosing a model, define the task contract: what the model needs to do, what good looks like, what constraints it must operate within, and which failure modes are unacceptable.

A routing step may need low latency. A policy question may need grounded reasoning with citations. A coding agent may need deeper reasoning and tool use. A customer-facing copilot may need strong safety boundaries, predictable latency, and cost efficiency at scale.

A simple model selection framework:

Workload needFavor this approachWhyClassification, routing, extraction, or high-volume chatSmaller, lower-latency modelKeeps cost and latency lowComplex reasoning, coding, or planningStronger reasoning modelImproves quality for harder tasksImage, speech, voice, or physical AIModality-specific modelMatches the model to the input and output typeMixed workloads with different complexityModel RouterRoutes each request based on quality, cost, and latencyDomain-specific behavior, tone, or formatFine-tuned or custom modelImproves consistency for your scenario

Effective model choice depends on four dimensions: capability, safety, latency, and cost.

Foundry helps developers make these tradeoffs through a broad model ecosystem and a consistent operating surface. Developers can access Microsoft models, leading base models, partner models like Fireworks AI, open-source models, custom models, and post-trained variants through one selection, evaluation, and deployment workflow.

Developer tip: For developers who want to bypass manual selection, Foundry provides Model Router in Foundry Models. Model Router automatically routes each request to the most appropriate model based on workload characteristics, cost targets, and latency requirements.

2. Validate with your own evals and data

Benchmarks are not enough. A model that leads a public leaderboard may still underperform on your prompts, your data, your users, and your business rules. Production confidence comes from evaluating against the workloads your application will actually run.

With Foundry, developers can bring their own evaluation inputs, including CSV or JSONL datasets with prompts, expected outputs, labels, or ground-truth answers. They can run side-by-side comparisons across models and prompts, evaluate agents and multi-step workflows, and inspect results across datasets, traces, and production-like scenarios.

Built-in quality and safety evaluators help measure signals such as relevance, groundedness, coherence, fluency, safety, and policy adherence. Custom evaluators can capture application-specific rules, formats, and business logic.

A strong evaluation covers:

Quality: Did the model complete the task correctly? Accuracy and groundedness: Did it produce reliable answers based on the right context? Safety: Did it follow policies and avoid unacceptable responses? Performance: Did it meet latency, throughput, and reliability requirements? Cost: Did it deliver the right outcome at the right price?

Evaluation should run continuously as new model versions, fine-tuned variants, agent changes, or new model families become available.

Developer tip: Define success criteria before opening the model catalog. Criteria-first evaluation prevents anchoring on model reputation instead of workload fit.

3. Optimize cost and performance

Cost is a first-class architectural concern, not an afterthought. In prototypes, it may be acceptable to send every task to the most capable model. In production, that approach breaks down quickly.

A simple classification task, a RAG response, a long-context reasoning workflow, and a multi-step agentic process should not always use the same model or deployment strategy.

Foundry gives developers levers to optimize across quality, cost, and latency at the system level:

Intelligent routing: Send each task to the right model based on complexity and budget. Batching: Use asynchronous processing for workloads that do not require real-time responses. Caching: Avoid paying repeatedly for identical or near-identical requests. Provisioned throughput: Use dedicated capacity for predictable performance at scale. Quota management: Scale more predictably with quota tiering, global customer quota, and data zone customer quota. Model optimization: Use model compression, fine-tuning, or distillation where appropriate.

Fireworks AI on Foundry is now generally available, giving developers access to a high-performance open model catalog through a single Azure endpoint, with enterprise SLAs, no separate infrastructure, and no separate contracts.

Developer tip: Profile cost by task type before optimizing globally. Routing decisions are workload-specific, not one-size-fits-all.

4. Operate at scale with enterprise confidence

Deploying an endpoint is not the same as operating a production AI system. Teams need to understand how the system behaves, enforce policies, monitor usage and cost, test model changes safely, and roll back when quality or performance regresses.

Foundry brings these operating capabilities into one surface: versioning, SLA-backed reliability, security, governance, access controls, audit logging, usage monitoring, and controlled upgrades.

Teams can monitor token usage and throughput, inspect logs and traces, evaluate model and agent behavior, enforce policies, and compare changes before rolling them out broadly. As new model versions become available, they can test against evaluation datasets and traces, validate quality, latency, and cost impact, and reduce risk with versioning and rollback strategies.

The Fireworks AI on Foundry generally available (GA) release is a concrete example of this operating model, with enterprise SLAs, provisioned throughput unit (PTU) Data Zone support, SOC2 readiness, and the same access controls and audit logging that govern Foundry.

Production adopters span AI-native and traditional enterprise workloads, including Perplexity, Motif, UiPath, and StackBlitz. During preview, the platform processed more than 176 billion tokens across 17 S&P 500 enterprises.

Developer tip: Treat model upgrades like dependency upgrades: test against baselines, stage rollouts, monitor regressions, and keep a rollback plan.

5. Continuously improve as models and workloads evolve

AI systems are dynamic. Models improve, workloads shift, user behavior changes, pricing evolves, and new model families arrive. The best system today may not be the best system six months from now.

That is why the lifecycle loop matters:

Select the right model for the task. Evaluate it against your own data and production baselines. Optimize for quality, cost, latency, and throughput. Operate with governance, observability, and reliability. Improve as new models, tools, and customization options emerge.

For engineering teams, every model, prompt, tool, agent, or workflow change should be treated like a production change. New model versions should be tested automatically against regression datasets, production traces, and known edge cases before rollout.

A model may improve quality but increase latency, reduce cost but weaken groundedness, or perform better on common cases while regressing on high-risk scenarios. Automated evaluations help teams detect those tradeoffs early.

Developer tip: Automate your evaluation pipeline so every new model version is compared against production baselines for quality, safety, latency, throughput, and cost before deployment.

What this means for developers

The next phase of AI development will not be won by teams that simply have access to the biggest models. It will be won by teams that know how to operate models well.

That means choosing by workload fit, validating with real data, optimizing cost and performance, deploying with governance, and improving as the landscape shifts.

Microsoft Foundry is designed for exactly this reality: a model-agnostic platform spanning Microsoft, open-source, and ISV models, all on one operating surface. No lock-in. No re-architecture. No guesswork.

The future of AI development is not about guessing which model might work. It is about building an operating discipline that lets you know.

Get started

Microsoft Foundry portal

Microsoft Foundry documentation

Fireworks AI on Foundry (now generally available)

Evaluation quickstart

Quota management docs

Watch BRK230: Build smarter AI systems in Foundry as models and costs evolve

Claude Foundry Skilling Learning Path

The post A Developer’s Guide to Managing Models, Cost and Quality in Microsoft Foundry appeared first on Microsoft Azure Blog.
Quelle: Azure

Amazon ECS Managed Daemons now support inter-task visibility and communication

Amazon ECS Managed Daemons now support inter-task visibility and communication, enabling customers to deploy tracing, profiling, and security agents that require access to application processes and shared IPC resources on ECS Managed Instances. With this launch, you can configure two new settings in ECS daemon definitions: pidMode controls whether the daemon can see all processes on the instance, and ipcMode controls whether the daemon shares an IPC namespace with other containers on the instance. Setting either to “shared” grants the daemon access to the respective namespace; the default of “none” keeps daemons isolated from application containers and other tasks. These settings let you run process-aware and IPC-dependent agents as ECS daemons instead of embedding them as sidecars in application task definitions. ECS places exactly one daemon task per managed instance and starts daemons before application tasks, so platform teams can deploy and update agents independently with consistent coverage across all workloads. To get started, register a daemon task definition specifying pidMode or ipcMode set to “shared” using the AWS Console, CLI, CloudFormation, or AWS SDKs, then create or update a daemon with associated ECS Managed Instances capacity providers in your clusters. This feature is now available in all AWS Regions at no additional cost. For more details, refer to our documentation.
Quelle: aws.amazon.com

Amazon OpenSearch Service launches MCP Apps for agentic observability

Amazon OpenSearch Service now supports MCP Apps, bringing observability workflows directly into compatible agentic IDEs such as Claude Desktop and VS Code. With this capability, your AI agent in local environment can investigate incidents using logs, traces, metrics, and alerts stored in OpenSearch domains, collections and Amazon Managed Service for Prometheus. You can easily review and verify the results in interactive MCP App visualizations without leaving your local environment. Each MCP App tool call returns a dual response, a concise text summary for your agent to reason over and an interactive visualization rendered in the same conversation thread for you to review. You can work alongside your observability agent from firing an alert, perform root cause analysis, exploring distributed traces, service maps, PromQL metric charts, and cross-signal correlations all within a single conversation. Available MCP App tools cover log, metrics and trace investigation, service performance, topology, dynamic visualizations, agent health, cluster health, and instrumentation scoring.
The OpenSearch MCP app experience is available is available in all AWS Regions where Amazon OpenSearch UI is offered. To get started, follow the instructions in OpenSearch Agentic observability with MCP Apps. To learn more about OpenSearch, visit Amazon OpenSearch Service Developer Guide.
Quelle: aws.amazon.com

OpenAI GPT-5.4 and GPT-5.5 models now available in US East (N. Virginia) on Amazon Bedrock

Today, AWS announces the expanded availability of OpenAI’s GPT-5.4 and GPT-5.5 models, which are now available in the US East (N. Virginia) Region on Amazon Bedrock. With GPT-5.4 and GPT-5.5, you can build generative AI applications across reasoning, coding, computer use, document workflows, and long-running agentic tasks. GPT-5.5 is OpenAI’s most capable model, designed for advanced coding, research, analysis, software operation, document workflows, and long-running agentic tasks. It can understand open-ended goals, use tools, reason across longer workflows, navigate ambiguity, and carry complex tasks through to completion with less orchestration. GPT-5.4 brings frontier reasoning, coding, computer use, long-context workflows, and tool use to production applications that interpret context, interact with tools, operate software environments, and verify outputs across multiple steps. Both models support a 272K-token context window, accept text and image input, and are available through the Responses API with support for server-side and client-side tool calling, projects, and response streaming. With this launch, GPT-5.4 and GPT-5.5 are now available in additional AWS Regions. To get started, visit the GPT-5.5 and GPT-5.4 model cards in our documentation.
Quelle: aws.amazon.com

Amazon FSx for OpenZFS Intelligent-Tiering storage class is now available in 8 additional AWS Regions

You can now create Amazon FSx for OpenZFS file systems with the Intelligent-Tiering storage class in 8 additional AWS Regions across the US, Europe, Asia Pacific, and South America.
FSx Intelligent-Tiering is built for general-purpose file workloads such as file shares, archives, media libraries, and migrations from on-premises HDD storage. It automatically moves your data across three storage tiers (Frequent Access, Infrequent Access, and Archive) based on access patterns, and an optional SSD read cache keeps your active data fast. You get high performance for active workloads and low-cost storage for everything else, paying only for what you store with no capacity to manage. With FSx Intelligent-Tiering, you can save up to 85% compared to the FSx SSD storage class and up to 20% compared to on-premises HDD-based NAS.
With this expansion, the FSx Intelligent-Tiering storage class is now available for FSx for OpenZFS file systems in the following additional AWS Regions: US West (N. California), Europe (London, Stockholm, Spain, Zurich), Asia Pacific (Hyderabad, Seoul), and South America (São Paulo). To learn more, visit the FSx Intelligent-Tiering page and the Amazon FSx for OpenZFS product page, and see the FSx for OpenZFS Region Table for complete regional availability information.
Quelle: aws.amazon.com