New serverless model customization capability in Amazon SageMaker AI

Amazon Web Services (AWS) announces a new serverless model customization capability in Amazon SageMaker AI that empowers AI developers to quickly customize popular models with supervised fine-tuning and the latest techniques such as reinforcement learning. Amazon SageMaker AI is a fully managed service that brings together a broad set of tools to enable high-performance, low-cost AI model development for any use case.
Many AI developers seek to customize models with proprietary data for improved accuracy, but this often requires lengthy iteration cycles: defining a use case and preparing data, selecting a model and customization technique, training the model, and then evaluating it for deployment. The new capability simplifies and accelerates this end-to-end workflow, from data preparation to evaluation and deployment. With an easy-to-use interface, AI developers can quickly get started and customize popular models, including Amazon Nova, Llama, Qwen, DeepSeek, and GPT-OSS, with their own data. They can use supervised fine-tuning as well as the latest customization techniques such as reinforcement learning and direct preference optimization. In addition, an AI agent-guided workflow (in preview) lets developers use natural language to generate synthetic data, analyze data quality, and handle model training and evaluation, all entirely serverless.
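As background on one of the techniques mentioned here, direct preference optimization (DPO) trains a model directly on pairs of preferred and rejected responses. The following is a minimal sketch of the standard per-example DPO loss, not SageMaker-specific code; the function name and inputs are illustrative:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Inputs are summed token log-probabilities of the chosen/rejected responses
    under the policy being trained and under a frozen reference model.
    """
    margin = (policy_chosen - policy_rejected) - (ref_chosen - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more than the reference does,
# the margin is positive and the loss shrinks toward zero.
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))  # positive margin, small loss
```

The loss is minimized by widening the policy's preference margin relative to the reference model, which is what the managed customization workflow optimizes when DPO is selected.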
This capability is available in the following AWS Regions: US East (N. Virginia), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). To join the waitlist for the AI agent-guided workflow, visit the sign-up page.
To learn more, visit the SageMaker AI model customization page and blog.
Source: aws.amazon.com

Announcing TypeScript support in Strands Agents (preview) and more

In May, we open sourced the Strands Agents SDK, an open-source Python framework that takes a model-driven approach to building and running AI agents in just a few lines of code. Today, we’re announcing that TypeScript support is available in preview, so developers can choose between Python and TypeScript for building Strands agents. TypeScript support in Strands is designed to provide an idiomatic TypeScript experience with full type safety, async/await support, and modern JavaScript/TypeScript patterns. Strands runs easily in client applications, in browsers, and in server-side applications on runtimes like AWS Lambda and Amazon Bedrock AgentCore, and developers can build their entire stack in TypeScript using the AWS CDK. We’re also announcing three additional updates for the Strands SDK. First, edge device support for Strands Agents is generally available, extending the SDK with bidirectional streaming and additional local model providers like llama.cpp that let you run agents on small-scale devices using local models. Second, Strands steering is now available as an experimental feature, giving developers a modular prompting mechanism that provides feedback to the agent at the right moment in its lifecycle, steering agents toward a desired outcome without rigid workflows. Finally, Strands evaluations is available in preview, giving developers the ability to systematically validate agent behavior, measure improvements, and deploy with confidence during development cycles. Head to the Strands Agents GitHub to get started building.
Source: aws.amazon.com

Amazon SageMaker HyperPod now supports checkpointless training

Amazon SageMaker HyperPod now supports checkpointless training, a new foundation model training capability that removes the need for a checkpoint-based, job-level restart for fault recovery. Checkpointless training maintains forward training momentum despite failures, reducing recovery time from hours to minutes. This represents a fundamental shift from traditional checkpoint-based recovery, where failures require pausing the entire training cluster, diagnosing issues manually, and restoring from saved checkpoints, a process that can leave expensive AI accelerators idle for hours and waste your organization's compute spend.
Checkpointless training transforms this paradigm by preserving the model training state across the distributed cluster, automatically swapping out faulty training nodes on the fly and using peer-to-peer state transfer from healthy accelerators for failure recovery. By removing checkpoint dependencies during recovery, checkpointless training can help your organization save on idle AI accelerator costs and shorten time to train. Even at larger scales, checkpointless training on Amazon SageMaker HyperPod enables upwards of 95% training goodput on clusters with thousands of AI accelerators.
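To make the recovery mechanism concrete, here is a toy Python sketch of the idea, not SageMaker HyperPod code, with all names hypothetical: each rank's training state is mirrored in a peer's memory after every step, so when a node fails, a spare is swapped in and restored from the in-memory peer copy instead of rolling the whole job back to a disk checkpoint.

```python
import copy

def train_with_peer_recovery(num_nodes=4, steps=10, fail_at=5):
    """Toy simulation of checkpointless recovery via peer state mirrors."""
    # Each rank holds a shard of training state; peers hold a mirror of it.
    states = [{"rank": r, "step": 0} for r in range(num_nodes)]
    mirrors = [copy.deepcopy(s) for s in states]
    recoveries = 0
    for step in range(steps):
        if step == fail_at:
            # Simulated hardware fault on rank 1: swap in a spare node and
            # restore its shard from the peer mirror, not a disk checkpoint.
            states[1] = copy.deepcopy(mirrors[1])
            recoveries += 1
        for s in states:
            s["step"] = step + 1  # one synchronous training step completes
        mirrors = [copy.deepcopy(s) for s in states]  # refresh peer copies
    return [s["step"] for s in states], recoveries

final_steps, recoveries = train_with_peer_recovery()
print(final_steps, recoveries)  # all ranks finish step 10 after 1 recovery
```

The key property the sketch illustrates is that recovery restores only the failed shard from a live peer, so the cluster loses at most the in-flight step rather than everything since the last checkpoint.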
Checkpointless training on SageMaker HyperPod is available in all AWS Regions where Amazon SageMaker HyperPod is currently available. You can enable checkpointless training with zero code changes using HyperPod recipes for popular publicly available models such as Llama and GPT-OSS. For custom model architectures, you can integrate checkpointless training components with minimal modifications for PyTorch-based workflows, making it accessible to your teams regardless of their distributed training expertise.
To get started, visit the Amazon SageMaker HyperPod product page and see the checkpointless training GitHub page for implementation guidance.
Source: aws.amazon.com

Azure networking updates on security, reliability, and high availability

Enabling the next wave of cloud transformation with Azure Networking

The cloud landscape is evolving at an unprecedented pace, driven by the exponential growth of AI workloads and the need for seamless, secure, and high-performance connectivity. Azure Network services stand at the forefront of this transformation, delivering the hyperscale infrastructure, intelligent services, and resilient architecture that empower organizations to innovate and scale with confidence.


Azure’s global network is purpose-built to meet the demands of modern AI and cloud applications. With over 60 AI regions, 500,000+ miles of fiber, and more than 4 petabits per second (Pbps) of WAN capacity, Azure’s backbone is engineered for massive scale and reliability. The network has tripled its overall capacity since the end of FY24, now reaching 18 Pbps, ensuring that customers can run the most demanding AI and data workloads with uncompromising performance.

In this blog, I am excited to share our advancements in data center networking, which provide the core infrastructure to train AI models at massive scale, as well as our latest product announcements that strengthen the resilience, security, scale, and capabilities needed to run cloud-native workloads with optimized performance and cost.

AI at the heart of the cloud

AI is not just a workload—it’s the engine driving the next generation of cloud systems. Azure’s network fabric is optimized for AI at every layer, supporting long-lasting, high-bandwidth flows for model training, low-latency intra-datacenter fabrics for GPU clusters, and secure, lossless traffic management. Azure’s architecture integrates InfiniBand and high-speed Ethernet to deliver ultra-fast, lossless data transfer between compute and storage, minimizing training times and maximizing efficiency. Azure’s network is built to support workloads with distributed GPU pools across datacenters and regions using a dedicated AI WAN. Distributed GPU clusters connect to the services running in Azure regions via a dedicated, private connection that uses Azure Private Link and hardware-based VNet appliances running high-performance DPUs.

Azure Network services are designed to support users at every stage—from migrating on-premises workloads to the cloud, to modernizing applications with advanced services, to building cloud-native and AI-powered solutions. Whether it’s seamless VNet integration, ExpressRoute for private connectivity, or advanced container networking for Kubernetes, Azure provides the tools and services to connect, build, and secure the cloud of tomorrow.

Resilient by default

Resiliency is foundational to Azure Networking’s mission, and we continue to execute on the goal of providing resiliency by default. Continuing the trend of offering zone-resilient SKUs of our gateways (ExpressRoute, VPN, and Application Gateway), the latest to join the list is Azure NAT Gateway. At Ignite 2025, we announced the public preview of Standard NAT Gateway V2, which offers a zone-redundant architecture for outbound connectivity at no additional cost. Zone-redundant NAT gateways automatically distribute traffic to available zones during an outage of a single zone. NAT Gateway V2 also supports 100 Gbps of total throughput, can handle 10 million packets per second, is IPv6-ready out of the gate, and provides traffic insights with flow logs. Read the NAT Gateway blog for more information.

Pushing the boundaries on security

We continue to advance our platform with security as the top mission, adhering to the principles of the Secure Future Initiative. Along these lines, we are happy to announce the following capabilities in preview or general availability (GA):

DNS Security Policy with Threat Intel: Now generally available, this feature provides smart protection with continuous updates, monitoring, and blocking of known malicious domains.

Private Link Direct Connect: Now in public preview, this extends Private Link connectivity to any routable private IP address, supporting disconnected VNets and external SaaS providers, with enhanced auditing and compliance support.

JWT Validation in Application Gateway: Application Gateway now supports JSON Web Token (JWT) validation in public preview, delivering native JWT validation at Layer 7 for web applications, APIs, and service-to-service (S2S) or machine-to-machine (M2M) communication. This feature shifts token validation from backend servers to the Application Gateway, improving performance and reducing complexity, and offers consistent, centralized, secure-by-default Layer 7 controls that let teams build and innovate faster while maintaining a trustworthy security posture.

Forced tunneling for VWAN Secure Hubs: Forced tunneling allows you to configure Azure Virtual WAN to inspect Internet-bound traffic with a security solution deployed in the Virtual WAN hub and route the inspected traffic to a designated next hop instead of directly to the Internet. You can route Internet traffic to an edge firewall connected to Virtual WAN via the default route learned from ExpressRoute, VPN, or SD-WAN, or to your favorite Network Virtual Appliance or SASE solution deployed in a spoke virtual network connected to Virtual WAN.
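The JWT validation capability above moves token checks off backend servers and onto the gateway. To make the mechanics concrete, here is a minimal, self-contained sketch of what validating an HS256 token involves, written in illustrative Python with standard-library crypto only; it is not Application Gateway code, and the helper names are hypothetical:

```python
import base64, hashlib, hmac, json, time

def _b64url_decode(data: str) -> bytes:
    # JWTs use unpadded base64url; restore padding before decoding.
    return base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))

def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(claims: dict, secret: bytes) -> str:
    """Issue a demo HS256 token (what an identity provider would do)."""
    header_b64 = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload_b64 = _b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    return f"{header_b64}.{payload_b64}.{_b64url(sig)}"

def validate_jwt(token: str, secret: bytes) -> dict:
    """Check signature and expiry, returning claims or raising ValueError."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token")
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if claims.get("exp", float("inf")) < time.time():
        raise ValueError("token expired")
    return claims
```

Performing these checks once at the gateway, rather than in every backend service, is the complexity and performance win the announcement describes; production deployments typically also validate RS256 signatures against the issuer's published keys, which this sketch omits.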

Providing ubiquitous scale

Scale is of utmost importance to customers looking to fine-tune their AI models or run low-latency inferencing for their AI/ML workloads. Enhanced VPN and ExpressRoute connectivity and scalable private endpoints further strengthen the platform’s reliability and future-readiness.

ExpressRoute 400G: Azure will support 400G ExpressRoute Direct ports in select locations starting in 2026. Customers can combine multiple of these ports for multi-terabit throughput over a dedicated private connection to on-premises or remote GPU sites.

High throughput VPN Gateway: We are announcing the general availability of 3x faster VPN gateway connectivity, with support for a single TCP flow of 5 Gbps and a total throughput of 20 Gbps across four tunnels.

High scale Private Link: We are also increasing the total number of private endpoints allowed in a virtual network to 5,000, and the number of cross-peered VNets to 20,000.

Advanced traffic filtering for storage optimization in Azure Network Watcher: Targeted traffic logs help optimize storage costs, accelerate analysis, and simplify configuration and management.

Enhancing the experience of cloud native applications

Elasticity and seamless scaling are essential capabilities that Azure customers deploying containerized apps expect and rely on. AKS is an ideal platform for deploying and managing containerized applications that require high availability, scalability, and portability. Azure’s Advanced Container Networking Services offering is natively integrated with AKS as a managed networking add-on for workloads that require high-performance networking, essential security, and pod-level observability.

We are happy to announce the product updates below in this space:

eBPF Host Routing in Advanced Container Networking Services for AKS: By embedding routing logic directly into the Linux kernel, this feature reduces latency and increases throughput for containerized applications.

Pod CIDR Expansion in Azure CNI Overlay for AKS: This new capability allows users to expand existing pod CIDR ranges, enhancing scalability and adaptability for large Kubernetes workloads without redeploying clusters.

WAF for Azure Application Gateway for Containers: Now generally available, this brings secure-by-design web application firewall capabilities to AKS, ensuring operational consistency and seamless policy management for containerized workloads.

Azure Bastion now enables secure, simplified access to private AKS clusters, reducing setup effort while maintaining isolation and lowering costs for users.
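To illustrate the Pod CIDR expansion item above: in Azure CNI Overlay, each node typically draws a /24 block from the pod CIDR, so adding a second range grows the pool of node blocks without redeploying the cluster. A rough sketch using Python's standard ipaddress module follows; pod_capacity is an illustrative helper, and the /24-per-node figure is the common default rather than a fixed rule:

```python
import ipaddress

def pod_capacity(cidrs, node_block_prefix=24):
    """Count how many per-node address blocks the configured pod CIDR ranges yield."""
    total = 0
    for cidr in cidrs:
        net = ipaddress.ip_network(cidr)
        if net.prefixlen <= node_block_prefix:
            # A /16 range contains 2^(24-16) = 256 distinct /24 node blocks.
            total += 2 ** (node_block_prefix - net.prefixlen)
    return total

print(pod_capacity(["10.244.0.0/16"]))                   # 256 node blocks
print(pod_capacity(["10.244.0.0/16", "10.245.0.0/16"]))  # 512 after expansion
```

The arithmetic shows why in-place CIDR expansion matters: a cluster approaching its node-block limit can keep scheduling new nodes simply by appending a non-overlapping range.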

These innovations reflect Azure Networking’s commitment to delivering secure, scalable, and future-ready solutions for every stage of your cloud journey. For a full list of updates, visit the official Azure updates page.

Get started with Azure Networking

Azure Networking is more than infrastructure—it’s the catalyst for foundational digital transformation, empowering enterprises to harness the full potential of the cloud and AI. As organizations navigate their cloud journeys, Azure stands ready to connect, secure, and accelerate innovation at every step.

All updates in one spot
From Azure DNS to Virtual Network, stay informed on what's new with Azure Networking.


The post Azure networking updates on security, reliability, and high availability appeared first on Microsoft Azure Blog.
Source: Azure