Amazon AWS News, Entwicklungen, Updates, HowTos - Seite 8 von 668

Amazon SageMaker HyperPod now supports Managed Tiered KV Cache and Intelligent Routing for large language model (LLM) inference, enabling customers to optimize inference performance for long-context prompts and multi-turn conversations. Customers deploying production LLM applications need fast response times while processing lengthy documents or maintaining conversation context, but traditional inference approaches require recalculating attention mechanisms for all previous tokens with each new token generation, creating computational overhead and escalating costs. Managed Tiered KV Cache addresses this challenge by intelligently caching and reusing computed values, while Intelligent Routing directs requests to optimal instances. These capabilities deliver up to 40% latency reduction, 25% throughput improvement, and 25% cost savings compared to baseline configurations. The Managed Tiered KV Cache feature uses a two-tier architecture combining local CPU memory (L1) with disaggregated cluster-wide storage (L2). AWS-native disaggregated tiered storage is the recommended backend, providing scalable terabyte-scale capacity and automatic tiering from CPU memory to local SSD for optimal memory and storage utilization. We also offer Redis as an alternative L2 cache option. The architecture enables efficient reuse of previously computed key-value pairs across requests. The newly introduced Intelligent Routing maximizes cache utilization through three configurable strategies: prefix-aware routing for common prompt patterns, KV-aware routing for maximum cache efficiency with real-time cache tracking, and round-robin for stateless workloads. These features work seamlessly together. Intelligent routing directs requests to instances with relevant cached data, reducing time to first token in document analysis and maintaining natural conversation flow in multi-turn dialogues. Built-in observability integration with Amazon Managed Grafana provides metrics for monitoring performance. You can enable these features through InferenceEndpointConfig or SageMaker JumpStart when deploying models via the HyperPod Inference Operator on EKS-orchestrated clusters. These features are available in all regions where SageMaker HyperPod is available. To learn more, see the user guide.
Quelle: aws.amazon.com

26. November 2025

da Agency

Amazon SageMaker AI now supports EAGLE speculative decoding

Amazon SageMaker AI now supports EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) speculative decoding to improve large language model inference throughput by up to 2.5x. This capability enables models to predict and validate multiple tokens simultaneously rather than one at a time, improving response times for AI applications. As customers deploy AI applications to production, they need capabilities to serve models with low latency and high throughput to deliver responsive user experiences. Data scientists and ML engineers lack efficient methods to accelerate token generation without sacrificing output quality or requiring complex model re-architecture, making it hard to meet performance expectations under real-world traffic. Teams spend significant time optimizing infrastructure rather than improving their AI applications. With EAGLE speculative decoding, SageMaker AI enables customers to accelerate inference throughput by allowing models to generate and verify multiple tokens in parallel rather than one at a time, maintaining the same output quality while dramatically increasing throughput. SageMaker AI automatically selects between EAGLE 2 and EAGLE 3 based on your model architecture, and provides built-in optimization jobs that use either curated datasets or your own application data to train specialized prediction heads. You can then deploy optimized models through your existing SageMaker AI inference workflow without infrastructure changes, enabling you to deliver faster AI applications with predictable performance. You can use EAGLE speculative decoding in the following AWS Regions: US East (N. Virginia), US West (Oregon), US East (Ohio), Asia Pacific (Tokyo), Europe (Ireland), Asia Pacific (Singapore), and Europe (Frankfurt) To learn more about EAGLE speculative decoding, visit AWS News Blog here, and SageMaker AI documentation here.
Quelle: aws.amazon.com

26. November 2025

da Agency

Manage Amazon SageMaker HyperPod clusters with the new Amazon SageMaker AI MCP Server

The Amazon SageMaker AI MCP Server now supports tools that help you setup and manage HyperPod clusters. Amazon SageMaker HyperPod removes the undifferentiated heavy lifting involved in building generative AI models by quickly scaling model development tasks such as training, fine-tuning, or deployment across a cluster of AI accelerators. The SageMaker AI MCP Server now empowers AI coding assistants to provision and operate AI/ML clusters for model training and deployment. MCP servers in AWS provide a standard interface to enhance AI-assisted application development by equipping AI code assistants with real-time, contextual understanding of various AWS services. The SageMaker AI MCP server comes with tools that streamline end-to-end AI/ML cluster operations using the AI assistant of your choice—from initial setup through ongoing management. It enables AI agents to reliably setup HyperPod clusters orchestrated by Amazon EKS or Slurm complete with pre-requisites, powered by CloudFormation templates that optimize networking, storage, and compute resources. Clusters created via this MCP server are fully optimized for high-performance distributed training and inference workloads, leveraging best practice architectures to maximize throughput and minimize latency at scale. Additionally, it provides comprehensive tools for cluster and node management—including scaling operations, applying software patches, and performing various maintenance tasks. When used in conjunction with AWS API MCP Server, AWS Knowledge MCP Server, and Amazon EKS MCP Server you gain complete coverage for all SageMaker HyperPod APIs and you can effectively troubleshoot common issues, such as diagnosing why a cluster node became inaccessible. For cluster administrators, these tools streamline daily operations. For data scientists, they enable you to set up AI/ML clusters at scale without requiring infrastructure expertise, allowing you to focus on what matters most—training and deploying models. You can manage your AI/ML clusters through the SageMaker AI MCP server in all regions where SageMaker HyperPod is available. To get started, visit the AWS MCP Servers documentation.
Quelle: aws.amazon.com