Amazon CloudWatch now provides lock contention diagnostics for Amazon RDS for PostgreSQL

Amazon CloudWatch Database Insights now provides lock contention diagnostics for Amazon RDS for PostgreSQL instances. This feature helps you identify the root cause behind both ongoing and historical lock contention issues within minutes. The lock contention diagnostics feature is available exclusively in the Advanced mode of CloudWatch Database Insights.

With this launch, you can visualize a locking condition in the Database Insights console, which shows the relationship between blocking and waiting sessions. The visualization helps you quickly identify the dominating sessions, queries, or objects causing lock contention. Additionally, this feature persists historical locking data for 15 months, allowing you to analyze and investigate historical locking conditions. You no longer need to manually run custom queries or rely on application logs to diagnose lock contention issues, streamlining the troubleshooting process.

You can get started with this feature by enabling the Advanced mode of CloudWatch Database Insights on your Amazon RDS for PostgreSQL clusters using the RDS console, AWS APIs, or the AWS SDK, as sketched below. CloudWatch Database Insights delivers database health monitoring aggregated at the fleet level, as well as instance-level dashboards for detailed database and SQL query analysis.

CloudWatch Database Insights is available in all public AWS Regions and offers vCPU-based pricing; see the pricing page for details. For further information, visit the Database Insights documentation.
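If you prefer the CLI to the RDS console, switching an instance to Advanced mode is a single modify call. A hedged sketch, assuming an existing instance and the flag names from the RDS CLI as of this writing (Advanced mode requires Performance Insights with long-term retention); verify against the current AWS CLI reference:

```bash
# Hedged sketch: enable the Advanced mode of CloudWatch Database Insights
# on an existing RDS for PostgreSQL instance. The instance identifier is a
# placeholder; confirm flag names in the current AWS CLI reference.
aws rds modify-db-instance \
  --db-instance-identifier my-postgres-instance \
  --database-insights-mode advanced \
  --enable-performance-insights \
  --performance-insights-retention-period 465 \
  --apply-immediately
```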
Source: aws.amazon.com

Amazon Connect now supports dynamic dialing mode switching for outbound campaigns

Today, AWS announces the general availability of dynamic dialing mode switching for Amazon Connect Outbound Campaigns, which allows contact center administrators to change between preview and non-preview dialing modes during active campaign execution. Previously, campaigns were locked into their initial dialing mode once started, requiring administrators to stop and restart campaigns to adjust strategies. This launch solves the problem of inflexible dialing strategies that couldn’t adapt to real-time business needs and agent availability changes.
Dynamic dialing mode switching enables contact centers to optimize agent productivity and campaign efficiency in real time, without interrupting campaigns. For example, you can automatically switch from progressive dialing to preview mode when handling high-priority contacts that require additional context, then revert when traffic returns to normal patterns. This flexibility is particularly valuable for campaigns with varying contact priorities or fluctuating agent availability throughout the day.
Dynamic dialing mode switching is available at no additional cost in all AWS Regions where Amazon Connect Outbound Campaigns is supported: US East (N. Virginia), US West (Oregon), Canada (Central), Europe (Frankfurt), Europe (London), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), and Africa (Cape Town).
To learn more, see the Amazon Connect Administrator Guide or visit the Amazon Connect website. 
Source: aws.amazon.com

AWS Marketplace now supports multiple purchases of SaaS and Professional Services products

AWS Marketplace now supports Concurrent Agreements for SaaS and Professional Services products, enabling buyers to make multiple purchases for the same product within a single AWS account. Previously, buyers could only maintain one active agreement per product per AWS account, requiring sellers to use workarounds to support expansion deals. Concurrent Agreements removes this constraint, allowing different business units to procure independently with their own negotiated terms and pricing.
Both buyers and sellers benefit from the flexibility Concurrent Agreements provides. Buyers can accept multiple offers for the same product without disrupting existing agreements, supporting multi-team procurement within centralized AWS accounts, mid-term expansions, and repeat purchases. Sellers can close multi-business unit deals that couldn’t happen before, transact expansions immediately instead of waiting for renewal cycles, and eliminate the operational overhead of managing workarounds. 
Concurrent Agreements is enabled by default for all Professional Services listings starting today, with no seller action required. For SaaS listings, sellers must update their AWS Marketplace integration to handle multiple active subscriptions, including updating subscription notifications to use EventBridge and updating entitlement and metering APIs. Starting June 1, 2026, support for Concurrent Agreements will be required for new SaaS products. Sellers who have completed the integration work can opt in to enable Concurrent Agreements for their SaaS products now. 
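The practical upshot for SaaS integrations is that an entitlement lookup can now return several active entitlements for the same customer, which must be aggregated rather than treated as a single record. A minimal boto3 sketch of that aggregation against the AWS Marketplace Entitlement Service; the "Users" dimension name and the summing logic are illustrative assumptions, not prescribed by AWS:

```python
import boto3
from collections import defaultdict

# AWS Marketplace Entitlement Service client. The service has historically
# been served from us-east-1; verify in the current documentation.
client = boto3.client("marketplace-entitlement", region_name="us-east-1")

def entitlements_by_customer(product_code: str) -> dict:
    """Sum entitlement values per customer instead of assuming one active
    agreement per customer -- the core change Concurrent Agreements
    requires of SaaS integrations. Dimension name 'Users' is illustrative."""
    totals = defaultdict(int)
    next_token = None
    while True:
        kwargs = {"ProductCode": product_code}
        if next_token:
            kwargs["NextToken"] = next_token
        page = client.get_entitlements(**kwargs)
        for ent in page.get("Entitlements", []):
            if ent["Dimension"] == "Users":
                totals[ent["CustomerIdentifier"]] += ent["Value"].get("IntegerValue", 0)
        next_token = page.get("NextToken")
        if not next_token:
            break
    return dict(totals)
```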
This capability is available in all AWS Regions where AWS Marketplace is supported. Concurrent Agreements purchasing is available on SaaS products where sellers have completed the integration, and is enabled by default for all Professional Services listings. To learn more about enabling Concurrent Agreements as a seller of SaaS products, review the Concurrent Agreements integration lab.
Source: aws.amazon.com

Amazon ECS Managed Instances now integrates with Amazon EC2 Capacity Reservations

Amazon Elastic Container Service (Amazon ECS) Managed Instances now integrates with Amazon EC2 Capacity Reservations, enabling you to leverage your reserved capacity for predictable workload availability while ECS handles all infrastructure management. This integration helps you balance reliable capacity scaling with cost efficiency, helping you achieve high availability for mission-critical workloads. Amazon ECS Managed Instances is a fully managed compute option designed to eliminate infrastructure management overhead, dynamically scale EC2 instances to match your workload requirements, and continuously optimize task placement to reduce infrastructure costs.

With today's launch, you can configure your ECS Managed Instances capacity providers to use capacity reservations by setting the capacityOptionType parameter to reserved, in addition to the existing spot and on-demand options. You can also specify reservation preferences to optimize cost and availability: use reservations-only to launch EC2 instances exclusively in reserved capacity for maximum predictability, reservations-first to prefer reservations while retaining the flexibility to fall back to on-demand capacity when needed, or reservations-excluded to prevent your capacity provider from using reservations altogether.

To get started, use the AWS Management Console, AWS CLI, AWS CloudFormation, or AWS SDKs to configure your ECS Managed Instances capacity provider by choosing capacityOptionType=reserved and providing a capacity reservation group and reservation strategy, as sketched below. This feature is now available in all AWS Regions. For more details, refer to the documentation.
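As a rough illustration, the options described above might sit together in a capacity provider definition like the following fragment. Only capacityOptionType and the strategy values come from this announcement; the wrapper key and the group/strategy field names are hypothetical placeholders, so confirm the real request shape in the ECS API reference:

```jsonc
// Hedged sketch -- only "capacityOptionType" and the strategy values
// ("reservations-only" | "reservations-first" | "reservations-excluded")
// come from this announcement; "capacityOptions", "reservationStrategy",
// and "capacityReservationGroup" are hypothetical field names.
{
  "capacityOptions": {
    "capacityOptionType": "reserved",
    "reservationStrategy": "reservations-first",
    "capacityReservationGroup": "arn:aws:resource-groups:us-east-1:123456789012:group/my-cr-group"
  }
}
```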
Source: aws.amazon.com

Docker Model Runner Brings vLLM to macOS with Apple Silicon

vLLM has quickly become the go-to inference engine for developers who need high-throughput LLM serving. We brought vLLM to Docker Model Runner for NVIDIA GPUs on Linux, then extended it to Windows via WSL2. Until now, macOS was the missing piece.

That changes today. Docker Model Runner now supports vllm-metal, a new backend that brings vLLM inference to macOS using Apple Silicon's Metal GPU. If you have a Mac with an M-series chip, you can now run MLX models through vLLM with the same OpenAI-compatible API, the same Anthropic-compatible API for tools like Claude Code, and the same Docker workflow.

What is vllm-metal?

vllm-metal is a plugin for vLLM that brings high-performance LLM inference to Apple Silicon. Developed in collaboration between Docker and the vLLM project, it unifies MLX, Apple's machine learning framework, and PyTorch under a single compute pathway, plugging directly into vLLM's existing engine, scheduler, and OpenAI-compatible API server.

The architecture is layered: vLLM’s core (engine, scheduler, tokenizer, API) stays unchanged on top. A plugin layer consisting of MetalPlatform, MetalWorker, and MetalModelRunner handles the Apple Silicon specifics. Underneath, MLX drives the actual inference while PyTorch handles model loading and weight conversion. The whole stack runs on Metal, Apple’s GPU framework.

```
+-------------------------------------------------------------+
|                          vLLM Core                          |
|            Engine | Scheduler | API | Tokenizers            |
+-------------------------------------------------------------+
                              |
                              v
+-------------------------------------------------------------+
|                   vllm_metal Plugin Layer                   |
|   +----------+   +----------+   +---------------------+     |
|   | Platform |   |  Worker  |   |     ModelRunner     |     |
|   +----------+   +----------+   +---------------------+     |
+-------------------------------------------------------------+
                              |
                              v
+-------------------------------------------------------------+
|                   Unified Compute Backend                   |
|   +------------------+   +----------------------------+     |
|   |  MLX (Primary)   |   |     PyTorch (Interop)      |     |
|   |  - SDPA          |   |  - HF Loading              |     |
|   |  - RMSNorm       |   |  - Weight Conversion       |     |
|   |  - RoPE          |   |  - Tensor Bridge           |     |
|   |  - Cache Ops     |   |                            |     |
|   +------------------+   +----------------------------+     |
+-------------------------------------------------------------+
                              |
                              v
+-------------------------------------------------------------+
|                       Metal GPU Layer                       |
|          Apple Silicon Unified Memory Architecture          |
+-------------------------------------------------------------+
```

Figure 1: High-level architecture diagram of vllm-metal. Credit: vllm-metal
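vLLM discovers platform backends through a Python entry-point hook, which is how a plugin layer like this slots in without touching vLLM's core. A minimal sketch of that registration mechanism, using illustrative names rather than the actual vllm-metal source:

```python
# pyproject.toml of a hypothetical platform plugin -- vLLM scans the
# "vllm.platform_plugins" entry-point group at startup:
#
#   [project.entry-points."vllm.platform_plugins"]
#   metal = "vllm_metal:register"

def register() -> str | None:
    """Return the fully qualified name of the Platform subclass when this
    host can use it (an Apple Silicon Mac), or None to stay inactive."""
    import platform
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        # Illustrative path; the real class lives wherever vllm-metal puts it.
        return "vllm_metal.platform.MetalPlatform"
    return None
```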

What makes this particularly effective on Apple Silicon is unified memory. Unlike discrete GPUs where data must be copied between CPU and GPU memory, Apple Silicon shares a single memory pool. vllm-metal exploits this with zero-copy tensor operations. Combined with paged attention for efficient KV cache management and Grouped-Query Attention support, this means you can serve longer sequences with less memory waste.

vllm-metal runs MLX models published by the mlx-community on Hugging Face. These models are built specifically for the MLX framework and take full advantage of Metal GPU acceleration. Docker Model Runner automatically routes MLX models to vllm-metal when the backend is installed, falling back to the built-in MLX backend otherwise.

How vllm-metal works

vllm-metal runs natively on the host. This is necessary because Metal GPU access requires direct hardware access and there is no GPU passthrough for Metal in containers.

When you install the backend, Docker Model Runner:

1. Pulls a Docker image from Docker Hub that contains a self-contained Python 3.12 environment with vllm-metal and all dependencies pre-packaged.
2. Extracts it to `~/.docker/model-runner/vllm-metal/`.
3. Verifies the installation by importing the `vllm_metal` module.

When a request comes in for a compatible model, Docker Model Runner's scheduler starts a vllm-metal server process that communicates over TCP, serving the standard OpenAI API. The model is loaded from Docker's shared model store, which contains all the models you pull with `docker model pull`.
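Because the server exposes the standard OpenAI API, any existing client can talk to it. A quick sketch with curl, assuming TCP host access is enabled in Docker Desktop on the default port (12434 at the time of writing) and that the base path is `/engines/v1`; verify both against the Docker Model Runner documentation:

```bash
# Assumes host-side TCP access is enabled and the model is already pulled.
curl http://localhost:12434/engines/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Say hello from Metal."}]
  }'
```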

Which models work with vllm-metal?

vllm-metal works with safetensors models in MLX format. The mlx-community on Hugging Face maintains a large collection of quantized models optimized for Apple Silicon. Some examples you can try:

https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit

https://huggingface.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit

https://huggingface.co/mlx-community/Qwen3-Coder-Next-4bit
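Pulling and chatting with one of these is a two-command affair; a sketch assuming Docker Model Runner's `hf.co/...` reference syntax for Hugging Face models:

```bash
# Pull the MLX build of Llama 3.2 1B from the mlx-community org on
# Hugging Face, then run a one-off prompt against it.
docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit
docker model run hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit "Write a haiku about unified memory."
```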

vLLM everywhere with Docker Model Runner

With vllm-metal, Docker Model Runner now supports vLLM across the three major platforms:

| Platform | Backend | GPU |
|----------------|------------|------------------------|
| Linux | vllm | NVIDIA (CUDA) |
| Windows (WSL2) | vllm | NVIDIA (CUDA) |
| macOS | vllm-metal | Apple Silicon (Metal) |

The same `docker model` commands work regardless of platform. Pull a model, run it. Docker Model Runner picks the right backend for your platform.

Get started

Update to Docker Desktop 4.62 or later for Mac, and install the backend:

```bash
docker model install-runner --backend vllm-metal
```

Check out the Docker Model Runner documentation to learn more. For contributions, feedback, and bug reports, visit the docker/model-runner repository on GitHub.

Giving Back: vllm-metal is Now Open Source

At Docker, we believe that the best way to accelerate AI development is to build in the open. That is why we are proud to announce that Docker has contributed the vllm-metal project to the vLLM community. Originally developed by Docker engineers to power Model Runner on macOS, this project now lives under the vLLM GitHub organization. This ensures that every developer in the ecosystem can benefit from and contribute to high-performance inference on Apple Silicon. The project has also received significant contributions from Lik Xun Yuan, Ricky Chen, and Ranran Haoran Zhang.

The $599 AI Development Rig

For a long time, high-throughput vLLM development was gated behind significant GPU cost. To get started, you typically needed a dedicated Linux box with an RTX 4090 ($1,700+) or enterprise-grade A100/H100 cards ($10,000+).

vllm-metal changes the math.

Now, a base $599 Mac Mini with an M4 chip becomes a viable vLLM development environment. Because Apple Silicon uses Unified Memory, that 16GB (or upgraded 32GB/64GB) of RAM is directly accessible by the GPU. This allows you to:

Develop & Test Locally: Build your vLLM-based applications on the same machine you use for coding.

Production-Mirroring: Use the exact same OpenAI-compatible API on your Mac Mini as you would on an H100 cluster in production.

Energy Efficiency: Run inference at a fraction of the power consumption (and heat) of a discrete GPU rig.

How does vllm-metal compare to llama.cpp?

We benchmarked both backends using Llama 3.2 1B Instruct with comparable 4-bit quantization, served through Docker Model Runner on Apple Silicon.

|        | llama.cpp | vLLM-Metal |
|--------|-----------|------------|
| Model  | unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_0 | mlx-community/llama-3.2-1b-instruct-4bit |
| Format | GGUF (Q4_0) | Safetensors (MLX 4-bit) |

Throughput (tokens/sec, wall-clock):

| max_tokens | llama.cpp | vLLM-Metal | speedup (llama.cpp over vLLM-Metal) |
|-----------:|----------:|-----------:|------------------------------------:|
| 128  | 333.3 | 251.5 | 1.3x |
| 512  | 345.1 | 279.0 | 1.3x |
| 1024 | 338.5 | 275.4 | 1.2x |
| 2048 | 339.1 | 279.5 | 1.2x |

Each configuration was run 3 times across 3 different prompts (9 total requests per data point).

Throughput is measured as completion_tokens / wall_clock_time, applied consistently to both backends.
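For the curious, the measurement boils down to timing one chat completion and dividing the reported completion tokens by elapsed wall-clock time. A minimal sketch of that calculation (not the actual Gist script; the endpoint URL and model reference are assumptions):

```python
import time
import requests  # third-party: pip install requests

URL = "http://localhost:12434/engines/v1/chat/completions"  # assumed DMR default
MODEL = "hf.co/mlx-community/llama-3.2-1b-instruct-4bit"    # assumed reference

def throughput(prompt: str, max_tokens: int) -> float:
    """Return completion_tokens / wall_clock_time for one request,
    the same definition used for the numbers above."""
    start = time.perf_counter()
    resp = requests.post(URL, json={
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }, timeout=600)
    elapsed = time.perf_counter() - start
    return resp.json()["usage"]["completion_tokens"] / elapsed

# Example: average 3 runs at max_tokens=512, mirroring the table's setup.
print(sum(throughput("Tell me a story.", 512) for _ in range(3)) / 3)
```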

Key observations:

llama.cpp is consistently ~1.2-1.3x faster than vLLM-Metal across all output lengths.

llama.cpp throughput is remarkably stable (~333-345 tok/s regardless of max_tokens), while vLLM-Metal shows more variance between individual runs (134-343 tok/s).

Both backends scale well. Neither backend shows significant degradation as output length increases.

Quantization methods differ (GGUF Q4_0 vs MLX 4-bit), so these results benchmark the full stack (engine plus quantization) rather than the engine alone.

The benchmark script used for these results is available as a GitHub Gist.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. To get involved:

Star the repository: Show your support by starring the Docker Model Runner repo.

Contribute your ideas: Create an issue or submit a pull request. We’re excited to see what ideas you have!

Spread the word: Tell your friends and colleagues who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Learn More

Read the companion post: OpenCode with Docker Model Runner for Private AI Coding

Check out the Docker Model Runner General Availability announcement

Visit our Model Runner GitHub repo

Get started with a simple hello GenAI application

Source: https://blog.docker.com/feed/