Docker Model Runner Brings vLLM to macOS with Apple Silicon

vLLM has quickly become the go-to inference engine for developers who need high-throughput LLM serving. We brought vLLM to Docker Model Runner for NVIDIA GPUs on Linux, then extended it to Windows via WSL2.

That changes today. Docker Model Runner now supports vllm-metal, a new backend that brings vLLM inference to macOS using Apple Silicon’s Metal GPU. If you have a Mac with an M-series chip, you can now run MLX models through vLLM with the same OpenAI-compatible API, the same Anthropic-compatible API for tools like Claude Code, and the same Docker workflow.

What is vllm-metal?

vllm-metal is a plugin for vLLM that brings high-performance LLM inference to Apple Silicon. Developed in collaboration between Docker and the vLLM project, it unifies MLX, Apple’s machine learning framework, and PyTorch under a single compute pathway, plugging directly into vLLM’s existing engine, scheduler, and OpenAI-compatible API server.

The architecture is layered: vLLM’s core (engine, scheduler, tokenizer, API) stays unchanged on top. A plugin layer consisting of MetalPlatform, MetalWorker, and MetalModelRunner handles the Apple Silicon specifics. Underneath, MLX drives the actual inference while PyTorch handles model loading and weight conversion. The whole stack runs on Metal, Apple’s GPU framework.

+-------------------------------------------------------------+
|                          vLLM Core                          |
|          Engine | Scheduler | API | Tokenizers              |
+-------------------------------------------------------------+
                              |
                              v
+-------------------------------------------------------------+
|                   vllm_metal Plugin Layer                   |
|  +-----------+  +-----------+  +------------------------+   |
|  | Platform  |  |  Worker   |  |      ModelRunner       |   |
|  +-----------+  +-----------+  +------------------------+   |
+-------------------------------------------------------------+
                              |
                              v
+-------------------------------------------------------------+
|                   Unified Compute Backend                   |
|  +------------------+  +----------------------------+       |
|  |  MLX (Primary)   |  |     PyTorch (Interop)      |       |
|  |  - SDPA          |  |  - HF Loading              |       |
|  |  - RMSNorm       |  |  - Weight Conversion       |       |
|  |  - RoPE          |  |  - Tensor Bridge           |       |
|  |  - Cache Ops     |  |                            |       |
|  +------------------+  +----------------------------+       |
+-------------------------------------------------------------+
                              |
                              v
+-------------------------------------------------------------+
|                       Metal GPU Layer                       |
|         Apple Silicon Unified Memory Architecture           |
+-------------------------------------------------------------+

Figure 1: High-level architecture diagram of vllm-metal. Credit: vllm-metal

What makes this particularly effective on Apple Silicon is unified memory. Unlike discrete GPUs where data must be copied between CPU and GPU memory, Apple Silicon shares a single memory pool. vllm-metal exploits this with zero-copy tensor operations. Combined with paged attention for efficient KV cache management and Grouped-Query Attention support, this means you can serve longer sequences with less memory waste.
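The paged-attention idea can be sketched in a few lines: instead of reserving one contiguous KV-cache buffer per sequence, fixed-size blocks are allocated on demand and tracked in a per-sequence block table. This is a deliberately simplified illustration of the technique, not vllm-metal’s actual allocator:

```python
# Minimal sketch of paged KV-cache block allocation (illustrative only;
# not the actual vllm-metal implementation).
BLOCK_SIZE = 16  # tokens per cache block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq_id -> block ids

    def append_token(self, seq_id: str, position: int) -> int:
        """Return the block holding `position`, allocating on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):
            table.append(self.free_blocks.pop())  # allocate one new block
        return table[position // BLOCK_SIZE]

cache = PagedKVCache(num_blocks=64)
for pos in range(40):            # a 40-token sequence...
    cache.append_token("seq-0", pos)
# ...occupies ceil(40 / 16) = 3 blocks rather than a full preallocated buffer
print(len(cache.block_tables["seq-0"]))  # 3
```

Because blocks are only allocated as a sequence grows, memory that would otherwise be reserved for the longest possible sequence stays free for other requests.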

vllm-metal runs MLX models published by the mlx-community on Hugging Face. These models are built specifically for the MLX framework and take full advantage of Metal GPU acceleration. Docker Model Runner automatically routes MLX models to vllm-metal when the backend is installed, falling back to the built-in MLX backend otherwise.

How vllm-metal works

vllm-metal runs natively on the host. This is necessary because Metal GPU access requires direct hardware access and there is no GPU passthrough for Metal in containers.

When you install the backend, Docker Model Runner:

Pulls a Docker image from Hub that contains a self-contained Python 3.12 environment with vllm-metal and all dependencies pre-packaged.

Extracts it to `~/.docker/model-runner/vllm-metal/`.

Verifies the installation by importing the `vllm_metal` module.

When a request comes in for a compatible model, the Docker Model Runner’s scheduler starts a vllm-metal server process that communicates over TCP, serving the standard OpenAI API. The model is loaded from Docker’s shared model store, which contains all the models you pull with `docker model pull`.
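Since the server speaks the standard OpenAI API, any OpenAI-compatible client works. The sketch below uses only the Python standard library; the base URL is an assumption (recent Docker Desktop releases expose Model Runner on port 12434), and the model name is an example, so adjust both for your setup:

```python
import json
import urllib.request

# Assumed default Model Runner endpoint; adjust host/port if your
# Docker Desktop exposes it elsewhere.
BASE_URL = "http://localhost:12434/engines/v1"

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a standard OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

def ask(model: str, prompt: str) -> str:
    """Send the request and return the assistant's reply."""
    with urllib.request.urlopen(build_chat_request(model, prompt)) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

With a pulled MLX model, a call might look like `ask("mlx-community/Llama-3.2-1B-Instruct-4bit", "Say hello.")` (model reference shown is illustrative).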

Which models work with vllm-metal?

vllm-metal works with safetensors models in MLX format. The mlx-community on Hugging Face maintains a large collection of quantized models optimized for Apple Silicon. Some examples you can try:

https://huggingface.co/mlx-community/Llama-3.2-1B-Instruct-4bit

https://huggingface.co/mlx-community/Mistral-7B-Instruct-v0.3-4bit

https://huggingface.co/mlx-community/Qwen3-Coder-Next-4bit

vLLM everywhere with Docker Model Runner

With vllm-metal, Docker Model Runner now supports vLLM across the three major platforms:

| Platform | Backend | GPU |
|---|---|---|
| Linux | vllm | NVIDIA (CUDA) |
| Windows (WSL2) | vllm | NVIDIA (CUDA) |
| macOS | vllm-metal | Apple Silicon (Metal) |

The same docker model commands work regardless of platform. Pull a model, run it. Docker Model Runner picks the right backend for your platform.
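The workflow can be sketched with one of the MLX models listed earlier. The `hf.co/` pull syntax for Hugging Face models is assumed here; check the Model Runner docs for the exact reference format in your version:

```shell
# Pull an MLX model from Hugging Face and chat with it.
docker model pull hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit
docker model run hf.co/mlx-community/Llama-3.2-1B-Instruct-4bit "Hello!"
```

On macOS these commands are served by vllm-metal when the backend is installed; the commands themselves are identical on every platform.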

Get started

Update to Docker Desktop 4.62 or later for Mac, and install the backend:

docker model install-runner --backend vllm-metal

Check out the Docker Model Runner documentation to learn more. For contributions, feedback, and bug reports, visit the docker/model-runner repository on GitHub.

Giving Back: vllm-metal is Now Open Source

At Docker, we believe that the best way to accelerate AI development is to build in the open. That is why we are proud to announce that Docker has contributed the vllm-metal project to the vLLM community. Originally developed by Docker engineers to power Model Runner on macOS, the project now lives under the vLLM GitHub organization, ensuring that every developer in the ecosystem can benefit from and contribute to high-performance inference on Apple Silicon. The project has also received significant contributions from Lik Xun Yuan, Ricky Chen, and Ranran Haoran Zhang.

The $599 AI Development Rig

For a long time, high-throughput vLLM development was gated behind a significant GPU cost. To get started, you typically needed a dedicated Linux box with an RTX 4090 ($1,700+) or enterprise-grade A100/H100 cards ($10,000+).

vllm-metal changes the math

Now, a base $599 Mac Mini with an M4 chip becomes a viable vLLM development environment. Because Apple Silicon uses Unified Memory, that 16GB (or upgraded 32GB/64GB) of RAM is directly accessible by the GPU. This allows you to:

Develop & Test Locally: Build your vLLM-based applications on the same machine you use for coding.

Production-Mirroring: Use the exact same OpenAI-compatible API on your Mac Mini as you would on an H100 cluster in production.

Energy Efficiency: Run inference at a fraction of the power consumption (and heat) of a discrete GPU rig.
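A quick back-of-the-envelope check makes the unified-memory point concrete: a quantized model’s weights need roughly `params × bits / 8` bytes, all of it GPU-visible on Apple Silicon. This rough sketch ignores KV cache, activations, and framework overhead:

```python
def weight_gb(params_billions: float, bits: int) -> float:
    """Approximate weight footprint in GB for a quantized model."""
    return params_billions * 1e9 * bits / 8 / 1024**3

# A 7B model at 4-bit fits comfortably in a 16 GB unified-memory Mac Mini;
# the same model at 16-bit would not.
print(round(weight_gb(7, 4), 1))   # 3.3 (GB)
print(round(weight_gb(7, 16), 1))  # 13.0 (GB)
```

In practice you should budget extra headroom for the KV cache, which grows with context length and concurrency.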

How does vllm-metal compare to llama.cpp?

We benchmarked both backends using Llama 3.2 1B Instruct with comparable 4-bit quantization, served through Docker Model Runner on Apple Silicon.

| | llama.cpp | vLLM-Metal |
|---|---|---|
| Model | unsloth/Llama-3.2-1B-Instruct-GGUF:Q4_0 | mlx-community/llama-3.2-1b-instruct-4bit |
| Format | GGUF (Q4_0) | Safetensors (MLX 4-bit) |

Throughput (tokens/sec, wall-clock):

| max_tokens | llama.cpp | vLLM-Metal | speedup |
|---|---|---|---|
| 128 | 333.3 | 251.5 | 1.3x |
| 512 | 345.1 | 279.0 | 1.3x |
| 1024 | 338.5 | 275.4 | 1.2x |
| 2048 | 339.1 | 279.5 | 1.2x |

Each configuration was run 3 times across 3 different prompts (9 total requests per data point).

Throughput is measured as completion_tokens / wall_clock_time, applied consistently to both backends.
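That metric is simple to reproduce: time the request on the client and divide completion tokens by elapsed wall-clock seconds, then average across runs. A sketch of the calculation (not the exact benchmark script):

```python
def throughput_tok_per_s(completion_tokens: int, elapsed_s: float) -> float:
    """Throughput as completion_tokens / wall_clock_time."""
    return completion_tokens / elapsed_s

# Example: 512 completion tokens generated in 2.0 s of wall-clock time.
print(throughput_tok_per_s(512, 2.0))  # 256.0

# Each data point in the table is then the mean over all requests
# for that configuration (hypothetical per-run timings shown):
samples = [throughput_tok_per_s(512, t) for t in (2.0, 1.9, 2.1)]
mean = sum(samples) / len(samples)
```

Averaging over several prompts and repetitions, as the benchmark does, smooths out per-request variance.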

Key observations:

llama.cpp is consistently ~1.2-1.3x faster than vLLM-Metal across all output lengths.

llama.cpp throughput is remarkably stable (~333-345 tok/s regardless of max_tokens), while vLLM-Metal shows more variance between individual runs (134-343 tok/s).

Both backends scale well. Neither backend shows significant degradation as output length increases.

Quantization methods differ (GGUF Q4_0 vs MLX 4-bit), so this benchmarks the full stack, engine + quantization, rather than the engine alone.

The benchmark script used for these results is available as a GitHub Gist.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. To get involved:

Star the repository: Show your support by starring the Docker Model Runner repo.

Contribute your ideas: Create an issue or submit a pull request. We’re excited to see what ideas you have!

Spread the word: Tell your friends and colleagues who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!

Learn More

Read the companion post: OpenCode with Docker Model Runner for Private AI Coding

Check out the Docker Model Runner General Availability announcement

Visit our Model Runner GitHub repo

Get started with a simple hello GenAI application

Source: https://blog.docker.com/feed/

AWS Lambda Durable Execution SDK for Java now available in Developer Preview

Today, AWS announces the developer preview of the AWS Lambda Durable Execution SDK for Java. With this SDK, developers can build resilient multi-step applications like order processing pipelines, AI-assisted workflows, and human-in-the-loop approvals using Lambda durable functions, without implementing custom progress tracking or integrating external orchestration services.
Lambda durable functions extend Lambda’s event-driven programming model with operations that checkpoint progress automatically and pause execution for up to a year when waiting on external events. The new Durable Execution SDK for Java provides an idiomatic experience for building with durable functions and is compatible with Java 17+. This preview includes steps for progress tracking, waits for efficient suspension, and durable futures for callback-based workflows.
To get started, see the Lambda durable functions developer guide and the AWS Lambda Durable Execution SDK for Java on GitHub. To learn more about Lambda durable functions, visit the product page.
On-demand functions are not billed for duration while paused. For pricing details, see AWS Lambda Pricing. For information about AWS Regions where Lambda durable functions are available, see the AWS Regional Services List.
Source: aws.amazon.com

Amazon Cognito enhances client secret management with secret rotation and custom secrets

Amazon Cognito enhances client secret lifecycle management for app clients of Cognito user pools by adding client secret rotation and support for custom client secrets. Cognito helps you implement secure sign-in and access control for users, AI agents, and microservices in minutes, and a Cognito app client is a configuration that interacts with one mobile or web application that authenticates with Cognito. Previously, Cognito automatically generated all app client secrets. With this launch, in addition to the automatically generated secrets, you have the option to bring your own custom client secrets for new or existing app clients. Additionally, you can now rotate client secrets on-demand and maintain up to two active client secrets per app client.
The new client secret lifecycle management capabilities address needs for organizations with periodic credential rotation requirements, companies improving security posture, and enterprises migrating from other authentication systems to Cognito. Maintaining two active secrets per app client allows gradual transition to the new secret without application downtime.
Client secret rotation and custom client secrets are available in all AWS Regions where Amazon Cognito user pools are available. To learn more, see the Amazon Cognito Developer Guide. You can get started using the new capabilities through the AWS Management Console, AWS Command Line Interface (CLI), AWS Software Development Kits (SDKs), or AWS CloudFormation.
Source: aws.amazon.com

AWS Security Hub launches Extended plan for pay-as-you-go partner solutions

Today, we’re announcing the general availability of AWS Security Hub Extended, a new plan that extends unified security operations across your enterprise through a single-vendor experience. This plan helps address the complexity of managing multiple vendor relationships and lengthy procurement cycles by bringing together the best of AWS detection services and curated partner security solutions.

The Security Hub Extended plan delivers three critical advantages. First, it helps streamline procurement by consolidating solution usage into one bill, reducing procurement complexity while preserving direct access to each provider’s domain expertise. AWS Enterprise Support customers also benefit from unified Level 1 support from AWS. Second, it enables you to establish more comprehensive protection by bringing together the best of AWS detection services with curated partner solutions across endpoint, identity, email, network, data, browser, cloud, AI, and security operations. Third, it helps enhance operational efficiency by streamlining security findings in a standard format, providing centralized visibility across your security environment while reducing the burden of manual integration work.

You can access and review partner solutions across security categories through the Security Hub console, selecting only the solutions you need with flexible pay-as-you-go or flat-rate pricing, with no upfront investments or long-term commitments required. With AWS as the seller of record, the Extended plan may be eligible for AWS Private Pricing opportunities. This gives you the flexibility to add or remove security categories as your business needs evolve, while enabling you to streamline vendor contract negotiations and consolidate billing.

For a list of AWS commercial Regions where Security Hub is available, see the AWS Region table. For more information about pricing, visit the AWS Security Hub pricing page. To get started, visit the AWS Security Hub console or product page.
Source: aws.amazon.com