Docker Model Runner Integrates vLLM for High-Throughput Inference

Expanding Docker Model Runner’s Capabilities

Today, we’re excited to announce that Docker Model Runner now integrates the vLLM inference engine and safetensors models, unlocking high-throughput AI inference with the same Docker tooling you already use.

When we first introduced Docker Model Runner, our goal was to make it simple for developers to run and experiment with large language models (LLMs) using Docker. We designed it to integrate multiple inference engines from day one, starting with llama.cpp, to make it easy to get models running anywhere.

Now, we’re taking the next step in that journey. With vLLM integration, you can scale AI workloads from low-end to high-end Nvidia hardware, without ever leaving your Docker workflow.

Why vLLM?

vLLM is a high-throughput, open-source inference engine built to serve large language models efficiently at scale. It’s used across the industry for deploying production-grade LLMs thanks to its focus on throughput, latency, and memory efficiency.

Here’s what makes vLLM stand out:

Optimized performance: Uses PagedAttention, an advanced attention algorithm that minimizes memory overhead and maximizes GPU utilization.

Scalable serving: Handles batch requests and streaming outputs natively, perfect for interactive and high-traffic AI services.

Model flexibility: Works seamlessly with popular open-weight models like GPT-OSS, Qwen3, Mistral, Llama 3, and others in the safetensors format.

By bringing vLLM to Docker Model Runner, we’re bridging the gap between fast local experimentation and robust production inference.

How It Works

Running vLLM models with Docker Model Runner is as simple as installing the backend and running your model, no special setup required.

Install Docker Model Runner with the vLLM backend:

docker model install-runner --backend vllm --gpu cuda

Once the installation finishes, you’re ready to start using it right away:

docker model run ai/smollm2-vllm "Can you read me?"

Sure, I am ready to read you.

Or access it via API:

curl --location 'http://localhost:12434/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{
    "model": "ai/smollm2-vllm",
    "messages": [
        {
            "role": "user",
            "content": "Can you read me?"
        }
    ]
}'
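
The request above targets an OpenAI-compatible chat completions endpoint, so the reply text should be extractable from the JSON response with a tool like jq (assuming jq is installed locally):

curl -s --location 'http://localhost:12434/v1/chat/completions' \
--header 'Content-Type: application/json' \
--data '{"model": "ai/smollm2-vllm", "messages": [{"role": "user", "content": "Can you read me?"}]}' \
| jq -r '.choices[0].message.content'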

Note that there’s no reference to vLLM in the HTTP request or CLI command.

That’s because Docker Model Runner automatically routes the request to the correct inference engine based on the model you’re using, ensuring a seamless experience whether you’re using llama.cpp or vLLM.
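
For example, pulling a GGUF model and a safetensors model side by side keeps exactly the same workflow; only the model name changes. (The ai/smollm2 tag for the GGUF variant is an assumption for illustration, not a name taken from this announcement.)

# GGUF model, served by llama.cpp (tag assumed for illustration)
docker model pull ai/smollm2
docker model run ai/smollm2 "Can you read me?"

# safetensors model, served by vLLM
docker model pull ai/smollm2-vllm
docker model run ai/smollm2-vllm "Can you read me?"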

Why Multiple Inference Engines?

Until now, developers had to choose between simplicity and performance. You could either run models easily (using simplified portable tools like Docker Model Runner with llama.cpp) or achieve maximum throughput (with frameworks like vLLM).

Docker Model Runner now gives you both.

You can:

Prototype locally with llama.cpp.

Scale to production with vLLM.

Use the same consistent Docker commands, CI/CD workflows, and deployment environments throughout.
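
As a rough sketch, the same few commands can double as a CI smoke test, regardless of which engine ends up serving the model; the model name and prompt below simply reuse the examples above:

#!/bin/sh
set -e
# Minimal smoke test: identical steps whether the model resolves to llama.cpp or vLLM.
MODEL="ai/smollm2-vllm"

docker model pull "$MODEL"
docker model run "$MODEL" "ping"

# Fail the pipeline if the API does not answer with a 2xx response.
curl -fsS http://localhost:12434/v1/chat/completions \
  --header 'Content-Type: application/json' \
  --data "{\"model\": \"$MODEL\", \"messages\": [{\"role\": \"user\", \"content\": \"ping\"}]}"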

This flexibility makes Docker Model Runner a first in the industry — no other tool lets you switch between multiple inference engines within a single, portable, containerized workflow.

By unifying these engines under one interface, Docker is making AI truly portable, from laptops to clusters, and everything in between.

Safetensors (vLLM) vs. GGUF (llama.cpp): Choosing the Right Format

With the addition of vLLM, Docker Model Runner is now compatible with the two dominant open-source model formats: Safetensors and GGUF. While Model Runner abstracts away the complexity of setting up each engine, understanding the difference between these formats helps in choosing the right tool for your infrastructure.

GGUF (GPT-Generated Unified Format): The native format for llama.cpp, GGUF is designed for high portability and quantization. It is excellent for running models on commodity hardware where memory bandwidth is limited. It packages the model architecture and weights into a single file.

Safetensors: The native format for vLLM and the standard weight format of the Hugging Face ecosystem, safetensors stores model weights in a form that is safe and fast to load, which makes it the modern standard for high-end, high-throughput inference.

Docker Model Runner intelligently routes your request: if you pull a GGUF model, it utilizes llama.cpp; if you pull a safetensors model, it leverages the power of vLLM. With Docker Model Runner, both can be pushed and pulled as OCI images to any OCI registry.
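
As an illustration, moving a model through a private registry could look roughly like this; registry.example.com and the team/smollm2-vllm repository are placeholders, and docker model tag is assumed to mirror the behavior of docker tag for images:

docker model pull ai/smollm2-vllm
docker model tag ai/smollm2-vllm registry.example.com/team/smollm2-vllm
docker model push registry.example.com/team/smollm2-vllm

# Later, on another machine or in a production environment:
docker model pull registry.example.com/team/smollm2-vllm
docker model run registry.example.com/team/smollm2-vllm "Can you read me?"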

vLLM-compatible models on Docker Hub

vLLM models are distributed in the safetensors format, and some early safetensors models are already available on Docker Hub, such as the ai/smollm2-vllm model used in the examples above.

Available Now: x86_64 with Nvidia

Our initial release is optimized for and available on systems running the x86_64 architecture with Nvidia GPUs. Our team has dedicated its efforts to creating a rock-solid experience on this platform, and we’re confident you’ll feel the difference.

What’s Next?

This launch is just the beginning. Our vLLM roadmap is focused on two key areas: expanding platform access and continuous performance tuning.

WSL2/Docker Desktop compatibility: We know that a seamless “inner loop” is critical for developers. We are actively working to bring the vLLM backend to Windows via WSL2. This will let you build, test, and prototype high-throughput AI applications in Docker Desktop with the same workflow you use in Linux environments, starting with Windows machines with Nvidia GPUs.

DGX Spark compatibility: We are optimizing Model Runner for different kinds of hardware and are working to add compatibility for Nvidia DGX Spark and other DGX systems.

Performance Optimization: We’re also actively tracking areas for improvement. While vLLM offers incredible throughput, we recognize that its startup time is currently slower than llama.cpp’s. This is a key area we are looking to optimize in future enhancements to improve the “time-to-first-token” for rapid development cycles.

Thank you for your support and patience as we grow.

How You Can Get Involved

The strength of Docker Model Runner lies in its community, and there’s always room to grow. We need your help to make this project the best it can be. To get involved, you can:

Star the repository: Show your support and help us gain visibility by starring the Docker Model Runner repo.

Contribute your ideas: Have an idea for a new feature or a bug fix? Create an issue to discuss it. Or fork the repository, make your changes, and submit a pull request. We’re excited to see what ideas you have!

Spread the word: Tell your friends, colleagues, and anyone else who might be interested in running AI models with Docker.

We’re incredibly excited about this new chapter for Docker Model Runner, and we can’t wait to see what we can build together. Let’s get to work!
Source: https://blog.docker.com/feed/

Amazon CloudFront now supports TLS 1.3 for origin connections

Amazon CloudFront now supports TLS 1.3 when connecting to your origins, providing enhanced security and improved performance for origin communications. This upgrade offers stronger encryption algorithms, reduced handshake latency, and better overall security posture for data transmission between CloudFront edge locations and your origin servers.

TLS 1.3 support is automatically enabled for all origin types, including custom origins, Amazon S3, and Application Load Balancers, with no configuration changes required on your part. TLS 1.3 provides faster connection establishment through a reduced number of round trips during the handshake process, delivering up to 30% improvement in connection performance when your origin supports it. CloudFront will automatically negotiate TLS 1.3 when your origin supports it, while maintaining backward compatibility with lower TLS versions for origins that haven’t yet upgraded.

This enhancement benefits applications requiring high security standards, such as financial services, healthcare, and e-commerce platforms that handle sensitive data. TLS 1.3 support for origin connections is available at no additional charge in all CloudFront edge locations. To learn more about CloudFront origin TLS, see the Amazon CloudFront Developer Guide.
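
One quick way to check whether an origin is ready to benefit is to force a TLS 1.3 handshake against it with curl; origin.example.com below is a placeholder for your origin's hostname:

# Require TLS 1.3 for the connection; the verbose output shows the negotiated version.
curl -sv --tlsv1.3 -o /dev/null https://origin.example.com/
# A successful handshake typically logs a line such as "SSL connection using TLSv1.3",
# while an origin limited to older protocol versions will fail to connect.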
Source: aws.amazon.com

Amazon Braket introduces spending limits feature for quantum processing units

Amazon Braket now supports spending limits, enabling customers to set spending caps on quantum processing units (QPUs) to manage costs. With spending limits, customers can define maximum spending thresholds on a per-device basis, and Amazon Braket automatically validates that each task submission doesn’t exceed the pre-configured limits. Tasks that would exceed remaining budgets are rejected before creation. For comprehensive cost management across all of Amazon Web Services, customers should continue to use the AWS Budgets feature as part of AWS Cost Management.

Spending limits are particularly valuable for research institutions managing quantum computing budgets across multiple users, for educational environments preventing accidental overspending during coursework, and for development teams experimenting with quantum algorithms. Customers can update or delete spending limits at any time as their requirements change. Spending limits apply only to on-demand tasks on quantum processing units and do not include costs for simulators, notebook instances, hybrid jobs, or tasks created during Braket Direct reservations.

Spending limits are available now in all AWS Regions where Amazon Braket is supported, at no additional cost. Researchers at accredited institutions can apply for credits to support experiments on Amazon Braket through the AWS Cloud Credits for Research program. To get started, visit the Spending limits page in the Amazon Braket console and read our launch blog post.
Source: aws.amazon.com

Amazon EC2 Mac instances now support Apple macOS Tahoe

Starting today, customers can run Apple macOS Tahoe (version 26) as Amazon Machine Images (AMIs) on Amazon EC2 Mac instances. Apple macOS Tahoe is the latest major macOS version and introduces multiple new features and performance improvements over prior macOS versions, including support for Xcode version 26.0 or later (which includes the latest SDKs for iOS, iPadOS, macOS, tvOS, watchOS, and visionOS).

Backed by Amazon Elastic Block Store (EBS), EC2 macOS AMIs are AWS-supported images that are designed to provide a stable, secure, and high-performance environment for developer workloads running on EC2 Mac instances. EC2 macOS AMIs include the AWS Command Line Interface, Command Line Tools for Xcode, Amazon SSM Agent, and Homebrew. The AWS Homebrew Tap includes the latest versions of AWS packages included in the AMIs.

Apple macOS Tahoe AMIs are available for Apple silicon EC2 Mac instances and are published to all AWS Regions where Apple silicon EC2 Mac instances are available today. Customers can get started with macOS Tahoe AMIs via the AWS Console, Command Line Interface (CLI), or API. Learn more about EC2 Mac instances here or get started with an EC2 Mac instance here. You can also subscribe to EC2 macOS AMI release notifications here.
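
As a rough sketch of the CLI path, launching an Apple silicon Mac instance means allocating a Dedicated Host first and then starting an instance from a macOS Tahoe AMI on that host; the AMI ID, host ID, Availability Zone, key pair, and mac2.metal instance type below are placeholders to adapt to your account and Region:

# Allocate a Dedicated Host for an Apple silicon Mac instance type.
aws ec2 allocate-hosts --instance-type mac2.metal --availability-zone us-east-1a --quantity 1

# Launch an instance from a macOS Tahoe AMI onto the host returned above.
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type mac2.metal \
  --placement "HostId=h-0123456789abcdef0" \
  --key-name my-key-pair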
Source: aws.amazon.com

AWS Glue supports additional SAP entities as zero-ETL integration sources

AWS Glue now supports full snapshot and incremental load ingestion for new SAP entities using zero-ETL integrations. This enhancement introduces full snapshot data ingestion for SAP entities that lack complete change data capture (CDC) functionality, while also providing incremental data loading capabilities for SAP entities that don’t support the Operational Data Provisioning (ODP) framework. These new features work alongside existing capabilities for ODP-supported SAP entities, giving customers the flexibility to implement zero-ETL data ingestion strategies across diverse SAP environments.

Fully managed AWS zero-ETL integrations eliminate the engineering overhead associated with building custom ETL data pipelines. This new zero-ETL functionality enables organizations to ingest data from multiple SAP applications into Amazon Redshift or the lakehouse architecture of Amazon SageMaker to address scenarios where SAP entities lack deletion tracking flags or don’t support the ODP framework. Through full snapshot ingestion for entities without deletion tracking and timestamp-based incremental loading for non-ODP systems, zero-ETL integrations reduce operational complexity while saving organizations weeks of engineering effort that would otherwise be required to design, build, and test custom data pipelines across diverse SAP application environments.

This feature is available in all AWS Regions where AWS Glue zero-ETL is currently available. To get started with the enhanced zero-ETL coverage for SAP sources, refer to the AWS Glue zero-ETL user guide.
Source: aws.amazon.com