Custom MCP Catalogs and Profiles: Advancing Enterprise MCP Adoption

We’re excited to announce the general availability of Custom Catalogs and Profiles for managing Model Context Protocol (MCP) servers. These two complementary capabilities fundamentally change how teams package, distribute, and manage AI tooling. 

Custom MCP Catalogs let organizations curate and distribute approved collections of MCP servers. MCP Profiles enable individual developers to easily build, run, and share their MCP tools and configurations across projects and teams.

In this post, we’ll walk through how to create your own custom catalog – building on and improving our previous approach. We’ll also introduce Profiles, a new primitive that lets you define portable, named groupings of MCP servers. Profiles are designed to solve several practical use cases today, while giving us a foundation to expand in the future.

Creating custom catalogs with Docker

As organizations adopt MCP, we consistently hear the same need: teams need a way to curate a trusted list of MCP servers, including internally built servers.

To address these needs, we built Custom Catalogs. Instead of every team member searching for MCP servers across the open internet, organizations can publish and distribute catalogs that define approved servers. This allows developers to centrally discover and use trusted MCP servers within organizational boundaries.

Custom Catalogs can reference servers from Docker’s MCP Catalog, community sources, and custom MCP servers developed internally, bringing flexibility, control, and trust together in a single experience. We will show you how to do that with a Custom Catalog. 

Step-by-step: Building and sharing a custom MCP catalog 

In this example, we will create a Custom Catalog containing servers from the Docker MCP Catalog and an MCP server we created ourselves from the CLI. Then we will show you how to use Docker Desktop to import the catalog.

All the functionality we will show can be exercised through the CLI, while a subset of primarily user-centric features can be exercised through Docker Desktop.

Here, we will use my personal Docker Hub ID roberthouse224 in the commands, but you should adapt to use your information where appropriate (e.g. pushing an image).

Step 1: Creating my custom MCP server and pushing it to Docker Hub

We built a reference server called roll-dice (GitHub Repository). It is a regular MCP server that communicates over stdio and can be built as a Docker image. The image has already been built and pushed to Docker Hub.

We can create the metadata that describes the server including where the image can be found and save it to a file named mcp-dice.yaml to be used when creating our catalog.

name: roll-dice
title: Roll Dice
type: server
image: roberthouse224/mcp-dice:latest
description: An mcp server that can roll dice

Step 2: Creating a catalog that includes servers from the Docker MCP Catalog alongside a server you have built yourself

Now we can create a custom catalog containing servers from the Docker MCP Catalog and the MCP server we created ourselves.

docker mcp catalog create roberthouse224/our-catalog \
  --title "Our Catalog" \
  --server catalog://mcp/docker-mcp-catalog/playwright \
  --server catalog://mcp/docker-mcp-catalog/github-official \
  --server catalog://mcp/docker-mcp-catalog/context7 \
  --server catalog://mcp/docker-mcp-catalog/atlassian \
  --server catalog://mcp/docker-mcp-catalog/notion \
  --server catalog://mcp/docker-mcp-catalog/markitdown \
  --server file://./mcp-dice.yaml
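If you script catalog creation, for example from a CI job, the same command can be assembled programmatically. The following is a minimal Python sketch that only builds the argument list. It uses just the --title and --server flags shown above, and the helper name build_catalog_create_args is hypothetical.

```python
import shlex


def build_catalog_create_args(reference, title, servers):
    """Build the argv for `docker mcp catalog create` from a list of server refs."""
    args = ["docker", "mcp", "catalog", "create", reference, "--title", title]
    for server in servers:
        # Each server gets its own --server flag, as in the command above.
        args += ["--server", server]
    return args


DOCKER_CATALOG = "catalog://mcp/docker-mcp-catalog/"
servers = [DOCKER_CATALOG + name for name in (
    "playwright", "github-official", "context7",
    "atlassian", "notion", "markitdown",
)]
servers.append("file://./mcp-dice.yaml")  # the locally defined roll-dice server

args = build_catalog_create_args("roberthouse224/our-catalog", "Our Catalog", servers)
print(shlex.join(args))  # review the command before running it
# To execute it: subprocess.run(args, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with the title.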

Step 3: Verifying the MCP servers in the custom catalog 

We can now list our catalogs and see the catalog that we created:

docker mcp catalog list

We can also inspect the contents of the catalog:

docker mcp catalog show roberthouse224/our-catalog --format yaml

Step 4: Share the catalog

At the moment our custom catalog only lives on our machine. But what we have – and this is really powerful – is an immutable OCI artifact containing our trusted MCP servers that can be easily shared.

We can push our catalog to a container registry; in this example, we’re using Docker Hub. Now, anyone who has access to your organization’s namespace can access the catalog.

docker mcp catalog push roberthouse224/our-catalog

Using a custom MCP catalog

Now that our custom catalog has been shared, colleagues can import it from within Docker Desktop (or from the CLI using docker mcp catalog pull).

Import the catalog from Docker Desktop by selecting “Import catalog,” and then specifying the OCI reference in the dialog.

Figure 1: Importing a custom catalog from OCI reference

The catalog is now browsable. You can double click into the catalog and see all of the servers contained within it. Notice the custom MCP server that we added named “Roll Dice.”

Figure 2: A custom MCP catalog within the Docker Desktop app, including a newly added “Roll Dice” server.

To make this a private catalog all you need to do is manage access to the repository the way you always have for container images – no new infrastructure to manage or systems to learn.

This is exactly what Jim Clark was describing in his post Private MCP Catalogs and the Path to Composable Enterprise AI.

This simple pattern can be extended to support more complex use cases. For example, you might use a private container registry instead of Docker Hub, or connect to a remote MCP server over streamable HTTP you host yourself rather than running a containerized server as shown in the example.

Now that we have a shareable custom catalog of trusted MCP servers we can shift focus to how individuals can effectively leverage MCP servers from the catalog we built in their workflows.

Using Profiles to create and share MCP Workflows

With MCP Profiles, developers can organize workflows efficiently and maintain separate server collections and configurations for different use cases. Profiles can be shared across teams, enabling collaboration on server setups and ensuring consistent configurations for teams working within the same projects or contexts.

Switch between Profiles

At a basic level, a Profile is a named grouping of MCP servers that can be connected to an agent session. This makes it straightforward to define different Profiles for different ways of working.

Now let’s see an example in action. 

We create a profile named coding and another named planning. We browse our custom catalog, select the MCP servers that we want (e.g. Playwright, GitHub, and Context7) then select the “Add to” drop down, and select “New profile”.

Figure 3: Selecting MCP servers to be added to a new profile

Give the profile a name, select the client you want to connect to, and select “Create”.

Figure 4: Creating a new MCP profile named coding in Docker Desktop.

From the Profiles tab, we can see the profile we just created. Our client is connected and our tools are ready to use. 

Figure 5: Example of a profile that is connected to a client.

Next we create a profile named planning with servers relevant to planning (e.g. Atlassian, Markitdown, Notion). 

Navigate back to “our-catalog” (if not already there), select the servers relevant to planning, and select “Add to” -> “New profile.” Give the profile a name (e.g. planning). Then select “Create” to create the planning profile without a client. Specifying the client is optional.

Figure 6: Example of creating multiple profiles, including separate profiles for coding and planning 

Now we have two profiles that mirror two modes of working. When we switch to planning mode we only want the tools from our planning profile to be in context. To do that, we can easily reassign our client to the planning profile.

Figure 7: Reassign Claude Code to the planning profile.

If we go back to coding mode, we just reassign our client back to the coding profile. You can have any number of Profiles that mirror your many ways of working and easily switch between them, keeping only the tools you care about in context.

This will work with any agent, not just Claude Code. Profiles provide a truly portable way to manage your MCP server setups and avoid vendor lock-in.

Persist configuration

You can avoid repeatedly configuring MCP servers by using a Profile. Profiles add a persistence layer for MCP server configurations. When an MCP server exposes configurable options, you can define them once in a Profile and reload them as needed, avoiding repeated configuration.

In this example, we are specifying which paths Markitdown can access.

Figure 8: Using an MCP profile to save server configurations for reuse

Context windows can easily fill up when the MCP servers you use export a lot of tools. With Profiles you can specify which tools are enabled, making sure only the tools you need for a specific task are used.

Here we enable the get_me tool from the GitHub MCP server and disable all the others. All the other tools will not show up in our agent session or contribute to the context window.

Figure 9: Optimize your context window by enabling only the tools you need in the MCP profile
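Conceptually, a Profile can be pictured as a named set of servers plus a per-server tool allowlist. The sketch below is an illustrative Python model only, not Docker's implementation or storage format. The Profile class, its fields, and every tool name other than get_me are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    name: str
    # server name -> every tool that server exposes
    servers: dict = field(default_factory=dict)
    # server name -> tools explicitly enabled; a missing entry means "all enabled"
    enabled: dict = field(default_factory=dict)

    def tools_in_context(self):
        """Return the (server, tool) pairs an agent session would see."""
        visible = []
        for server, tools in self.servers.items():
            allow = self.enabled.get(server)
            for tool in tools:
                if allow is None or tool in allow:
                    visible.append((server, tool))
        return visible


coding = Profile(
    name="coding",
    servers={
        "github-official": ["get_me", "hypothetical_tool_a", "hypothetical_tool_b"],
        "context7": ["hypothetical_lookup"],
    },
    enabled={"github-official": {"get_me"}},  # disable every other GitHub tool
)

for server, tool in coding.tools_in_context():
    print(server, tool)
```

Only the enabled GitHub tool and the unrestricted context7 tool remain visible, which is the effect the allowlist has on the context window.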

This model of saved configuration becomes far more powerful for MCP servers you build in-house. By exposing richer configuration options, you can reuse the same server across projects, reconfigure its behavior per context, and achieve more predictable outcomes.

Share Profiles

Identifying MCP servers and configurations that work well for a project doesn’t need to be repeated by every team member. Once you’ve found a setup that works, share it with the rest of the team.

To share a Profile you can push it as an OCI artifact to a container registry just like we did with our custom catalog. Just provide a name for it along with an OCI reference.

docker mcp profile push coding [your-namespace]/coding

For someone to pull it down, all they have to do is issue the corresponding pull command.

docker mcp profile pull [your-namespace]/coding

Although the example above demonstrates sharing Profiles across a team, the concept extends naturally to agents as well. An agent skill could, for instance, reference a Profile and pull in the required MCP servers and their configurations as dependencies.

Conclusion and What’s Next 

As MCP adoption grows, the challenge isn’t access to tools — it’s coordination. Teams need a way to standardize what’s trusted and supported without constraining how individuals actually work. Custom Catalogs and Profiles are designed to solve exactly that problem.

Custom Catalogs: shared foundation

Custom Catalogs allow platform and admin teams to define approved MCP servers, bundle internal and public tooling together, and distribute those choices as a single, portable artifact. This creates clarity and consistency while significantly reducing the cost of discovery and evaluation.

Profiles: supercharge workflow

Profiles give individual developers a lightweight way to assemble, configure, and reuse MCP servers for specific contexts like coding, planning, or research. Profiles persist configuration, limit context to what matters, and make effective setups easy to share across teams.

Together, these primitives separate:

What an organization recommends (via Custom Catalogs)

How people work day to day (via Profiles)

This separation enables a healthy balance. Platform teams can publish “golden paths” that establish standards and guardrails, while developers retain the freedom to adapt, experiment, and compose profiles that fit their needs.

The result is a system that is portable, composable, and scalable — making MCP easier to adopt, safer to manage, and more effective as it grows across an organization.

What’s Next?

Custom Catalogs and Profiles are the foundation for managing MCP at scale, and we’re just getting started. Next, we’re focused on extending these primitives to support stronger governance, better reuse, and more advanced agent workflows:

Governance and policy controls to restrict MCP usage to approved Custom Catalogs and trusted server sources

Improved discoverability and sharing for both Catalogs and Profiles, making proven setups easier to find and reuse across teams

Expanded Profile-scoped secrets and configuration, providing a more secure and flexible alternative to project-level mcp.json files

Clear best practices for Profiles, including saving dynamic MCP server configurations for reuse and pairing Profiles with emerging workflow optimizations like agent skills

Getting started with Custom Catalogs and Profiles

If you have Docker Desktop 4.56, you are already using Catalogs: our Docker MCP Catalog is now distributed as an OCI artifact. Profiles are supported starting with Docker Desktop 4.63. Try creating your first Profile by exploring the MCP Toolkit in Docker Desktop.

Learn more

Dive into our documentation on Custom Catalogs and Profiles to get started quickly.

Explore Docker’s MCP Catalog and Toolkit on our website.

Ready to go hands-on? Open Docker Desktop or the CLI and start using MCP to streamline and automate your development workflows.

Source: https://blog.docker.com/feed/

NIST Narrows the NVD: What Container Security Programs Should Reassess

On April 15, NIST announced a prioritized enrichment model for the National Vulnerability Database. Most CVEs will still be published, but fewer will receive the CVSS scores, CPE mappings, and CWE classifications that container scanners and compliance programs have historically relied on.

The change formalizes a drift that has been visible to anyone pulling NVD feeds for the past two years. What shifted on April 15 is the expectation: NIST has now said plainly that it does not intend to return to full-coverage enrichment. For programs that built scanning, prioritization, and SLA workflows around the assumption that NVD sits as the authoritative secondary layer on top of CVE, that assumption is worth a structured review.

What changed

Three categories of CVEs will continue to receive full enrichment:

CVEs in CISA’s Known Exploited Vulnerabilities catalog, targeted within one business day

CVEs affecting software used within the federal government

CVEs affecting “critical software” as defined by Executive Order 14028

Everything else moves to a new “Not Scheduled” status. Organizations can request enrichment by emailing nvd@nist.gov, though no service-level timeline applies. NIST has also stopped duplicating CVSS scores when the submitting CNA provides one, and all unenriched CVEs published before March 1, 2026 have been moved into “Not Scheduled.”
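The triage rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: how NIST actually determines each category is not specified here, so the three boolean fields are assumed inputs, and the Cve class and the CVE identifiers are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Cve:
    cve_id: str
    in_cisa_kev: bool = False        # listed in CISA's Known Exploited Vulnerabilities catalog
    federal_software: bool = False   # affects software used within the federal government
    eo14028_critical: bool = False   # affects "critical software" per Executive Order 14028


def enrichment_status(cve):
    """Any one of the three categories qualifies; everything else is 'Not Scheduled'."""
    if cve.in_cisa_kev or cve.federal_software or cve.eo14028_critical:
        return "Full enrichment"
    return "Not Scheduled"


examples = [
    Cve("CVE-EXAMPLE-0001", in_cisa_kev=True),
    Cve("CVE-EXAMPLE-0002"),
]
for cve in examples:
    print(cve.cve_id, enrichment_status(cve))
```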

The NIST volumes behind the decision

NIST cited a 263% increase in CVE submissions between 2020 and 2025, with Q1 2026 running roughly a third higher than the same period a year earlier. The rise tracks with a broader expansion in CVE numbering: more CNAs, more open source projects running their own disclosure processes, and more tooling surfacing issues that would not have reached CVE a few years ago.

Year | Published CVEs | Source
2023 | ~29,000 | CVE.org
2024 | ~40,000 | CVE.org
2025 | ~48,000 | NIST
2026 (forecast) | ~59,500 (median) | FIRST

AI is a compounding factor on both sides of this curve. In January, curl founder Daniel Stenberg shut down the project’s HackerOne bug bounty after six and a half years, citing “death by a thousand slops”: AI-generated reports that read like real research but described vulnerabilities that didn’t exist. Node.js, Django, and others have tightened intake under similar pressure. On the signal side, Anthropic’s April announcement of Claude Mythos Preview described a model that autonomously discovered thousands of zero-day vulnerabilities across every major operating system and web browser, including a 17-year-old unauthenticated RCE in FreeBSD’s NFS server. Earlier Anthropic research documented Claude Opus 4.6 finding and validating more than 500 high-severity vulnerabilities in production open source.

More noise and more real signal are heading toward the same pipeline. NIST enriched roughly 42,000 CVEs in 2025, its highest annual total, and still fell further behind incoming volume.
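To put the volume figures in proportion, here is a short worked calculation in Python. It is a rough sketch only: the counts are the approximate figures quoted above, they come from different sources, and the 2025 shortfall ignores any existing backlog.

```python
# Approximate published-CVE counts from the table; 2026 is FIRST's median forecast.
published = {2023: 29_000, 2024: 40_000, 2025: 48_000, 2026: 59_500}


def yoy_growth(series):
    """Year-over-year growth in percent, keyed by the later year."""
    years = sorted(series)
    return {
        later: (series[later] - series[earlier]) / series[earlier] * 100
        for earlier, later in zip(years, years[1:])
    }


growth = yoy_growth(published)
for year, pct in growth.items():
    print(f"{year}: {pct:+.1f}%")

# A 263% increase means the 2025 figure is 1 + 2.63 = 3.63 times the 2020 figure.
multiplier = 1 + 263 / 100

# Rough 2025 shortfall: CVEs published minus CVEs enriched (both approximate).
shortfall_2025 = published[2025] - 42_000
print(f"multiplier: {multiplier:.2f}x, 2025 shortfall: ~{shortfall_2025:,}")
```

On these rounded figures, year-over-year growth comes out at roughly 38%, 20%, and 24%, and the 2025 gap between published and enriched CVEs is on the order of 6,000.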

How it lands in compliance

The operational question is what programs have to document when NVD scoring is not available, and how consistently that documentation holds up across assessments.

Framework | NVD reference | Likely effect
FedRAMP | NVD CVSSv3 as original risk rating, with CVSSv2 and native scanner score as documented fallbacks | More variance in how remediation SLAs are applied across CSPs
PCI-DSS 4.0 | Req. 11.3.2 external scans reference CVSS; ASV guidance points to NVD | More ambiguity on pass/fail determinations for unscored findings
NIST SP 800-53 (RA-5) | Lists NVD as an example source; permissive language | Lower direct impact, though auditors commonly expect CVSS-based severity evidence
DORA / SOC 2 | No direct reference | Principles-based; audit expectations around severity rationale still apply

None of these frameworks break on their own. Mature vulnerability management programs generally have language in their SSPs and risk registers covering fallback scoring and exception handling. Programs that do not will likely need it before their next audit cycle.
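A documented fallback order is easier to defend in an audit when it is applied mechanically and the basis is recorded next to each finding. The Python sketch below follows the FedRAMP order described above (NVD CVSSv3, then CVSSv2, then the scanner's native score). The function name and the returned field names are hypothetical, and a real program would also need to handle the different scales these scores may use.

```python
def resolve_score(nvd_cvss3=None, nvd_cvss2=None, scanner_score=None):
    """Pick the first available score in the documented fallback order.

    Returns the score together with its basis so the rationale can be
    stored alongside the finding.
    """
    candidates = (
        ("NVD CVSSv3", nvd_cvss3),
        ("NVD CVSSv2", nvd_cvss2),
        ("scanner native score", scanner_score),
    )
    for basis, score in candidates:
        if score is not None:
            return {"score": score, "basis": basis}
    # Nothing available: flag for manual rating rather than guessing.
    return {"score": None, "basis": "unscored; needs manual risk rating"}


print(resolve_score(nvd_cvss3=7.5, scanner_score=6.0))
print(resolve_score(scanner_score=6.0))
print(resolve_score())
```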

The gap that is relevant to the container ecosystem

Two NVD inputs matter most for container scanning:

CPE applicability statements map a CVE to specific software packages. When CPE strings are missing, a scanner that matches primarily on CPE cannot determine which packages in an image are affected. The CVE exists in the database but is operationally invisible to the scan.

CVSS scores drive prioritization and SLA routing. Without a score, a CVE may surface as UNKNOWN severity or fall outside remediation workflows entirely.

Container images create a compounding effect here. Each image inherits packages from a base layer, application dependencies, and often a long transitive dependency chain. When any of those packages carries a CVE that NVD has not enriched, the gap propagates through every downstream image built on top of it. Scanners that draw on multiple advisory sources, and that match on package identifiers other than CPE, are less exposed.
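The difference between CPE-based and package-identifier matching can be shown with a toy example. The Python sketch below is a simplification under stated assumptions: the SBOM entries, the advisory, and the CVE identifier are made up, and real scanners match on version ranges rather than exact strings.

```python
# Hypothetical SBOM entries; package names and versions are invented.
sbom = [
    {"purl": "pkg:deb/debian/examplelib@1.2.3",
     "cpe": "cpe:2.3:a:example:examplelib:1.2.3:*:*:*:*:*:*:*"},
    {"purl": "pkg:npm/example-pkg@4.5.6", "cpe": None},
]

# Hypothetical advisory for a CVE that NVD has not enriched.
advisory = {
    "id": "CVE-EXAMPLE-0001",
    "cpes": [],  # no CPE applicability statements available
    "purls": ["pkg:deb/debian/examplelib@1.2.3"],  # supplied by another advisory source
}


def match_by_cpe(packages, adv):
    """Packages whose CPE string appears in the advisory's CPE list."""
    return [p for p in packages if p["cpe"] and p["cpe"] in adv["cpes"]]


def match_by_purl(packages, adv):
    """Packages whose Package URL appears in the advisory's PURL list."""
    return [p for p in packages if p["purl"] in adv["purls"]]


print("CPE matches: ", [p["purl"] for p in match_by_cpe(sbom, advisory)])
print("PURL matches:", [p["purl"] for p in match_by_purl(sbom, advisory)])
```

With no CPE data, the CPE matcher returns nothing even though the affected package is in the image, while the PURL matcher still finds it.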

Questions worth putting to image vendors

What advisory sources does your tooling use beyond NVD?

When a CVE has no NVD CVSS score, what does the tool display, and does it trigger remediation workflows?

How do you define “patched,” and is that definition in your written CVE policy?

Are your remediation SLAs measured from CVE disclosure date or NVD enrichment date?

Can a third-party scanner reproduce your clean-scan result against public advisory data?
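The SLA question is easiest to appreciate with dates. The following Python sketch uses entirely hypothetical values (the disclosure date, the enrichment date, and a 30-day SLA) to show how far a due date moves when the clock starts at NVD enrichment instead of disclosure.

```python
from datetime import date, timedelta


def sla_due(start, sla_days):
    """Due date for remediation given the day the SLA clock starts."""
    return start + timedelta(days=sla_days)


disclosed = date(2026, 4, 1)   # hypothetical CVE disclosure date
enriched = date(2026, 5, 15)   # hypothetical NVD enrichment date
SLA_DAYS = 30                  # hypothetical remediation SLA

due_from_disclosure = sla_due(disclosed, SLA_DAYS)
due_from_enrichment = sla_due(enriched, SLA_DAYS)
slip = (due_from_enrichment - due_from_disclosure).days

print("due (from disclosure):", due_from_disclosure)
print("due (from enrichment):", due_from_enrichment)
print("additional exposure in days:", slip)
```

For a CVE in "Not Scheduled" status the enrichment date may never arrive, in which case an enrichment-based SLA clock would never start at all.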

Where Docker sits

Docker Hardened Images are designed so that vulnerability management in container workloads does not depend primarily on NVD enrichment. Each image ships with signed attestations for build provenance, SBOMs in both CycloneDX and SPDX formats, OpenVEX exploitability statements, and scan results. SBOMs are generated from the SLSA Build Level 3 pipeline rather than inferred from external databases, so package inventory is accurate regardless of NVD’s enrichment state. Hardened System Packages allow package-level patching independent of upstream distribution timelines, which means remediation is not gated on a distribution maintainer’s release cadence or on an NVD analyst’s queue. When a CVE is not exploitable in a specific image context, that assessment is published as a signed VEX document that third-party scanners including Trivy, Grype, and Wiz consume natively.

Docker Scout, the scanning layer that reads these attestations, aggregates 22 advisory sources including NVD, CISA KEV, EPSS, GitHub Advisory Database, and 13 Linux distribution security trackers. Scout matches on Package URLs (PURLs) rather than NVD’s CPE scheme, which allows package identification to continue when CPE strings are unavailable. NVD remains a valuable input to this architecture, one of several rather than the spine.

What to reassess

Audit open findings against the March 1, 2026 cutoff. Any CVE published before that date that has not received NVD enrichment has already been moved to “Not Scheduled.” Programs carrying open findings tied to those CVEs may have severity scores and CPE mappings in their trackers that no longer reflect an active NVD record. Verify that the scoring basis for those findings is documented and defensible independent of NVD.
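A first pass at this audit can be a simple filter over the findings tracker. The Python sketch below assumes a hypothetical record shape (cve, published, nvd_enriched) and placeholder CVE identifiers; adapt the field names to whatever your tracker exports.

```python
from datetime import date

# Unenriched CVEs published before this date were moved to "Not Scheduled".
NOT_SCHEDULED_CUTOFF = date(2026, 3, 1)


def needs_scoring_review(finding):
    """True when the CVE predates the cutoff and never received NVD enrichment."""
    return finding["published"] < NOT_SCHEDULED_CUTOFF and not finding["nvd_enriched"]


# Hypothetical open findings.
findings = [
    {"cve": "CVE-EXAMPLE-0001", "published": date(2025, 11, 3), "nvd_enriched": False},
    {"cve": "CVE-EXAMPLE-0002", "published": date(2025, 11, 3), "nvd_enriched": True},
    {"cve": "CVE-EXAMPLE-0003", "published": date(2026, 3, 20), "nvd_enriched": False},
]

to_review = [f["cve"] for f in findings if needs_scoring_review(f)]
print(to_review)
```

Findings flagged this way are the ones whose scoring basis should be documented independently of NVD.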

For programs running DHI, the NVD policy change does not require an operational response. For programs evaluating container security vendors more broadly, the question worth elevating in the next procurement cycle is whether NVD is one source of vulnerability intelligence in their stack, or the primary one.

The NVD will continue to play a role. That role is narrowing, and the signals suggest it will keep narrowing. Programs that use the April announcement as a prompt to audit their data sources now will have a cleaner answer the next time a regulator, an auditor, or a board asks where their vulnerability data actually comes from.

Sources and further reading

NIST, “NIST Updates NVD Operations to Address Record CVE Growth,” April 15, 2026 https://www.nist.gov/news-events/news/2026/04/nist-updates-nvd-operations-address-record-cve-growth

FIRST, “2026 CVE Vulnerability Forecast” https://www.first.org/blog/20260211-vulnerability-forecast-2026

FedRAMP Vulnerability Scanning Requirements v3.0 https://www.fedramp.gov/docs/rev5/playbook/csp/continuous-monitoring/vulnerability-scanning/

Docker Scout advisory database sources https://docs.docker.com/scout/deep-dive/advisory-db-sources/

Docker Hardened Images documentation https://docs.docker.com/dhi/

“Why We Chose the Harder Path: Docker Hardened Images, One Year Later” https://www.docker.com/blog/why-we-chose-the-harder-path-docker-hardened-images-one-year-later/


Docker AI Governance: Unlock Agent Autonomy, Safely

Introducing Docker AI Governance: centralized control over how agents execute, what they can reach on the network, which credentials they can use, and which MCP tools they can call, so every developer in your company can run AI agents safely, wherever they work.

Your laptop is the new prod

Agents are the biggest productivity unlock the modern workplace has seen in a generation, and engineering is where the shift is most obvious. Developers aren’t using agents to autocomplete a function anymore. They’re using them to read whole codebases, refactor across services, and ship entire products, end to end. Vibe coding is real, it’s shipping to main, and it’s happening on laptops everywhere today.

The same shift is moving through every other function. A new class of agents called Claws is already in production, sending emails, managing calendars, booking travel, pulling CRM data, reconciling reports, and querying production systems. Marketing, finance, sales, and support are adopting them as fast as engineering is, because the productivity gains are too large to ignore and the companies that move first will out-execute the ones that don’t. Org-wide rollouts that used to take quarters are landing in weeks.

What’s more interesting than the speed of adoption is where all of this actually runs. Agents and Claws live outside the systems enterprises spent two decades hardening. They don’t sit behind your CI/CD pipeline, they don’t live inside your VPC, and they don’t follow your IAM model. They run on the developer’s machine, with the developer’s credentials, reaching into private repos, production APIs, customer records, and the open internet, often in the same session. The laptop just became the most powerful node in your enterprise, and it also became the most exposed. Laptop and agent environments are the new prod, and they need to be governed like prod.

What governance actually has to solve

The instinct in most enterprises is to reach for the tools that already exist, but none of them see what an agent is doing. CI/CD doesn’t see it because the agent isn’t a pipeline. The VPC doesn’t see it because the laptop is outside the perimeter. IAM doesn’t see it because the agent is acting as the developer. The result is that CISOs can’t tell what an agent touched, what it ran, or where the data went, and they also can’t tell the business to slow down. This is the bind every security leader is in right now.

Strip the problem to first principles and an agent has two paths to do significant harm. It either executes code itself, touching files and opening network connections, or it calls a tool through an MCP server to act on an external system. Govern both paths and you’ve governed the agent. Miss either one and you haven’t.

That’s the test for any AI governance solution worth taking seriously, and it has two parts. The controls have to live at the runtime layer where the agent actually executes, not as advisory rules layered on top that a clever prompt can route around. And they have to work consistently wherever the agent ends up running, because agents don’t stay on the laptop. They migrate to CI runners, to staging clusters, to production. A policy that only holds in one of those places is a gap waiting to be found.

Why Docker

Docker is the only company that meets both parts of that test, and the reason is structural.

Docker built the sandbox that contains the first path. Every agent session runs inside a microVM-based isolated environment where filesystem and network access are controlled by a hard boundary, which means enforcement happens at the level of the process, not as a suggestion the agent can ignore. Docker built the MCP Gateway that contains the second path. Every tool call routes through a single chokepoint where it can be authenticated, authorized, and logged before it reaches the external system. Because these controls, Docker Sandboxes and Docker MCP Gateway, operate at the primitive level, enforcement is strict instead of advisory. We own the substrate the agent is running on, so the policy isn’t a wrapper around someone else’s runtime, it’s the runtime.

The second part is what makes this durable. The same sandbox primitive runs on the developer’s laptop, inside Kubernetes, and across cloud environments, with the same policy model and the same enforcement guarantees. When an agent moves from a developer’s machine to a CI runner to a production cluster, the policy moves with it, because the runtime underneath is the same in all three places. No other vendor can say that, because no other vendor is the runtime. Endpoint security tools don’t extend to clusters. Cluster security tools don’t reach the laptop. Cloud security tools don’t run on either. Docker covers all three because Docker is what’s actually executing the agent in all three.

Docker AI Governance is the control plane that sits on top of that runtime. It turns the sandbox and the MCP Gateway into centralized policy, defined once in the admin console, enforced at every node the agent touches, and auditable from end to end.

How Docker AI Governance works

From a single admin console, security teams define and enforce policy across four control surfaces: network, filesystem, credentials, and MCP tools. It is one policy layer that needs no per-machine setup and works consistently across thousands of developers.

Sandbox policy for network and filesystem. Admins define allow and deny rules for domains, IPs, and CIDRs, alongside mount rules for filesystem paths with read-only or read-write scope. Every agent session runs inside an isolated sandbox where only approved endpoints are reachable and only approved directories are mountable, with enforcement happening at the proxy and mount level rather than as an advisory layer the agent can ignore.

Credential governance. Agents are dangerous in proportion to what they can authenticate as, so Docker AI Governance controls which credentials, tokens, and secrets an agent session can see, scopes them to the duration of that session, and blocks exfiltration to unapproved destinations. Developers stop pasting tokens into prompts, and security stops wondering where those tokens ended up.

MCP tool governance. Admins control which MCP servers and tools are available through organization-wide managed policies, with unapproved servers blocked by default. Every MCP call flows through the same policy engine as network, filesystem, and credential requests, so there’s no separate surface to configure and no bypass path.

Role-based policy assignment. Different teams need different levels of access, and security research will reasonably require broader MCP usage than finance. Create policy groups, assign users through your IdP, and layer team-specific rules on top of organization-wide guardrails that can’t be overridden. It scales to thousands of developers through existing SAML and SCIM integrations with no per-user setup.

Audit and visibility. Every policy evaluation generates a structured event with user identity, timestamp, session context, and the rule that triggered the decision, and logs export cleanly to your existing SIEM and compliance systems. This is the evidence CISOs need to approve AI usage at scale rather than tolerate it under the table.

Automatic policy propagation. When a developer authenticates, their machine pulls the latest policy, and updates reach every device automatically. Admins define policy once and Docker enforces it everywhere.
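To make the network rules and audit events above concrete, here is a small Python sketch of allow/deny evaluation with a structured event per decision. It is an illustration only and not Docker's policy format or engine: the POLICY layout, the rule names, the example domains, and the choice that deny rules take precedence are all assumptions.

```python
import ipaddress
from datetime import datetime, timezone
from fnmatch import fnmatch

# Hypothetical policy: domain patterns plus allowed CIDR ranges.
POLICY = {
    "deny_domains": ["*.internal.example.com"],
    "allow_domains": ["api.github.com", "*.example.com"],
    "allow_cidrs": ["10.20.0.0/16"],
}


def evaluate(target, policy=POLICY):
    """Return (decision, rule). Deny rules win; anything unmatched is denied."""
    try:
        ip = ipaddress.ip_address(target)
    except ValueError:
        ip = None  # not an IP literal, treat as a domain name
    if ip is not None:
        for cidr in policy["allow_cidrs"]:
            if ip in ipaddress.ip_network(cidr):
                return "allow", f"allow_cidrs:{cidr}"
        return "deny", "default-deny"
    for pattern in policy["deny_domains"]:
        if fnmatch(target, pattern):
            return "deny", f"deny_domains:{pattern}"
    for pattern in policy["allow_domains"]:
        if fnmatch(target, pattern):
            return "allow", f"allow_domains:{pattern}"
    return "deny", "default-deny"


def audit_event(user, session, target):
    """Structured record of one policy evaluation, ready to ship to a SIEM."""
    decision, rule = evaluate(target)
    return {
        "user": user,
        "session": session,
        "target": target,
        "decision": decision,
        "rule": rule,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


for target in ("api.github.com", "db.internal.example.com", "10.20.3.4", "8.8.8.8"):
    print(audit_event("dev@example.com", "session-1", target))
```

Recording the matching rule in every event is what lets an auditor trace a blocked or allowed request back to the policy line that produced it.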

What this unlocks

CISOs get the governance layer they’ve been missing and the confidence to approve agent usage at scale rather than block it. Platform teams get an easy way to set up governance: define a policy once and have it enforced everywhere with full auditability. This removes the operational burden of scaling AI adoption across the company. Developers get what agents promised in the first place: real speed and autonomy, with governance that stays out of the way. We built Docker AI Governance with these principles in mind: agents should be autonomous and governance should be invisible.

Available today

Docker AI Governance is available now. If you’re a security leader trying to close the AI governance gap, or a platform team ready to roll out agents without compromising control, it was built for you.

Contact sales to learn more.


Comparing Different Approaches to Sandboxing

“AI agents will become the primary way we interact with computers in the future. They will be able to understand our needs and preferences, and proactively help us with tasks and decision making.”

Satya Nadella
CEO of Microsoft

Whether you are a software engineer, a product manager, or a designer, this quote should fundamentally change how we approach our daily routine. We are no longer just building interfaces; we are creating environments where agents can operate autonomously with minimal human interaction. What could be the fundamental requirement for such an environment?

In a single word: Isolation.

A user interacting with traditional software is constrained by the actions it allows. Agents, however, are non-deterministic, prone to hallucination, and susceptible to prompt injection. Once you give an AI write access to your systems, there is nothing stopping it from executing an rm -rf to delete all your data. Of course, there are different ways to solve this problem, with one approach being sandboxing: running the agent in an isolated, controlled environment used for experimentation and testing without affecting the surrounding system.

So, I started exploring different strategies to sandbox the agents. Starting with a bare minimum setup and going all the way to setting up a cloud VM. Here is what I learned at each step.

1. Let’s Start with the Baseline

Chroot has been the traditional way to achieve file system isolation. It works well when you want the process to think that a specific, restricted directory is the absolute root of the machine.

However, there are two major caveats.

If the process inside the chroot has root privileges, it could break out.

While it offers file isolation, process isolation is still a problem. A malicious agent can still see other processes running on your system and try to kill them.

For example, running ls /proc inside the chroot still shows all the processes running on the host.

This is when I learnt about systemd-nspawn, also called “chroot on steroids”. The difference between chroot and systemd-nspawn is that the latter provides isolation at the network and process levels in addition to the file system.

Now, when I run the same ls /proc in the systemd-nspawn mybox container, I see only the processes in the mybox container. That is process-level isolation.

Pros

Lightweight compared to other container runtimes like Docker, so it offers faster startup times.

Native support in Linux.

Caveats

systemd-nspawn is not very popular in the developer community unless you are deep into Linux.

While this works for Linux, what if you need to run your agents on Windows? You will have to find alternatives depending on the platform.

2. Are Containers Enough?

Another technology that comes to mind when thinking about isolated environments is Docker. And unlike the previous concepts we discussed, Docker has a broader ecosystem and a strong community.

With containers, you also get isolated file systems, network interfaces, and process trees. They also come with cross-platform support across Mac, Windows, and Linux. With all these advantages, creating and running agents across different platforms becomes very easy, which makes containers an obvious choice.

However, the model becomes more complex when containers become a dev platform for agents. More often than not, agents need to execute generated code in separate environments, which in practice means spinning up new Docker containers on demand. This introduces a container-in-container pattern (Docker-in-Docker), where an agent running inside a container needs to build and run other containers. 

To make Docker-in-Docker work, we would have to run the container in privileged mode (--privileged), which gives the container processes elevated permissions and dramatically weakens the isolation guarantees. As a result, complete isolation for agents using only containers becomes tricky.

3. Do Virtual Machines Help?

As you might have already predicted, Virtual Machines (VMs) offer the strongest isolation. With a VM, you get an entire OS, file system, and network of your own. For example, I currently run macOS and use Lima, a Linux VM, to run Linux-specific workloads.

However, the tradeoff is that spinning up a VM is expensive, and doing this for every agent does not scale. Here are some rough numbers comparing the cost of a VM with systemd-nspawn and chroot.

| Approach | Per Agent Cost | Boot Time | 10 Agents |
| --- | --- | --- | --- |
| VM (Lima) | ~4GB RAM + 4 CPU | 30-60s | ~40GB RAM |
| systemd-nspawn | ~10MB RAM | < 1s | ~100MB RAM |
| chroot | 1MB RAM | instant | ~10MB RAM |
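The "10 Agents" column is the per-agent figure scaled linearly. A minimal sketch of that arithmetic, using the approximate per-agent RAM numbers from the table above (the dictionary and function names are illustrative, and the figures are rough estimates rather than measurements):

```python
# Approximate per-agent RAM from the table above, in megabytes.
# These are the article's rough figures, not benchmark results.
PER_AGENT_RAM_MB = {
    "VM (Lima)": 4 * 1024,   # ~4GB
    "systemd-nspawn": 10,    # ~10MB
    "chroot": 1,             # 1MB
}


def total_ram_mb(approach: str, agents: int) -> int:
    """Estimate total RAM by scaling the per-agent cost linearly."""
    return PER_AGENT_RAM_MB[approach] * agents


if __name__ == "__main__":
    for approach in PER_AGENT_RAM_MB:
        print(f"{approach}: ~{total_ram_mb(approach, 10)} MB for 10 agents")
```

Linear scaling is a simplification; real usage depends on the workload inside each sandbox.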

For example, the screenshot below shows the resources it takes to run a Lima VM.

4. MicroVMs to the Rescue

A MicroVM (micro virtual machine) felt like the perfect answer to the isolation story. So what is a MicroVM, and what makes it better?

A MicroVM is a lightweight virtualisation technology that provides the strong security and isolation of a traditional VM along with the speed of a container.

Strong security and isolation are enabled because a MicroVM gets its own kernel, aka the Guest Kernel, unlike containers, which use a shared kernel. Because of this, any compromise inside the Guest OS does not directly affect the host or the other VMs.

Speed: unlike traditional VMs, it is provisioned with minimal hardware (no USB or PCI buses) and bypasses BIOS/UEFI boot, significantly reducing device emulation overhead and startup latency.

Amazon open-sourced Firecracker in 2018, one of the earliest adoptions of the MicroVM architecture. While this helped popularize MicroVMs, Firecracker is restricted to Linux environments, and most agentic orchestration tends to happen on developers’ laptops, which often run macOS or Windows.

Docker addressed this gap with its Sandbox offering. The best part is its MicroVM-based architecture, which runs natively across macOS, Windows, and Linux, delivering better isolation, faster startup times, and a smoother developer experience. We will learn about this in a bit.

5. gVisor

gVisor takes a unique approach to solving the isolation problem. While the previous strategies rely on the OS kernel, gVisor provides its own kernel, called the “application kernel”, that runs in user space.

When a standard containerized app wants to do something like open a file, allocate memory, or send network traffic, it makes a “system call” (syscall) directly to the host’s Linux kernel.

With gVisor, your app is bundled with a component called the Sentry.

The Sentry intercepts every single syscall your application makes.

It processes that request in user-space using its own implementation of Linux networking, file systems, and memory management.

If the Sentry absolutely needs the host kernel to do something (like actual disk I/O), it translates the request into an extremely restricted, heavily filtered, safe call to the host.
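The interception flow above can be illustrated with a toy model. This is a conceptual sketch only, not gVisor's actual implementation: the real Sentry intercepts genuine system calls, while here the class name ToySentry, the in-memory file table, and HOST_ALLOWLIST are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical allowlist: the only operations the toy sentry will ever
# forward to the "host". Everything else is served in user space or denied.
HOST_ALLOWLIST = {"disk_read"}


class ToySentry:
    """Toy stand-in for a user-space kernel that mediates every call."""

    def __init__(self, host_call: Callable[[str, str], str]) -> None:
        self._host_call = host_call        # the restricted path to the host
        self._files: Dict[str, str] = {}   # in-memory file system state
        self.forwarded: List[str] = []     # audit log of host operations

    def syscall(self, name: str, arg: str = "", data: str = "") -> str:
        # Served entirely in user space: the host never sees these.
        if name == "write":
            self._files[arg] = data
            return "ok"
        if name == "read" and arg in self._files:
            return self._files[arg]
        # Needs the host: translate into a narrow, allowlisted operation.
        if name == "read":
            return self._forward("disk_read", arg)
        raise PermissionError(f"syscall {name!r} is not supported")

    def _forward(self, op: str, arg: str) -> str:
        if op not in HOST_ALLOWLIST:
            raise PermissionError(f"host operation {op!r} is filtered")
        self.forwarded.append(op)
        return self._host_call(op, arg)


if __name__ == "__main__":
    sentry = ToySentry(host_call=lambda op, arg: f"<host bytes for {arg}>")
    sentry.syscall("write", "/tmp/note", "hello")
    print(sentry.syscall("read", "/tmp/note"))   # served in user space
    print(sentry.syscall("read", "/etc/hosts"))  # forwarded to the host
    print(sentry.forwarded)
```

The point of the design is that the application never talks to the host directly: every request either stays inside the user-space layer or passes through one small, auditable choke point.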

However, it suffers from the same problems as systemd-nspawn: it lacks broad community support and only runs on Linux.

Docker Sandbox

With Docker Sandboxes, AI coding agents run in isolated microVM environments. The performance is as seamless as it can be, identical to running on the host, but with significantly stronger isolation and security. This means you can run your autonomous agents without worrying about host compromise or unintended access to your local environment. 

Sandbox achieves this level of security through three layers of isolation:

Hypervisor Isolation

Every Sandbox has its own Linux kernel, so anything that affects the sandbox kernel will not affect the host or other sandbox kernels.

Network Isolation

Each Sandbox has its own isolated network, meaning multiple sandboxes cannot communicate with each other or with the host.

In addition, network policies can be enforced to allow or disallow traffic from a source.

Docker Engine Isolation

This is what made me fall in love with this new architecture. Every Sandbox gets its own Docker Engine. As a result, whenever the agent runs docker pull or docker compose, those commands are executed against the internal engine rather than the external Docker daemon.

Because of this, agents running inside can only see Docker services within their sandbox and nothing else, adding an additional layer of security.

| Attribute | Traditional VM | Container | Docker MicroVM |
| --- | --- | --- | --- |
| Isolation | Strong (dedicated kernel) | Weak (shared kernel) | Strong (dedicated kernel) |
| Boot time | Minutes | Milliseconds | Seconds (after the first image pull) |
| Attack Surface | Large | Medium | Minimal |

To demonstrate Docker Engine isolation, I created two Sandbox sessions, ran the Docker hello-world container image in one, and then ran docker ps -a in both.

As you can see from the screenshot below, one session has the hello-world container and the other does not. This is possible because each session runs its own Docker Engine daemon.

More on the Sandbox architecture here: https://www.docker.com/blog/why-microvms-the-architecture-behind-docker-sandboxes/

Conclusion

If there is one takeaway, it’s this: isolation plays a major role when building autonomous AI agents because the blast radius of a security mistake is significant.

Each approach we explored so far solves a different piece of the isolation puzzle. Containers improve portability and developer experience, but inherit the risks of a shared kernel. Virtual Machines deliver strong isolation, but the overhead doesn’t scale when you’re spinning up dozens of agents. gVisor sits in an interesting middle ground, though compatibility and community trade-offs might slow you down.

Among all these, what makes Docker Sandbox with MicroVMs compelling is how it unifies these dimensions: VM-level security, container-like startup speed, and a workflow developers already know. Per-sandbox Docker Engines and strict network boundaries make it a strong foundation for running untrusted, autonomous workloads at scale.

So, what are you waiting for? Go ahead and try it out today.

For macOS: brew install docker/tap/sbx

For Windows: winget install Docker.sbx

Source: https://blog.docker.com/feed/

Generate Images Locally with Docker Model Runner and Open WebUI

We’ve all been there: you need to generate a few images for a project, you fire up an AI image service, and suddenly you’re wondering what happens to your prompts, how many credits you have left, or why that “safe content” filter rejected your perfectly reasonable request for a dragon wearing a business suit. What if you could skip all of that and run the whole thing on your own machine, with a slick chat UI on top?

That’s exactly what Docker Model Runner now makes possible. With a couple of commands you can pull an image-generation model, connect it to Open WebUI, and start generating images right from a chat interface: fully local, fully private, fully yours.

Let’s build it. Your own private DALL-E, no cloud subscription required.

What You’ll Need

Docker Desktop (macOS) or Docker Engine (Linux)

~8 GB of free RAM for a small model (more is better)

GPU: optional but highly recommended. NVIDIA (CUDA) and Apple Silicon (MPS) are supported, with CPU as a fallback

If you can run docker model version without errors, you’re good to go.

How Docker Model Runner works with Open WebUI

Before we dive in, here’s the big picture:

Docker Model Runner acts as the control plane. It downloads the model, manages the inference backend lifecycle, and exposes a 100% OpenAI-compatible API — including the POST /v1/images/generations endpoint that Open WebUI already knows how to talk to.

Step 1: Pull an Image Generation Model

Docker Model Runner uses a compact packaging format called DDUF (Diffusers Unified Format) to distribute image generation models through Docker Hub, just like any other OCI artifact.

Pull a model to get started:

docker model pull stable-diffusion

You can confirm it’s ready:

docker model inspect stable-diffusion

{
  "id": "sha256:5f60862074a4c585126288d08555e5ad9ef65044bf490ff3a64855fc84d06823",
  "tags": [
    "docker.io/ai/stable-diffusion:latest"
  ],
  "created": 1768470632,
  "config": {
    "format": "diffusers",
    "architecture": "diffusers",
    "size": "6.94GB",
    "diffusers": {
      "dduf_file": "stable-diffusion-xl-base-1.0-FP16.dduf",
      "layout": "dduf"
    }
  }
}

What’s happening under the hood? The model is stored locally as a DDUF file, a single-file format that bundles all the components of a diffusion model (text encoder, VAE, UNet/DiT, scheduler config) into one portable artifact. Docker Model Runner knows how to unpack it at runtime.
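If you want to check these fields from a script, the inspect output is plain JSON. A minimal sketch, assuming the output has the shape shown above (the helper name summarize_model is illustrative, and other models may expose different fields):

```python
import json

# Sample taken from the `docker model inspect stable-diffusion` output above.
SAMPLE = """
{
  "id": "sha256:5f60862074a4c585126288d08555e5ad9ef65044bf490ff3a64855fc84d06823",
  "tags": ["docker.io/ai/stable-diffusion:latest"],
  "created": 1768470632,
  "config": {
    "format": "diffusers",
    "architecture": "diffusers",
    "size": "6.94GB",
    "diffusers": {
      "dduf_file": "stable-diffusion-xl-base-1.0-FP16.dduf",
      "layout": "dduf"
    }
  }
}
"""


def summarize_model(inspect_json: str) -> dict:
    """Pull out the fields that tell you which model and file you have."""
    info = json.loads(inspect_json)
    config = info.get("config", {})
    return {
        "tags": info.get("tags", []),
        "format": config.get("format"),
        "size": config.get("size"),
        # Present for the diffusers-format model in this sample; None otherwise.
        "dduf_file": config.get("diffusers", {}).get("dduf_file"),
    }


if __name__ == "__main__":
    print(summarize_model(SAMPLE))
```

To run this against a live install, you could pass in the captured output of docker model inspect stable-diffusion instead of the sample string.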

Step 2: Launch Open WebUI

This is a magic trick. Docker Model Runner has a built-in launch command that knows exactly how to wire up Open WebUI against the local inference endpoint:

docker model launch openwebui

That’s it. Behind the scenes this runs:

docker run --rm \
  -p 3000:8080 \
  -e OPENAI_API_BASE=http://model-runner.docker.internal/engines/v1 \
  -e OPENAI_BASE_URL=http://model-runner.docker.internal/engines/v1 \
  -e OPENAI_API_KEY=sk-docker-model-runner \
  ghcr.io/open-webui/open-webui:latest

The model-runner.docker.internal hostname is a special DNS entry that Docker Desktop containers use to reach the Model Runner running on the host, no port-forwarding gymnastics required. If you use Docker CE, you’ll see the docker/model-runner container address instead of model-runner.docker.internal.

Open your browser at http://localhost:3000, create a local account (it stays offline), and you’ll land on the chat interface.

Tip: Want to run it in the background? Add --detach:

docker model launch openwebui --detach

Prefer Docker Compose? See the full setup here: https://docs.docker.com/ai/model-runner/openwebui-integration/

Step 3: Configure Open WebUI for Image Generation

Open WebUI already uses Docker Model Runner for text chat automatically (it reads the OPENAI_API_BASE env var). For image generation you need to point it at the images endpoint too, a 30-second job in the settings UI.

Go to http://localhost:3000/admin/settings/images

Enable Image Generation

Fill in the fields:

| Field | Value |
| --- | --- |
| Model | stable-diffusion |
| API Base URL | http://model-runner.docker.internal/engines/diffusers/v1 |
| API Key | whatever-you-want |

Click Save.

Why the dummy API key? Docker Model Runner doesn’t require authentication because it’s a local service. The key is only there because Open WebUI’s form requires one. Any non-empty string works.

Step 4: Pull a Chat Model

Open WebUI is also a full-featured chat interface, and one of its best tricks is letting you ask the LLM to generate an image right from the conversation. For that to work, you need a language model too.

# Lightweight option — runs on almost any machine
docker model pull smollm2

# Recommended — more capable, better at understanding creative prompts
docker model pull gpt-oss

Both will show up automatically in the Open WebUI model selector. Use smollm2 if you’re tight on RAM, or gpt-oss if you want richer, more creative responses before image generation.

No extra configuration is needed: Open WebUI picks up text models from the same OPENAI_API_BASE endpoint it was already configured with.

Step 5: Generate Your First Image

Head back to the main chat view. You’ll notice a small image icon in the message input bar.

Click it to toggle image generation mode, type your prompt, and send.

Try something like:

Create an image of a whale.

The first request takes a little longer while the backend loads the model into memory. After that, subsequent images generate much faster.

Open WebUI will automatically route image-generation requests to the diffusers backend and text requests to the language model, seamlessly, in the same conversation.

Step 6: Generate Images Directly via the API

For developers who want to integrate image generation into their own apps, Docker Model Runner exposes the standard OpenAI Images API directly:

curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffusion",
    "prompt": "A cat sitting on a couch",
    "size": "512x512"
  }'

The response follows the OpenAI Images API format exactly:

{
  "created": 1742990400,
  "data": [
    {
      "b64_json": "/9j/4AAQSkZJRgABAQAAAQABAAD/2wBD…"
    }
  ]
}

Decode and save the image:

curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffusion",
    "prompt": "A cat sitting on a couch",
    "size": "512x512"
  }' | jq -r '.data[0].b64_json' | base64 -d > cat.png

open cat.png
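The same request, decode, and save flow can be done from Python with only the standard library. A minimal sketch, assuming the endpoint and response shape shown above; the function names request_image and save_first_image are illustrative.

```python
import base64
import json
import urllib.request

# Endpoint taken from the curl example above.
URL = "http://localhost:12434/engines/diffusers/v1/images/generations"


def request_image(prompt: str, size: str = "512x512") -> dict:
    """POST a generation request and return the parsed JSON response."""
    body = json.dumps(
        {"model": "stable-diffusion", "prompt": prompt, "size": size}
    ).encode("utf-8")
    req = urllib.request.Request(
        URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def save_first_image(response: dict, path: str) -> int:
    """Decode data[0].b64_json and write it to disk; returns bytes written."""
    image_bytes = base64.b64decode(response["data"][0]["b64_json"])
    with open(path, "wb") as f:
        f.write(image_bytes)
    return len(image_bytes)


# Usage (requires Docker Model Runner to be running locally):
#   result = request_image("A cat sitting on a couch")
#   save_first_image(result, "cat.png")
```

Keeping the network call and the decoding in separate functions makes the decoding step easy to test without a running backend.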

Advanced Parameters

The API supports all the parameters you’d expect from a full diffusers pipeline:

curl http://localhost:12434/engines/diffusers/v1/images/generations \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "stable-diffusion",
    "prompt": "A serene Japanese zen garden, cherry blossoms, koi pond, photorealistic",
    "negative_prompt": "blurry, low quality, distorted, watermark",
    "size": "768x512",
    "n": 2,
    "num_inference_steps": 30,
    "guidance_scale": 7.5,
    "seed": 42,
    "response_format": "b64_json"
  }' | jq -r '.data[0].b64_json' | base64 -d > garden.png

| Parameter | What it does |
| --- | --- |
| prompt | What you want in the image |
| negative_prompt | What you want to avoid |
| size | Resolution as WIDTHxHEIGHT (e.g., 512x512, 768x512) |
| n | Number of images to generate (1–10) |
| num_inference_steps | More steps = higher quality, slower (default: 50) |
| guidance_scale | How closely to follow the prompt (1–20, default: 7.5) |
| seed | Integer for reproducible results; omit for random |

Pro tip: Set a seed while you’re iterating on a prompt. Once you’re happy with the composition, remove it to get unique variations.
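When calling the endpoint from code, it can help to validate parameters before sending them. A minimal sketch that builds the request body using the ranges from the table above; the function name and the validation rules are illustrative assumptions, and the server remains the authority on what it actually accepts.

```python
import re
from typing import Optional


def build_payload(
    prompt: str,
    size: str = "512x512",
    n: int = 1,
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    negative_prompt: Optional[str] = None,
    seed: Optional[int] = None,
) -> dict:
    """Build an images/generations request body, checking documented ranges."""
    # Requires a plain ASCII "x"; a typographic multiplication sign is rejected.
    if not re.fullmatch(r"\d+x\d+", size):
        raise ValueError("size must look like WIDTHxHEIGHT, e.g. 512x512")
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    if not 1 <= guidance_scale <= 20:
        raise ValueError("guidance_scale must be between 1 and 20")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be positive")

    payload = {
        "model": "stable-diffusion",
        "prompt": prompt,
        "size": size,
        "n": n,
        "num_inference_steps": num_inference_steps,
        "guidance_scale": guidance_scale,
        "response_format": "b64_json",
    }
    # Optional fields are left out when unset; omitting the seed gives a
    # random result, as noted in the table.
    if negative_prompt is not None:
        payload["negative_prompt"] = negative_prompt
    if seed is not None:
        payload["seed"] = seed
    return payload


if __name__ == "__main__":
    print(build_payload("A serene Japanese zen garden", size="768x512", seed=42))
```

Following the pro tip, you would pass seed=42 while iterating on a prompt and drop the argument once the composition looks right.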

Under the Hood: How the Diffusers Backend Works

When you first request an image, Docker Model Runner:

Unpacks the DDUF file: extracts the model components and loads them via DiffusionPipeline.from_pretrained()

Starts a FastAPI server: this is the server that Open WebUI and your curl commands talk to through Docker Model Runner

The server is installed on first use by downloading a self-contained Python environment from Docker Hub (version-pinned, so updates are explicit). It lives at ~/.docker/model-runner/diffusers/ — no Python version conflicts, no virtualenv setup.

Troubleshooting

The model takes forever to load on first use. That’s normal: the model weights are being loaded from disk and transferred to GPU memory. Subsequent requests in the same session are much faster because the backend stays warm.

I get a “No model loaded” 503 error. Make sure the model is fully downloaded (docker model list) and that you’re sending the correct model name in the model field.

Image quality is poor or generations are too fast. Increase num_inference_steps (try 20–50 steps). Higher values = slower but sharper results.

Open WebUI can’t connect to the image endpoint. Double-check the URL in Admin Panel → Settings → Images. Inside a Docker container it must be http://model-runner.docker.internal/engines/diffusers/v1, not localhost.

Conclusion and What’s Next

Docker Model Runner makes local image generation simple. It packages and serves image models through an OpenAI-compatible API, while Open WebUI provides an easy chat interface on top. Together, they let you generate images privately on your own machine, either through the browser or directly through the API, without relying on a cloud service.

This feature opens up a lot of possibilities:

Multimodal workflows: Chat with a text model about an idea, then immediately generate an image of it — in the same Open WebUI conversation

RAG + image generation: Build a pipeline that generates illustrations for your documents

Custom models: The diffusers backend supports any DDUF-packaged model, so you can package your own fine-tuned models using Docker’s model packaging tools

The Docker Model Runner team is actively expanding model support on Docker Hub. Check docker model search for the latest available models.

Source: https://blog.docker.com/feed/

Precision Container Security with Docker and Black Duck

The complexity of modern containerized applications often leaves developers drowning in a sea of “noise”: vulnerabilities that exist in the file system but pose zero actual risk to the application. The integration between Black Duck and Docker Hardened Images (DHI) provides a definitive answer to this challenge. By combining Docker’s secure-by-default foundations and VEX (Vulnerability Exploitability eXchange) statements with Black Duck’s industry-leading analysis engines, teams can now automatically separate base-layer noise from application-layer risk.

TL;DR: The Black Duck + Docker Value Proposition

Zero-Config Recognition: Black Duck automatically identifies DHI base images during scanning without manual tagging.

Precision Triage: Leverage Docker-provided VEX data and Black Duck Security Advisories (BDSAs) to ignore “not affected” base image vulnerabilities.

Comprehensive Vulnerability Intelligence: Combine Docker’s exploitability data with Black Duck’s proprietary research to reduce triage costs and eliminate false positives.

Compliance on Autopilot: Export high-fidelity SBOMs enriched with VEX exploitability status, supporting the vulnerability transparency obligations in global regulations like the European Cyber Resilience Act (CRA) and in industry standards such as those the FDA mandates for medical devices and those required by governmental agencies.

A Comprehensive Strategy for Software Integrity

Black Duck’s strategy for container security is built on a “Better Together” philosophy, leveraging two distinct but complementary analysis technologies to provide 360-degree visibility:

Black Duck Binary Analysis (BDBA): Our primary integration for DHI was released on April 14, 2026. BDBA provides deep, signature-based inspection of compiled assets within DHI, verifying the “as-shipped” state of your containers without needing access to source code.

Black Duck Software Composition Analysis (SCA): Soon, Black Duck will extend this DHI identification and verification support to our flagship SCA platform. This upcoming release will unify DHI intelligence with source-side dependency management, providing a single, comprehensive Software Bill of Materials (SBOM) across the entire SDLC.

Deep Visibility with Binary Match & SCA Roadmap

While traditional scanners often rely on simple package manager manifests, Black Duck looks deeper.

Signature-Based Accuracy: Using BDBA, Black Duck identifies DHI components by their binary “fingerprint,” ensuring accuracy even if package metadata is stripped or modified.

The Path to Unified SCA: Our roadmap includes bringing these DHI insights directly into Black Duck SCA. This will allow security teams to apply the same governance policies to DHI-based containers as they do to their application source code, all within a single pane of glass.

Layer-Specific Analysis: Easily pivot between the hardened base image and your custom application layers to understand exactly where a risk was introduced.

Dynamic Risk Triage: VEX + BDSA Intelligence

The most significant drain on developer productivity is manual triage. This integration operationalizes “Reachability” and “Exploitability” through automated data streams:

VEX Integration: Black Duck ingests Docker’s VEX statements as a primary source of truth. If Docker confirms a base image vulnerability is “not_affected” due to the hardening process, Black Duck automatically suppresses the alert.

Beyond the NVD: While competitors rely on the National Vulnerability Database (NVD), Black Duck uses BDSAs. These advisories often arrive days before the NVD, providing deeper exploitability context and specific remediation paths.

Bulk Policy Enforcement: Security teams can set global Black Duck policies to automatically “ignore” any vulnerability backed by a “not_affected” vulnerability status statement from Docker, potentially clearing thousands of non-actionable alerts with zero manual effort.
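The suppression logic described above amounts to joining scanner findings against VEX statements. A minimal sketch, assuming simplified, hypothetical record shapes (a list of vulnerability IDs from a scanner and a list of dicts with "vulnerability" and "status" keys). Real VEX documents and Black Duck policies carry more fields, and this is not Black Duck's implementation.

```python
from typing import Dict, List, Tuple


def triage(
    findings: List[str], vex_statements: List[Dict[str, str]]
) -> Tuple[List[str], List[str]]:
    """Split findings into (actionable, suppressed) using VEX status."""
    # Map each vulnerability ID to the status asserted by the image vendor.
    status_by_id = {s["vulnerability"]: s["status"] for s in vex_statements}

    actionable: List[str] = []
    suppressed: List[str] = []
    for vuln_id in findings:
        # Only an explicit "not_affected" suppresses a finding. Findings with
        # no statement, or with any other status, stay visible.
        if status_by_id.get(vuln_id) == "not_affected":
            suppressed.append(vuln_id)
        else:
            actionable.append(vuln_id)
    return actionable, suppressed


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    findings = ["CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"]
    vex = [
        {"vulnerability": "CVE-0000-0001", "status": "not_affected"},
        {"vulnerability": "CVE-0000-0002", "status": "under_investigation"},
    ]
    print(triage(findings, vex))
```

Defaulting to "actionable" when no statement exists is the conservative choice: a missing VEX entry should never hide a finding.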

Operationalizing Security with Automated Workflows

Black Duck does more than find issues; it manages the lifecycle of the container:

SLA Tracking: Automatically trigger Jira tickets or email alerts when a vulnerability in a custom layer exceeds your organization’s risk threshold.

Pipeline Gating: Use the Black Duck Detect CLI to fail builds only when reachable or unaddressed risks are found in your application code, keeping the CI/CD pipeline moving.

Continuous Patching: For Enterprise DHI users, Black Duck verifies when a patched base image is mirrored to your private repository, confirming mitigation without requiring a developer to manually “re-scan” to prove compliance.

Get started for free

Check the Docker documentation on VEX at https://docs.docker.com/dhi/core-concepts/vex/

Learn more about Docker’s approach to CVE exploitability and auditability at https://www.docker.com/blog/why-we-chose-the-harder-path-docker-hardened-images-one-year-later/

Read Black Duck’s VEX documentation at https://documentation.blackduck.com/bundle/bd-hub/page/Reporting/vexReport_global.html

Source: https://blog.docker.com/feed/

A Virtual Agent team at Docker: How the Coding Agent Sandboxes team uses a fleet of agents to ship faster

I work on Coding Agent Sandboxes, aka “sbx” at Docker. The project provides secure, microVM-based isolation for running AI coding agents like Claude Code, Gemini, Codex, Docker Agent and Kiro. Agents get full autonomy inside a sandbox (their own Docker daemon, network, filesystem) without touching your host system. Over the past couple of weeks, we built something on top of it: a virtual team of seven AI agent roles that test the product, triage issues, post release notes, and even fix bugs, all running autonomously in CI. We call it the Fleet.

The Fleet is built on Claude Code skills: markdown files that give an agent a persona, a set of responsibilities, and the tools it’s allowed to use. Think of a skill not as a script that says “run these steps,” but as a role description that says “you are the build engineer, here’s what you know and how you make decisions.” That distinction matters because agents need judgment, not just instructions. When a test fails unexpectedly, a script stops. A role investigates.

The same skill file, the same behavior, whether it runs on a developer’s laptop or in CI.

Local First, CI Second

Coding Agent Sandboxes is a CLI tool (sbx) that manages sandbox lifecycles: create, start, stop, remove, configure networking, mount workspaces, and more. It runs on macOS, Linux, and Windows. Every release needs testing across all three platforms, across upgrade paths between versions, and under sustained load to catch resource leaks. The team also needs daily visibility into what shipped, and a way to triage the growing issue backlog without it becoming a full-time job.

We could have written traditional test scripts and reporting tools. Instead, we built agent roles that handle these tasks autonomously, both on our laptops and in CI.

The design principle behind the Fleet is simple: every skill runs on your machine first.

When we built the /cli-tester skill (the Fleet’s exploratory tester, more on that below), we didn’t start by writing a GitHub workflow. We started by invoking it locally. We watched it build the binaries, exercise the CLI commands, find issues, and report them. We tweaked the skill until it did the right thing in our terminal. Only then did we wire it into a workflow.

This matters because the alternative is painful. If you build CI-only agents, you debug them through commit-push-wait-read-logs cycles. Every iteration takes minutes. When the skill runs locally first, the iteration takes seconds. You see the agent think. You see where it gets confused. You fix the skill file, re-invoke, and try again.

CI is just another runtime for the same skill. The /cli-tester that runs nightly on macOS, Linux, and Windows runners is the exact same skill we invoke from our terminals. The workflow sets up the environment, checks out the code, and calls the skill. That’s it. No separate “CI version.” No translation layer. One skill, two runtimes.

This is what makes the Fleet practical. You’re not maintaining two systems. You’re maintaining one set of skills and a set of workflows that invoke them.

The Roster

The skills directory has 20 skills in total. Most are foundational knowledge (architecture, code style, Go conventions, security, testing patterns). Seven of them are the Fleet: the roles that run autonomously on CI. Each one is a SKILL.md file that describes a persona, not a procedure.

/build-engineer is the foundation that other skills stand on. It references topic files for building binaries, container templates, and local installs. It knows the Taskfile.yml, the docker-bake.hcl, and the platform-specific build flags. It doesn’t run on CI by itself. Other skills load it when they need to compile anything.

/project-manager is the team’s memory. It deduplicates findings against existing issues and PRs before creating new ones, manages the GitHub Projects board (setting status, priority, and labels), and handles interactive triage when running locally. On CI, it switches to fully automatic mode: no questions asked, just deduplicate and create. It uses GraphQL pagination to scan the entire project board, not just the first page. Every other skill that discovers something calls the project-manager before opening an issue.

/product-owner translates commit-speak into human language. It collects merged PRs from a date range, categorizes them (New Features, Bug Fixes, Improvements, Documentation, Maintenance), and rewrites each one in plain English. “feat(cli): add TZ env passthrough” becomes “Docker Sandboxes now automatically use your local timezone.” On CI, it outputs Slack Block Kit JSON. Locally, it renders a markdown table. It filters out noise from bots (Dependabot bumps, workflow-only changes) and skips posting when there’s nothing meaningful to report.

/cli-tester is the exploratory tester of the Fleet, and it’s the largest skill by far. Unlike traditional test scripts that assert expected output and fail on any deviation, the cli-tester investigates what it finds. When output doesn’t match expectations, it asks why before filing a bug.

It defines 52+ test scenarios organized into 14 tiers: Core Lifecycle, Agent Smoke, Workspace, Network Policy, Sandbox Features, Blueprint, CLI UX, Environment, Code Tasks, Agent Network, Reliability, Collaboration, Error Recovery, and Human-Only (skipped in CI). It builds the binaries through the build-engineer, triages findings through the project-manager, and loads product scenarios defined by the actual Product Manager on the team. It monitors disk space during testing, posts an executive summary to Slack when it finishes, and runs nightly on CI across macOS, Linux, and Windows.

It also powers a slash command on GitHub. When someone comments /cli-tester-review on a pull request, CI spins up three runners (macOS, Linux, and Windows), each loading the skill to exercise the PR’s changes on that platform. The agents explore the code, run the scenarios, and post their findings as comments directly on the pull request.

/performance-tester runs in two modes. Lifecycle Endurance repeatedly cycles create/stop/rm to detect reliability issues and resource leaks, producing xUnit JSON output. Code Exploration Benchmark clones a real Git repository and compares host-vs-sandbox I/O performance and Claude Code session behavior. Both modes measure disk usage over time and flag regressions. The goal is catching the slow degradation that no single test run would notice.

/upgrade-tester runs a four-phase test plan. Phase A creates pre-upgrade state (sandboxes, configurations). Phase B installs the new version. Phase C verifies everything still works after the upgrade. Phase D optionally downgrades and verifies again. It takes two version tags as input, builds the binaries for each, creates VMs, and produces an executive summary with pass/fail per phase. Upgrade regressions are the kind of bug that’s invisible in a single-version test suite.

/software-engineer operates in two modes. Reactive: when someone adds the agent-fix label to a GitHub issue, a macOS runner picks it up and runs a ralph-loop to work the issue, contributing a PR with minimal, focused changes. Proactive: weekly, it runs in architect mode, scanning the codebase for quality issues, producing up to five findings, triaging them through the project-manager, then spawning three macOS runners in parallel to fix three of them. Each runner delivers a PR targeting a specific simplification or tech-debt reduction.
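To make one of these roles concrete, the categorization step that /product-owner performs could be approximated with plain rules. In the Fleet this work is done by an agent following a skill file, not by code like this; the prefix-to-category mapping, the bot login, and the function name below are all hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping from conventional-commit types to the report
# categories named in the roster above.
CATEGORY_BY_TYPE = {
    "feat": "New Features",
    "fix": "Bug Fixes",
    "perf": "Improvements",
    "refactor": "Improvements",
    "docs": "Documentation",
    "chore": "Maintenance",
    "ci": "Maintenance",
}
BOT_AUTHORS = {"dependabot[bot]"}  # assumed bot login


def categorize(prs: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group merged PR subjects by category, skipping bot-authored PRs."""
    report: Dict[str, List[str]] = defaultdict(list)
    for pr in prs:
        if pr["author"] in BOT_AUTHORS:
            continue
        # Match "type(scope): subject" or "type: subject".
        match = re.match(r"(\w+)(?:\([^)]*\))?:\s*(.+)", pr["title"])
        kind = match.group(1) if match else ""
        subject = match.group(2) if match else pr["title"]
        # Unrecognized or missing types fall back to Maintenance (an
        # arbitrary choice for this sketch).
        report[CATEGORY_BY_TYPE.get(kind, "Maintenance")].append(subject)
    return dict(report)


if __name__ == "__main__":
    merged = [
        {"title": "feat(cli): add TZ env passthrough", "author": "alice"},
        {"title": "chore(deps): bump a dependency", "author": "dependabot[bot]"},
    ]
    print(categorize(merged))
```

A rule-based version like this can group titles, but it cannot rewrite them in plain English, which is the part the agent role adds.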

Skills That Compose

Individual skills are useful. Skills that load other skills are a team.

The seven Fleet roles sit on top of thirteen foundational skills: architecture, code style, Go conventions, software design, security, testing patterns, development workflow, git worktrees, and others. The foundational skills encode project knowledge. The Fleet roles encode behavior. A Fleet role loads the foundational skills it needs, the same way a new team member reads the project’s contributing guide before writing code.

The /cli-tester doesn’t know how to build binaries. It loads the /build-engineer for that. It doesn’t know whether the bug it found is a duplicate. It loads the /project-manager to check. The tester focuses on testing. The builder focuses on building. The manager focuses on triaging. Each role stays in its lane, and the composition creates something none of them could do alone.

The /software-engineer follows the same pattern. It loads the /build-engineer so it can compile the project, and it loads coding best practices and software design conventions so its output meets the team’s standards. The skill doesn’t try to encode everything. It delegates to the foundational skills.

The /performance-tester loads the /cli-tester, extending it with duration and metrics. Instead of duplicating the testing logic, it reuses it and adds a measurement layer on top.

This is the skills-as-roles principle in practice. When you design skills as personas with clear responsibilities (instead of step-by-step commands), they compose naturally. A tester that loads a builder and a manager is doing the same thing a human tester does: asking a colleague to compile the project and checking with the PM before filing a bug. The difference is that the “asking” happens through skill composition instead of a Slack message.

The Ralph-Loop Is the Engine

The Ralph Wiggum loop is a pattern popularized by Geoffrey Huntley in 2025: a Bash loop that keeps feeding an AI coding agent the same task until the work is done. At its simplest, it’s while :; do cat PROMPT.md | claude-code ; done. Each iteration spawns a fresh agent with a clean context window. The agent reads the task, implements one piece, runs the tests, commits if they pass, and exits. The loop restarts, and the next iteration picks up where the previous one left off. Instead of hoping for first-try perfection, you design for iteration.

Our implementation of this pattern is called a Ralph-loop. The Fleet skills define what each agent role knows. The Ralph-loop defines how the iteration runs.

Our Ralph-loop is a composite GitHub Action backed by a shell script that adds a layer on top of the basic pattern: a separate worker and reviewer. It fetches the issue context, creates a working branch, and iterates: the worker implements changes and writes a summary, the reviewer evaluates the diff and decides SHIP or REVISE. If REVISE, the feedback goes back to the worker for another pass. Up to five iterations by default. If the reviewer says SHIP, the loop pushes the branch, creates a PR, and comments on the original issue.
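The worker/reviewer iteration described above can be sketched in a few lines of shell. This is a minimal sketch, not the actual ralph-loop.sh: run_worker and run_reviewer are hypothetical stubs standing in for the agent invocations, and the stub reviewer is hard-coded to ask for two revisions before shipping. Only the SHIP/REVISE verdicts and the five-iteration default come from the description.

```shell
#!/bin/sh
MAX_ITERATIONS=5

# Hypothetical worker: receives the iteration number and the previous
# feedback, and prints a summary of what it changed.
run_worker() {
  echo "summary for iteration $1 (feedback: ${2:-none})"
}

# Hypothetical reviewer: prints SHIP or REVISE.
# This stub requests two revisions, then ships.
run_reviewer() {
  if [ "$1" -ge 3 ]; then echo "SHIP"; else echo "REVISE"; fi
}

ralph_loop() {
  i=1
  feedback=""
  while [ "$i" -le "$MAX_ITERATIONS" ]; do
    summary=$(run_worker "$i" "$feedback")
    verdict=$(run_reviewer "$i" "$summary")
    if [ "$verdict" = "SHIP" ]; then
      echo "SHIP after $i iteration(s)"
      return 0
    fi
    # REVISE: hand the reviewer's feedback back to the worker.
    feedback="reviewer requested changes in iteration $i"
    i=$((i + 1))
  done
  echo "gave up after $MAX_ITERATIONS iterations"
  return 1
}

ralph_loop
```

With the stubs as written, the loop prints "SHIP after 3 iteration(s)". In a real loop, the SHIP branch is where the push, PR creation, and issue comment would happen.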

The worker and reviewer run as separate Claude invocations with different models. The worker uses Opus for implementation. The reviewer uses Opus with 1M context to evaluate the full diff against the task requirements. Each one loads the /software-engineer skill (which in turn loads the build-engineer and coding best practices), so they share the same project knowledge but apply it from different perspectives.

Separating generation from evaluation is deliberate. The same agent that wrote the code shouldn’t evaluate whether the code is good. It’s the oldest principle in quality assurance: the person who built the thing shouldn’t be the only person who tests it. The worker’s job is to solve the problem. The reviewer’s job is to decide whether the problem is actually solved.

The Ralph-loop works locally too. The same ralph-loop.sh script that CI calls can be invoked from your terminal with --issue-number 42. Locally, it parses CLI arguments instead of reading environment variables, and outputs plain text instead of streaming JSON. Same loop, same prompts, same iteration pattern. We debugged the worker and reviewer prompts on our laptops before they ever ran in CI.
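One way a single script can serve both entry points is to prefer a command-line flag and fall back to the environment. The sketch below assumes a hypothetical ISSUE_NUMBER environment variable for the CI case; the post does not name the variables the real script reads.

```shell
#!/bin/sh
# Print the issue number from --issue-number when given (local use),
# otherwise from the environment (CI use).
# ISSUE_NUMBER is a hypothetical variable name used for illustration.
resolve_issue_number() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --issue-number)
        if [ "$#" -lt 2 ]; then
          echo "missing value for --issue-number" >&2
          return 2
        fi
        echo "$2"
        return 0
        ;;
      --issue-number=*)
        echo "${1#--issue-number=}"
        return 0
        ;;
    esac
    shift
  done
  if [ -n "${ISSUE_NUMBER:-}" ]; then
    echo "$ISSUE_NUMBER"
    return 0
  fi
  echo "no issue number given" >&2
  return 1
}

resolve_issue_number --issue-number 42
```

Keeping the lookup in one function means the rest of the loop does not need to know whether it was started from a terminal or from a workflow.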

The workflows handle scheduling and triggering: nightly cron for the testers, label events for the software-engineer, weekly cron for the architect mode. The Ralph-loop handles the iteration pattern. The skills handle the domain knowledge. Three layers, each with a clear job.

This separation is what made the Fleet possible to build in a couple of weeks. We didn’t have to reinvent the automation loop for every role. The Ralph-loop already knew how to iterate. We just needed to give each role its own skill file and wire the triggers.

What the Fleet Ships

The Fleet has been running for a couple of weeks. Here’s what it delivers.

Automated issue resolution. A team member labels an issue with agent-fix. CI grabs a macOS runner, reads the issue, and starts working. The result is a pull request that addresses the issue. Not every PR lands without changes, but the first draft is there for review, often within the hour.

Daily release notes. The product-owner traverses the git log every day and posts a Slack summary for stakeholders. No one has to manually compile “what shipped this week.” The stakeholders see progress in real time, at the speed the team actually moves.

Nightly exploratory testing. The cli-tester runs every night on macOS and Windows. It loads the product scenarios that the Product Manager has defined, exercises the CLI, and opens issues for anything it finds. Before opening an issue, it checks for duplicates through the project-manager. When it finishes, it posts a Slack message with the results.

Performance and upgrade testing. The performance-tester and upgrade-tester run on CI across both platforms. Disk usage regressions, behavioral differences between sandbox and non-sandbox modes, and version compatibility issues get caught before they reach a human reporter.

Weekly tech-debt reduction. Every week, the software-engineer runs in architect mode. It reviews the codebase, identifies three spots where code can be simplified or legacy patterns can be cleaned up, spawns three parallel runners, and delivers three PRs. Each one is a small, focused improvement. Over time, they compound.

What We Don’t Automate

The Fleet creates pull requests. It does not merge them.

That’s the trust boundary, and it’s deliberate. Merge decisions stay with humans. So do architectural choices, scope decisions, and prioritization. The agents do the work. The team decides what work matters and whether the output meets the bar.

The supervision model scales the same way it works on a developer’s laptop. When we run multiple agents locally in parallel worktrees, we review their output before merging. With the Fleet, the team supervises seven agent roles running on CI. The shape of the oversight is the same: review the output, approve or adjust, move on. The difference is that the agents don’t need anyone’s laptop to start working.

The Fleet is not replacing the team. It’s extending it. Seven roles that handle repetitive, well-defined work so humans can focus on work that requires judgment, context, and taste. The Fleet has many arms, but the team still steers the ship.

What We Learnt Building the Fleet

Start with the foundation, not the flashiest skill. We started with the /cli-tester because testing the CLI felt like the highest-value target. But it needed to build binaries, triage issues, and load product scenarios, all things that depended on other skills we hadn’t written yet. We should have started with the /build-engineer, the skill everything else stands on. The second skill was better because of what we learned from the first. Don’t design the full fleet upfront.

Build locally first, deploy to CI second. The commit-push-wait-read-logs cycle is where velocity goes to die. If you can’t debug a skill in your terminal, it’s not ready for a workflow. Some behaviors only surface on CI runners (different OS, permissions, network constraints), and those iterations cost hours of wall-clock time. Minimize what can only be tested in CI.

Write skills as roles, not scripts. Ask yourself: “If a new team member joined tomorrow with this exact role, what would I tell them?” What do they need to know? What tools can they use? How should they handle ambiguity? That conversation is your SKILL.md. “You are the build engineer, here’s what you know” produces better judgment than “run these five steps.” When something unexpected happens, a role investigates. A script stops.

Compose skills like you compose teams. The /cli-tester doesn’t know how to build binaries or triage bugs. It loads the /build-engineer and /project-manager for that. Each role stays in its lane. The composition creates what none of them could do alone.

Separate generation from evaluation. The agent that wrote the code shouldn’t be the only one that reviews it. Our Ralph-loop uses a worker and a reviewer for a reason: the oldest principle in quality assurance applies to agents too.

Triage matters more than detection. The /cli-tester initially filed issues for every unexpected output. Transient failures, timing-dependent behavior, environment quirks: everything became an issue. The signal-to-noise ratio got bad enough that the team started ignoring findings. Getting the triage right (deduplication, confirming before filing) took longer than building the tester itself.

And one more thing. All Fleet agents, even on ephemeral CI runners, run inside Coding Agent Sandboxes. We test with what our users use.
Source: https://blog.docker.com/feed/

From Security Blocked to Prod Ready: ClickHouse on Docker Hardened Images

In November 2025, a team self-hosting Langfuse, an open-source LLM observability platform, on Kubernetes uploaded their ClickHouse image to AWS ECR as part of their production preparation. They found that the pipeline scanner had returned three critical vulnerabilities – not in ClickHouse, but in the base image. Their security team saw the findings and blocked the deployment before it ever reached production.

“Our security team is not allowing us to take it to production. Please suggest alternatives.”

vinaygoel586
GitHub Issue #286, November 28, 2025

If you’ve shipped containers into an enterprise environment recently, this situation will sound familiar. A perfectly functional deployment gets blocked not because something is broken, but because a scanner found CVEs in packages the application never even touches. A day goes into investigating the findings, a risk exception gets written up, and the security team rejects it anyway, because the vulnerabilities are technically real even if they’re practically irrelevant to your workload.

This post is about how Docker Hardened Images (DHI) get you unstuck when a security team blocks the deployment of a container that has CVEs. We will look specifically at the image for ClickHouse, one of the most widely pulled database images on Docker Hub.

A Quick Word on ClickHouse

ClickHouse is an open-source columnar database built for analytical workloads at scale. It can query billions of rows and return results in milliseconds, in a way that traditional row-oriented databases simply can’t match. Companies such as Cloudflare, Uber, and Spotify all run it in production. With over 100 million pulls from Docker Hub, it has become the default infrastructure choice for teams that need serious analytics throughput. The image’s default security posture, though, was designed with developer ease-of-use in mind rather than the hardening that enterprise production environments demand, and that gap is where the trouble starts.

Figure: The layered architecture of ClickHouse

How ClickHouse is Structured

ClickHouse follows a layered architecture designed for analytical speed at scale. SQL queries arrive over HTTP (port 8123) or TCP (port 9000) and pass through the optimizer, which parses each query into an abstract syntax tree and prunes it before the pipeline executor picks it up and hands the work off to parallel threads. Beneath the query layer sits the MergeTree storage engine, the heart of ClickHouse, which stores data in columnar .bin files. It uses a sparse primary index to skip irrelevant granules without reading entire columns, and runs background merge processes to compact parts and maintain query performance over time.

At the bottom, storage is pluggable: local disk, S3, HDFS, or Azure Blob, with tiered hot/warm/cold policies to balance cost and latency. In distributed deployments, ClickHouse Keeper (or ZooKeeper) coordinates replication across replicas, while sharding splits data horizontally across nodes, allowing the cluster to scale reads and writes independently. The result is a database that processes hundreds of millions of rows per second per server, making it the default choice for teams running serious analytics workloads.

The Real Problem: It’s Not ClickHouse, It’s the Packaging

The standard clickhouse/clickhouse-server image is built on a full Ubuntu 22.04 base. The base ships with a lot of things ClickHouse doesn’t need, such as Perl, system utilities, apt itself, and dozens of transitive dependencies. They exist in the image simply because Ubuntu brought them along, often as outdated packages, and in many cases Ubuntu maintainers decide not to backport fixes from upstream.

ClickHouse doesn’t use most of those system utilities. But the CVEs in those packages are real. They show up in Trivy, Grype, and AWS ECR scans, and a scanner has no way to distinguish a vulnerable library that’s never loaded from one that’s actively running in production. Your security team sees critical findings and blocks the deployment, which is the correct thing for them to do given what the scanner is telling them.

The instinct at this point is to argue the case, documenting why each CVE doesn’t apply to your workload, writing risk exceptions and escalating, but that’s a slow process. The only real fix is to remove those unnecessary packages entirely. That’s what Docker Hardened Images do.

What DHI Actually Changes

Docker Hardened Images for ClickHouse are built around a straightforward question: what does the database actually need to run? Rather than starting from a full Ubuntu base and hoping the CVE count stays manageable, DHI ships only what ClickHouse requires and leaves everything else out.

The most immediate consequence of that is the absence of apt at runtime. Without a package manager, an attacker who gains a foothold in the container has no obvious path to installing tools or establishing persistence. Network utilities like curl and wget are gone for the same reason. The standard clickhouse/clickhouse-server image has been carrying wget with CVE-2021-31879 unpatched since 2021 because, as the Ubuntu maintainer notes, there is no upstream fix: a vulnerability in a tool ClickHouse never needed in the first place. DHI doesn’t patch it; it simply doesn’t include wget at all. A shell is still available for operational work, but without the package manager and network tools, there’s very little an attacker can actually do with it.

To make this practical across different stages of a pipeline, DHI ships two variants. The development image (dev) includes additional tooling that makes local testing and debugging more comfortable. The production image (runtime) strips that back to the absolute minimum, giving you the smallest possible attack surface for the workload that actually faces the world. The intent is that teams adopt the dev variant early in the pipeline and promote the hardened production image through to deployment, rather than discovering the differences at the point where it matters most.

The image also runs as a non-root user (UID 65532) out of the box, with no additional Dockerfile configuration required. On the provenance side, every DHI image ships with SLSA Level 3 attestation, which provides cryptographic proof of exactly what went into the build and how it was produced. Docker’s security team actively tracks and patches CVEs, and the presence of 2026 CVE IDs in DHI’s findings is evidence of that remediation happening ahead of public disclosure feeds rather than in response to them.

Getting Started

Before you can pull a DHI image, you need to mirror it to your organization’s namespace on Docker Hub. This is a one-time setup per image, not per tag, and it means all future updates flow to your namespace automatically.

Log in to Docker Hub and open the DHI catalog

Find clickhouse-server and select Mirror to repository

Follow the on-screen instructions

Authenticate locally: docker login dhi.io

Once that’s done, you’re pulling from your own namespace with the same image, same tags, same ClickHouse – just hardened.

Your first DHI ClickHouse container

docker run --name my-clickhouse-server -d \
  --ulimit nofile=262144:262144 \
  dhi.io/clickhouse-server:26.2-debian13

The --ulimit nofile=262144:262144 flag is a ClickHouse requirement, not a DHI one: ClickHouse needs high file descriptor limits to operate correctly. Keep it in all your run commands.

Verify it started:

docker exec my-clickhouse-server clickhouse-client \
  --query "SELECT 'Hello from DHI ClickHouse!'"

Production setup with persistent storage

For anything beyond local testing, you want volumes and a password:

docker run -d \
  --name my-clickhouse-server \
  --ulimit nofile=262144:262144 \
  -e CLICKHOUSE_PASSWORD=mysecretpassword \
  -v clickhouse-data:/var/lib/clickhouse \
  -v clickhouse-logs:/var/log/clickhouse-server \
  -p 8123:8123 -p 9000:9000 \
  dhi.io/clickhouse-server:26.2-debian13

Note that CLICKHOUSE_PASSWORD is required if you want to access ClickHouse over the network. DHI disables unauthenticated network access by default, which is the right call for any production deployment.

Test it over HTTP:

curl "http://localhost:8123/?query=SELECT%20version()&user=default&password=mysecretpassword"

Custom configuration

If you’re already running ClickHouse with custom XML config, nothing changes. Same format, same mount path:

cat > custom-config.xml << EOF
<clickhouse>
  <logger>
    <level>information</level>
    <console>true</console>
  </logger>
  <listen_host>0.0.0.0</listen_host>
</clickhouse>
EOF

docker run -d \
  --name my-clickhouse-server \
  --ulimit nofile=262144:262144 \
  -v "$(pwd)/custom-config.xml:/etc/clickhouse-server/config.d/custom.xml:ro" \
  -p 8123:8123 -p 9000:9000 \
  dhi.io/clickhouse-server:26.2-debian13

Running DHI ClickHouse on Kubernetes

For Kubernetes, there’s one important addition to your pod spec. Since DHI runs as a non-root user, you need to set fsGroup to ensure your persistent volume data is accessible:

spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 65532   # DHI nonroot user
        fsGroup: 65532     # makes mounted volumes accessible to the nonroot user
      containers:
        - name: clickhouse-server
          image: dhi.io/clickhouse-server:26.2-debian13
          ports:
            - containerPort: 8123
            - containerPort: 9000
          volumeMounts:
            - name: clickhouse-data
              mountPath: /var/lib/clickhouse
            - name: clickhouse-logs
              mountPath: /var/log/clickhouse-server
          resources:
            limits:
              cpu: "2"
              memory: "4Gi"

One thing worth mentioning: ClickHouse’s default ports 8123 and 9000 are above the 1024 privileged port boundary, so running as nonroot doesn’t cause any port binding issues.

The metrics exporter

If you’re running ClickHouse on Kubernetes and need Prometheus metrics, Docker also ships clickhouse-metrics-exporter – a hardened image that works with the ClickHouse Operator to expose a /metrics endpoint. It’s 65% smaller than the standard exporter (10.3 MB vs 29.4 MB) and has 75% fewer layers (5 vs 20). Same data, dramatically smaller surface.

containers:
  - name: metrics-exporter
    image: dhi.io/clickhouse-metrics-exporter:0-debian13
    ports:
      - name: metrics
        containerPort: 8888
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
      requests:
        cpu: 50m
        memory: 64Mi
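The size and layer percentages quoted for the metrics exporter follow directly from the raw figures. As a quick check, the sketch below recomputes them with awk; percent_reduction is a helper name introduced here for illustration.

```shell
#!/bin/sh
# percent_reduction OLD NEW
# Prints, as a whole number, how much smaller NEW is than OLD in percent.
percent_reduction() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (a - b) / a * 100 }'
}

percent_reduction 29.4 10.3   # image size in MB -> 65
percent_reduction 20 5        # layer count      -> 75
```

Both results match the figures in the text: (29.4 - 10.3) / 29.4 is about 0.65, and (20 - 5) / 20 is exactly 0.75.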

Debugging without the usual tools

The debugging story is simpler than it might seem. docker debug attaches an ephemeral layer to the running container that includes bash, curl, strace, vim, and anything else you need without modifying the production image itself. When you exit, the layer disappears and the container is exactly as it was. It’s a cleaner approach than shelling directly into a production container, and in practice it’s a single command:

docker debug my-clickhouse-server

Or if you prefer, you can mount a debug image alongside the container:

docker run --rm -it --pid container:my-clickhouse-server \
  --mount=type=image,source=<your-namespace>/dhi-busybox,destination=/dbg,ro \
  dhi.io/clickhouse-server:26.2-debian13 /dbg/bin/sh

There’s also a broader security benefit that goes beyond CVE counts. If something does go wrong in production, an attacker who gets into the container finds no package manager to install tools with, no curl or wget to exfiltrate data through, and no obvious path to reach out to the network, which significantly limits what a compromise can actually turn into.

ClickHouse: Non-hardened Image vs. Hardened Image Compared

A Docker Scout scan of both images puts the difference in plain numbers. Built on ubuntu:22.04, the standard image carries 8 medium and 11 low severity vulnerabilities across 111 packages, including the wget and tar findings that are most likely to trigger a security block in an enterprise pipeline. The DHI image eliminates all medium severity findings and comes in at 14 low severity items, but these are in core system libraries like glibc and openssl where no fix exists on any distribution, not in unnecessary utilities that had no business being in the image. The 3 unconfirmed findings that Scout surfaces have already been assessed and suppressed via VEX attestation, which ships with the image as part of its SLSA Level 3 provenance.

To view the difference between versions for any other image, you can run your own scan with Docker Scout for a quick comparison using this command:

docker scout quickview clickhouse/clickhouse-server:latest

docker pull dhi.io/clickhouse-server:26.2-debian13
docker tag dhi.io/clickhouse-server:26.2-debian13 clickhouse-dhi:latest
docker scout quickview clickhouse-dhi:latest

| Aspect | Non-Hardened ClickHouse Image | Docker Hardened Image |
|---|---|---|
| Default user | root (steps down to clickhouse user at runtime via entrypoint, but Dockerfile has no USER directive; overridable with CLICKHOUSE_RUN_AS_ROOT=1) | nonroot (enforced at image level via USER directive; cannot be overridden at runtime) |
| Shell access | Full shell (bash/sh) available | bash present, no network tools or package manager |
| Package manager | apt available | No package manager |
| CVE exposure | Ships wget (CVE-2021-31879, unpatched since 2021), tar (CVE-2025-45582) | No wget, no tar: unnecessary packages removed entirely |
| CVE patching | Unpatched findings from 2021–2025 due to the lack of upstream fixes from Ubuntu base image | Actively tracked, 2026 CVE IDs show proactive remediation |
| Provenance | Standard | SLSA Level 3 attestation |
| Compliance | Manual hardening required | CIS, NIST, FedRAMP-aligned |
| Debugging | Traditional shell debugging | Use docker debug or Image Mount for troubleshooting |

The Security Team Conversation

The team that got blocked at AWS ECR in November 2025 didn’t have a ClickHouse problem; they had a base image problem. Their database was fine; what the scanner was finding were CVEs in Perl, system utilities, and other packages that had come along in the Ubuntu base and were never used by the application. Nothing in the scanner output made that distinction, so the security team did exactly what they were supposed to do and blocked the deployment.

With DHI, that conversation with your security team becomes considerably more straightforward. Rather than building a case for why specific CVEs don’t apply to your workload, you can point to an image built by Docker’s security team from the minimum required components, with SLSA Level 3 provenance and independent validation by SRLabs. The ClickHouse runtime itself is unchanged: queries, ports, configuration files, and performance all carry over, so the only thing you’re actually changing is the answer you can give when someone asks whether this image can go to production.

For teams that need stronger guarantees, DHI Enterprise adds SLA-backed CVE remediation within seven days, FIPS and STIG variants, and extended lifecycle support. For most teams, the free Enterprise trial is the right starting point. It answers the question that actually matters before you commit to anything. Interested in learning more? Start with this blog that walks through the trial and sets you up for success.

Migration Checklist

☐ Mirror clickhouse-server DHI image to your Docker Hub namespace (one-time setup)
☐ Update your image reference to dhi.io/clickhouse-server:26.2-debian13
☐ Set CLICKHOUSE_PASSWORD (required for network access in DHI)
☐ Keep --ulimit nofile=262144:262144 on all run commands
☐ In Kubernetes: add fsGroup: 65532 to your pod securityContext
☐ Switch from kubectl exec to kubectl debug for troubleshooting
☐ Run trivy against both images to see the difference yourself:
trivy image clickhouse/clickhouse-server:latest
trivy image dhi.io/clickhouse-server:26.2-debian13

The migration is narrower in scope than it might appear: your volume mounts, port mappings, and existing XML configuration files all carry over without modifications, and on Kubernetes the only structural addition is the fsGroup security context. Everything else is an image reference change.

Resources

Docker Hardened Images Documentation

DHI ClickHouse Server Guide

DHI ClickHouse Metrics Exporter Guide

Docker Debug Documentation

Free DHI Catalog

DHI Community Announcement

Docker Scout Documentation


Trivy, KICS, and the shape of supply chain attacks so far in 2026

Catching the KICS push: what happened, and the case for open, fast collaboration

In the past few weeks we’ve worked through two supply chain compromises on Docker Hub with a similar shape: first Trivy, now Checkmarx KICS. In both cases, stolen publisher credentials were used to push malicious images through legitimate publishing flows. In both cases, Docker’s infrastructure was not breached. And in both cases, the software supply chain of everyone who pulled the compromised tags was briefly exposed.

This is our account of what happened with KICS, what affected users should do, and what the pattern says about where defenders need to invest.

What happened

On April 22, 2026 at approximately 12:35 UTC, a threat actor authenticated to Docker Hub using valid Checkmarx publisher credentials and pushed malicious images to the checkmarx/kics repository. Five existing tags were overwritten to malicious digests (latest, v2.1.20, v2.1.20-debian, alpine, debian) and two new tags (v2.1.21, v2.1.21-debian) were created. The images were built from an attacker-controlled source repository, not from Checkmarx’s.

The poisoned binary kept the legitimate scanning surface intact and added a quiet exfiltration path. Scan output was collected, encrypted, and sent to attacker-controlled infrastructure at audit.checkmarx[.]cx, with the User-Agent KICS-Telemetry/2.0. Because KICS scans Terraform, CloudFormation, Kubernetes and similar configuration files, its output routinely contains secrets, credentials, cloud resource names, and internal topology. 

Affected malicious digests (any one of these in your pull history should be treated as malicious):

For alpine, v2.1.20, v2.1.21 -> Index manifest digest: sha256:2588a44890263a8185bd5d9fadb6bc9220b60245dbcbc4da35e1b62a6f8c230d

Image digest (amd64): sha256:d186161ae8e33cd7702dd2a6c0337deb14e2b178542d232129c0da64b1af06e4
Image digest (arm64): sha256:415610a42c5b51347709e315f5efb6fffa588b6ebc1b95b24abf28088347791b

For debian, v2.1.20-debian, v2.1.21-debian -> Index manifest digest: sha256:222e6bfed0f3bb1937bf5e719a2342871ccd683ff1c0cb967c8e31ea58beaf7b

Image digest (amd64): sha256:a6871deb0480e1205c1daff10cedf4e60ad951605fd1a4efaca0a9c54d56d1cb
Image digest (arm64): sha256:ff7b0f114f87c67402dfc2459bb3d8954dd88e537b0e459482c04cffa26c1f07

For latest -> Index manifest digest: sha256:a0d9366f6f0166dcbf92fcdc98e1a03d2e6210e8d7e8573f74d50849130651a0

Image digest (amd64): sha256:26e8e9c5e53c972997a278ca6e12708b8788b70575ca013fd30bfda34ab5f48f
Image digest (arm64): sha256:7391b531a07fccbbeaf59a488e1376cfe5b27aef757430a36d6d3a087c610322

If your CI ran kics against any repository with credentials in scope during the exposure window, rotate those credentials now. Re-pull checkmarx/kics by digest, not tag, and pin your CI to the digest so a future overwrite cannot silently affect you again. Purge the malicious digests from local caches, CI runners, pull-through registries, and mirrors: a clean pull won’t remove what’s already been cached. Check egress logs for connections to audit.checkmarx[.]cx, or outbound traffic with the KICS-Telemetry/2.0 User-Agent, which are strong indicators that exfiltration occurred on your infrastructure.
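To check a pull history or cache listing against the digests above, a small filter is enough. The sketch below is one possible approach, not an official tool: the function names are introduced here for illustration, and the digest list is copied from the section above. It reads one digest per line and reports any that match.

```shell
#!/bin/sh
# Malicious checkmarx/kics digests listed in this post
# (index manifests plus per-architecture image digests).
MALICIOUS_KICS_DIGESTS='sha256:2588a44890263a8185bd5d9fadb6bc9220b60245dbcbc4da35e1b62a6f8c230d
sha256:d186161ae8e33cd7702dd2a6c0337deb14e2b178542d232129c0da64b1af06e4
sha256:415610a42c5b51347709e315f5efb6fffa588b6ebc1b95b24abf28088347791b
sha256:222e6bfed0f3bb1937bf5e719a2342871ccd683ff1c0cb967c8e31ea58beaf7b
sha256:a6871deb0480e1205c1daff10cedf4e60ad951605fd1a4efaca0a9c54d56d1cb
sha256:ff7b0f114f87c67402dfc2459bb3d8954dd88e537b0e459482c04cffa26c1f07
sha256:a0d9366f6f0166dcbf92fcdc98e1a03d2e6210e8d7e8573f74d50849130651a0
sha256:26e8e9c5e53c972997a278ca6e12708b8788b70575ca013fd30bfda34ab5f48f
sha256:7391b531a07fccbbeaf59a488e1376cfe5b27aef757430a36d6d3a087c610322'

# Returns 0 if the digest given as $1 is on the list, 1 otherwise.
is_malicious_kics_digest() {
  [ -n "$1" ] || return 1
  printf '%s\n' "$MALICIOUS_KICS_DIGESTS" | grep -qxF -- "$1"
}

# Reads one digest per line on stdin and prints every match.
# Returns 0 if nothing matched, 1 if at least one digest is malicious.
scan_digests() {
  scan_hits=0
  while IFS= read -r scan_digest; do
    if is_malicious_kics_digest "$scan_digest"; then
      echo "MALICIOUS: $scan_digest"
      scan_hits=$((scan_hits + 1))
    fi
  done
  [ "$scan_hits" -eq 0 ]
}
```

As a usage example, assuming a local Docker engine, the digests of cached images can be piped in with `docker images --digests --format '{{.Digest}}' checkmarx/kics | scan_digests`. The same filter works on digests extracted from CI logs or a registry mirror's inventory.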

The affected digests are disabled, the repository has been restored to its last known-good state, and pulls of checkmarx/kics today return the legitimate March 3, 2026 image. The publisher account used to push the malicious images has been suspended, and we’ve notified the small number of users our telemetry shows pulled the compromised digests.

Socket’s technical analysis of the issue is linked under Further reading. Their post also covers what appears to be a broader Checkmarx compromise, including recent VS Code extension releases, which is worth reading if your developers use those extensions.

How we caught this breach

Within about half an hour of the push, a new image on a repository we monitor triggered a review. A check against the upstream source found no matching release, and the provenance showed the image had been built from a different source repository created one day before the push. That was enough to quarantine the repository and start forensics with Socket and Checkmarx.

The defense is in correlation, not any single signal. In this episode, we found a new tag without an upstream release, provenance from an unfamiliar source, and a timing pattern that did not appear to match normal publishing behavior. Since we happened to see these signals together, they bought us a narrow window in which to act. Note that layered defense shortens the window between push and takedown; it does not prevent the push.
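One way to picture correlation is a check that escalates only when several weak signals coincide. The sketch below is purely illustrative and is not Docker's detection logic: the function name, the 0/1 inputs, and the two-of-three threshold are all assumptions made for the example.

```shell
#!/bin/sh
# correlate_signals NO_UPSTREAM_RELEASE UNFAMILIAR_PROVENANCE ODD_TIMING
# Each argument is 1 (signal present) or 0 (absent).
# Hypothetical policy: two or more signals -> quarantine,
# exactly one -> review, none -> ok.
correlate_signals() {
  signal_score=$(( $1 + $2 + $3 ))
  if [ "$signal_score" -ge 2 ]; then
    echo "quarantine"
  elif [ "$signal_score" -eq 1 ]; then
    echo "review"
  else
    echo "ok"
  fi
}

correlate_signals 1 1 1
```

With all three signals present, as in this incident, the sketch prints "quarantine". A single signal on its own only asks for a closer look, which reflects the point that no individual signal is decisive.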

The bar for this kind of attack has collapsed

The uncomfortable thing about this incident, and Trivy before it, is how little sophistication incidents such as these require these days. A stolen credential from an IDE extension compromise, a target chosen from a public profile, a push through the normal publishing flow, and the attacker is inside the software supply chain of every organization that pulls that tag. Our assumption is this attack did not require any zero-days, novel tradecraft, or nation-state level budgets. The ingredients are stolen credentials and time, and both are abundant right now.

Every registry, every package manager, and every publisher of any consequence is in the firing line, including Docker. This isn’t a Checkmarx problem or a Hub problem or an npm problem. It’s the new baseline, and defenders who aren’t planning for it as the default case are already behind.

There are two implications for our ecosystem.

First, credential hygiene at the publishing boundary matters more than it used to: fine-grained tokens scoped to a single registry, shorter credential lifetimes, clean separation between personal and publisher identities.

Second, no single layer will catch all of this. Publishing-time verification, provenance, signatures, registry-side monitoring, deep package inspection (the kind Socket does to catch malicious behavior in dependencies), runtime egress controls, and cross-registry signal correlation each have to do some of the work, because any of them alone will miss cases the others catch.

A note on where this is structurally harder

In the Docker Hardened Images catalog, images are built by Docker from source, with verified provenance and signed releases produced through a hardened build pipeline. The class of attack described above, where a valid publisher credential pushes a tag that diverges from its upstream source, is structurally much harder to execute against an image built this way. There is no external credential that can substitute its way in; the provenance and the signatures have to match, or the image doesn’t ship. The DHI catalog is expanding, and we’re investing in this layer precisely because of the scenario and reasons explored in this blog. 

No one catches this alone

The reason this incident got caught quickly, the reason Socket was able to produce a technical analysis within hours, and the reason Checkmarx’s response could move in parallel with ours, is that all three teams shared signals and samples in real time. The Trivy response looked the same, as did the rapid notification to GitHub about the attacker-controlled source repository.

This is the posture the ecosystem needs more of, not less. Supply chain attackers are moving across registries, IDE marketplaces, source hosts, and CI systems in hours. Defenders who don’t share signals across those same boundaries are operating at a disadvantage. Formal standards for cross-registry coordination are still emerging, and they will matter eventually. What’s kept the windows short so far has been teams working with a spirit of openness, willingly sharing what they’re discovering, in real time.

Docker will keep investing in layered defenses on Hub, keep extending publishing-time verification to more of the catalog, and keep showing up to share signals, whether this is across a partner’s incident channel, a peer registry’s investigation, or the rooms where a more durable framework for coordination eventually takes shape.

We want to thank the Socket research team for fast, independent analysis, and Checkmarx for moving alongside us on a tight timeline for this one.

Further reading

Socket blog: https://socket.dev/blog/checkmarx-supply-chain-compromise

Docker Hardened Images on Docker Hub: https://hub.docker.com/hardened-images/catalog

Source: https://blog.docker.com/feed/

Why MicroVMs: The Architecture Behind Docker Sandboxes

Last week, we launched Docker Sandboxes with a bold goal: to deliver the strongest agent isolation in the market.

This post unpacks that claim, how microVMs enable it, and some of the architectural choices we made in this approach.

The Problem With Every Other Approach

Every sandboxing model asks you to give something up. We looked at the top four approaches.

Full VMs offer strong isolation, but general-purpose VMs weren’t designed for ephemeral, session-heavy agent workflows. Some VMs built for specific workloads can spin up more effectively on modern hardware, but the general-purpose VM experience (slow cold starts, heavy resource overhead) pushes developers toward skipping isolation entirely.

Containers are fast and are the way modern applications are built. But for an autonomous agent that needs to build and run its own Docker containers, which coding agents routinely do, you hit Docker-in-Docker, which requires elevated privileges that undermine the isolation you set up in the first place. Agents need a real Docker environment to do development work, and containers alone don’t give you that cleanly.

WASM / V8 isolates are fast to spin up, but the isolation model is fundamentally different. You’re running isolates, not operating systems. Even providers of isolate-based sandboxes have acknowledged that hardening V8 is difficult, and that security bugs in the V8 engine surface more frequently than in mature hypervisors. Beyond the security model, there’s a practical gap: your agent can’t install system packages or run arbitrary shell commands. For a coding agent that needs a real development environment, WASM isn’t one.

Not using any sandboxing is fast, obviously. It’s also a liability. One rm -rf, one leaked .env, one rogue network call, and the blast radius is your entire machine.

Why MicroVMs

Docker Sandboxes run each agent session inside a dedicated microVM with a private Docker daemon isolated by the VM boundary, and no path back to the host.

That one sentence contains three architectural decisions worth unpacking.

Dedicated microVM. Each sandbox gets its own kernel. It’s hardware-boundary isolation, the same kind you get from a full VM. A compromised or runaway agent can’t reach the host, other sandboxes, or anything outside its environment. If it tries to escape, it hits a wall.

Private, VM-isolated Docker daemon. This is the key differentiator for coding agents. AI is going to result in more container workloads, not fewer. Containers are how applications are developed, and agents need a Docker environment to do that development. Docker Sandboxes give each agent its own Docker daemon running inside a microVM, fully isolated by the VM boundary. Your agent gets full docker build, docker run, and docker compose support with no socket mounting, no host-level privileges, none of the security compromises other approaches require. This means we treat agents as we would a human developer, giving them a true developer environment so they can actually complete tasks across the SDLC.

No path back to the host. File access, network policies, and secrets are defined before the agent runs, not enforced by the agent itself. This is an important distinction. An LLM deciding its own security boundaries is not a security model. The bounding box has to come from infrastructure, not from a system prompt.
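The idea that boundaries are fixed before the agent runs, and checked by infrastructure rather than by the agent, can be sketched as an immutable policy object. Everything here is hypothetical: `SandboxPolicy`, its fields, and its methods are illustrative names, not the Docker Sandboxes API or configuration format.

```python
import posixpath
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)  # frozen: the policy cannot be mutated once the session starts
class SandboxPolicy:
    """Hypothetical policy object; field names are illustrative only."""
    allowed_dirs: tuple        # absolute directories the agent may access
    allowed_hosts: frozenset   # lowercase hostnames the agent may reach

    def allows_path(self, path: str) -> bool:
        # Collapse ".." segments first so "/work/repo/../../etc" cannot slip through.
        # This is purely lexical; a real enforcer must also handle symlinks.
        normalized = PurePosixPath(posixpath.normpath(path))
        return any(normalized.is_relative_to(d) for d in self.allowed_dirs)

    def allows_host(self, host: str) -> bool:
        # Hostnames are case-insensitive, so compare in lowercase.
        return host.lower() in self.allowed_hosts
```

In this sketch the agent never holds a reference it could use to widen the policy. The checks would run in whatever component mediates file and network access, outside the agent's control.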

Why We Built a New VMM

Choosing microVMs was the easy part. Running them where developers actually work was the hard part.

We looked hard at existing options, but none of them were designed for what we needed. Firecracker, the most well-known microVM runtime, was designed for cloud infrastructure, specifically Linux/KVM environments like AWS Lambda. It has no native support for macOS or Windows, full stop. That’s fine for server-side workloads, but coding agents don’t run in the cloud. They run on developer laptops, across macOS, Windows, and Linux. 

We could have shimmed an existing VMM into working across platforms, creating translation layers on macOS and workarounds on Windows, but bolting cross-platform support onto a Linux-first VMM means fighting abstractions that were never designed for it. That’s how you end up with fragile, layered workarounds that break the “it just works” promise and create the friction that makes developers skip sandboxing altogether.

So we built a new VMM, purpose-built for where coding agents actually run.

It runs natively on all three platforms using each OS’s native hypervisor: Apple’s Hypervisor.framework, Windows Hypervisor Platform, and Linux KVM. A single codebase for three platforms and zero translation layers.
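The per-OS backend selection described above amounts to a single dispatch on the host platform. The sketch below is illustrative only and is not code from the VMM: the hypervisor names come from the text, while the function and table names are hypothetical. The keys are the strings Python's `platform.system()` returns on each OS.

```python
import platform

# Hypervisor names are taken from the text; keys are the values
# platform.system() reports on macOS, Windows, and Linux.
NATIVE_HYPERVISORS = {
    "Darwin": "Hypervisor.framework",
    "Windows": "Windows Hypervisor Platform",
    "Linux": "KVM",
}


def native_hypervisor(system=None) -> str:
    """Return the native hypervisor for the given (or current) OS."""
    system = system or platform.system()
    try:
        return NATIVE_HYPERVISORS[system]
    except KeyError:
        # No fallback or translation layer: unsupported hosts fail loudly.
        raise RuntimeError(f"no supported hypervisor for {system!r}") from None
```

The absence of a fallback branch mirrors the "zero translation layers" claim: each platform maps directly to its own hypervisor interface, or not at all.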

This matters because it means agents get kernel-level isolation optimized for each specific OS. Cold starts are fast because there’s no abstraction tax. A developer on a MacBook gets the same isolation guarantees and startup performance as a developer on a Linux workstation or a Windows machine.

Building a VMM from scratch is not a small undertaking. But the alternative, asking developers to accept slower starts, degraded compatibility, or platform-specific caveats, is exactly the kind of asterisk that makes people run agents on the host instead. Our approach removes that asterisk at the hypervisor level.

Fast Cold Starts

We rebuilt the virtualization layer from scratch, optimizing for fast spin-up and fast teardown. This matters for one reason: if the sandbox is slow, developers skip it. Every friction point between “start agent” and “agent is running” is a reason to run on the host instead. With near-instant starts, there is no performance reason to run outside the sandbox.

What This Means In Practice

Here’s the concrete version of what this architecture gives you:

Full development environment. Agents can clone repos, install dependencies, run test suites, build Docker images, spin up multi-container services, and open pull requests, all inside the sandbox. Nothing is stubbed out or simulated. Agents are treated as developers and given what they need to complete tasks end to end. 

Scoped access, not all-or-nothing. You define the boundary: exactly which files and directories the agent can see, which network endpoints it can reach, and which secrets it receives. Credentials are injected at runtime, outside the microVM boundary, and are never baked into the environment.

Disposable by design. If an agent goes off track, delete the sandbox and start fresh in seconds. There is no state to clean up and nothing to roll back on your host.

Works with every major agent. Claude Code, Codex, OpenCode, GitHub Copilot, Gemini CLI, Kiro, Docker Agent, and next-generation autonomous systems like OpenClaw and NanoClaw. Same isolation, same speed, one sandbox model across all of them.

For Teams

Individual developers can install and run Docker Sandboxes today, standalone, no Docker Desktop license required. 

For teams that want centralized filesystem and network policies that can be enforced across an organization and scale sandboxed execution, get in touch to learn about enterprise deployment.

The Tradeoff That Isn’t

The pitch for sandboxing has always come with an asterisk: yes, it’s safer, but you’ll pay for it in speed, compatibility, or workflow friction.

MicroVMs eliminate that asterisk. You get VM-grade isolation with cold starts fast enough that there’s no reason to skip it, and full Docker support inside the sandbox. There is no tradeoff.

Your agents should be running autonomously. They just shouldn’t be running without any guardrails.

Use Sandboxes in Seconds

Install Sandboxes with a single command.

macOS: brew install docker/tap/sbx

Windows: winget install Docker.sbx

Read the docs to learn more.

Source: https://blog.docker.com/feed/