Agentic cloud operations: A new way to run the cloud

Cloud operations have reached an inflection point. For more than a decade, the industry has focused on scale—more infrastructure, more data, more services, more dashboards to build and manage both infrastructure and applications. While today’s cloud delivers extraordinary flexibility, the rapid growth of modern applications and AI workloads has introduced levels of scale and complexity that traditional operations were not designed for.

See how you can run agents with Azure Copilot

As modern applications and AI workloads expand in scale, speed, and interconnectedness, operational demands are evolving just as quickly. Organizations are now looking for an operating model that builds on their existing practices—one that brings intelligence into the flow of work and translates the constant stream of signals into coordinated action across the cloud lifecycle.

A new operating model for a dynamic cloud

Macro trends are pointing towards major shifts in operations. In the era of AI, workloads can move from experimentation to full production in weeks, making constant change the new norm. Infrastructure and applications are continuously updated, scaled, and reconfigured. Telemetry now streams from every layer—health, configuration, cost, performance, and security—while programmable infrastructure enables action at machine speed. At the same time, AI agents are emerging as practical operational partners—able to correlate signals, understand context, and take action within defined guardrails. Together, these shifts are driving the need for a new operating model—one where operations are dynamic, context-aware, and continuously optimized rather than reactive and manual.

Introducing agentic cloud operations

Agentic cloud operations brings this model to life by enabling teams to harness AI-powered agents that infuse contextual intelligence into everyday workflows. These agents help accelerate development, migration, and optimization by connecting operational signals directly to coordinated action across the lifecycle. They bring people, tools, and data together, so insights don’t stay passive—they drive execution. The result is faster performance, reduced risk, and cloud operations that improve over time instead of falling behind as complexity grows.

Azure Copilot: The agentic interface

Azure Copilot brings agentic cloud operations to life as the agentic interface for Azure. Rather than adding yet another dashboard, it delivers a unified, immersive experience grounded in a customer’s real environment—subscriptions, resources, policies, and operational history. Teams can work through natural language, chat, console, or CLI, invoking agents directly within their workflows. A centralized management environment brings observability, configuration, resiliency, optimization, and security together—enabling operators to move seamlessly from insight to action in one place.

Full-lifecycle agents, working in context

At Ignite, we unveiled the agentic capabilities of Azure Copilot. These capabilities span key operational domains—migration, deployment, optimization, observability, resiliency, and troubleshooting—each designed to bring contextual intelligence into the flow of work. Azure Copilot correlates signals, understands operational context, and takes governed action where it matters. Rather than functioning as discrete bots, these agents operate as a coordinated, context-aware system that continuously strengthens cloud operations.

Plan and prepare

Azure Copilot and its agents help teams start with clarity and confidence. Copilot migration agent can assist with discovering existing environments, mapping application and infrastructure dependencies, and identifying modernization paths before workloads move. Deployment agent then guides well-architected design and generates infrastructure-as-code artifacts that set strong operational patterns from the outset. In parallel, resiliency agent identifies gaps across availability, recovery, backup, and continuity—so reliability is designed in, not patched later.

Deploy and launch

When teams are ready to go live, Copilot deployment agent supports governed, repeatable deployment workflows that validate both infrastructure and application rollout. Observability agent establishes baseline health from the moment production traffic hits, while troubleshooting agent accelerates early-life issue resolution by diagnosing root causes, recommending fixes, and initiating support actions if needed. Throughout this phase, resiliency agent verifies that recovery and failover configurations hold up under real-world conditions.

Operate, optimize, and evolve

In ongoing operations, Azure Copilot’s agentic capabilities deliver compounding value. Observability agent provides continuous, full-stack visibility and diagnosis across applications and infrastructure. Optimization agent identifies and executes improvements across cost, performance, and sustainability—often comparing financial and carbon impact in real time. Resiliency agent moves from validation to proactive posture management, continuously strengthening protection against emerging risks such as ransomware. Troubleshooting agent helps make the shift from reactive firefighting to rapid, context-aware incident resolution. Last but not least, migration agent reenters the lifecycle to identify new opportunities to refactor or evolve workloads—not as a one-time event, but as continuous modernization.

A connected system, not disparate tools

These capabilities don’t operate as isolated bots. They work within connected, context-aware workflows—correlating real time signals, understanding operational context, and taking governed action where it matters most. This allows teams to anticipate issues earlier, resolve them faster, and continuously improve their cloud posture across development, migration, and operations. The outcome isn’t fewer tools—it’s better flow, where people, data, and automation operate as a unified system.

Governance and human oversight by design

Agentic cloud operations are built for mission-critical systems, where governance and control are nonnegotiable. Azure Copilot embeds governance at every layer, allowing enterprises to define boundaries, apply policies consistently, and maintain clear oversight. Features such as Bring Your Own Storage (BYOS) for conversation history give customers even greater control—keeping operational data within their own Azure environment to ensure sovereignty, compliance, and visibility on their terms. All of this is grounded in Microsoft’s Responsible AI principles, ensuring autonomy and safety advance together. Every agent-initiated action honors existing policy, security, and RBAC controls. Actions are always reviewable, traceable, and auditable, ensuring human oversight remains central to automated workflows—not removed from them.

Operating with confidence as the cloud evolves

As cloud environments grow more dynamic and complex, operational models must evolve to match them. With Azure Copilot and agentic cloud operations, Microsoft is enabling organizations to operate mission-critical environments with greater speed, clarity, and control—providing the confidence to move forward as the cloud continues to change.

Explore more resources to deepen your understanding of agentic cloud operations

Access the white paper Intelligent Operations: How Agentic AI Is Aiming to Reshape IT.

Find resources, use cases, and get started with Azure Copilot.

From cloud to edge, see how Azure Copilot can help
Gain new insights, discover more benefits of the cloud, and orchestrate data across both the cloud and the edge.

Start here

Quelle: Azure

How to solve context size issues with context packing using Docker Model Runner and Agentic Compose

If you’ve worked with local language models, you’ve probably run into the context window limit, especially when using smaller models on less powerful machines. While it’s an unavoidable constraint, techniques like context packing make it surprisingly manageable.

Hello, I’m Philippe, and I am a Principal Solutions Architect helping customers with their use of Docker. In my previous blog post, I wrote about how to make a very small model useful by using RAG. In that example, I had limited the message history to 2 messages to keep the context length short.

But in some cases, you’ll need to keep more messages in your history. For example, a long conversation to generate code:

- generate an http server in golang
- add a human structure and a list of humans
- add a handler to add a human to the list
- add a handler to list all humans
- add a handler to get a human by id
- etc…

Let’s imagine we have a conversation for which we want to keep 10 messages in the history. Moreover, we’re using a very verbose model (one that generates a lot of tokens), so we’ll quickly encounter this type of error:

{
  error: {
    code: 400,
    message: 'request (8860 tokens) exceeds the available context size (8192 tokens), try increasing it',
    type: 'exceed_context_size_error',
    n_prompt_tokens: 8860,
    n_ctx: 8192
  },
  code: 400,
  param: undefined,
  type: 'exceed_context_size_error'
}

What happened?

Understanding context windows and their limits in local LLMs

Our LLM has a context window with a limited size. This means that if the conversation becomes too long, requests start failing with errors like the one above.

This window is the total number of tokens the model can process at once, like a short-term working memory. Read this IBM article for a deep dive on context windows.

In the example error above, this size was set to 8192 tokens by the LLM engine that powers the local model (Docker Model Runner, Ollama, llama.cpp, and so on).

This window includes everything: system prompt, user message, history, injected documents, and the generated response. Refer to this Redis post for more info. 

Example: if the model has 32k context, the sum (input + history + generated output) must remain ≤ 32k tokens. Learn more here.  
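To make that budget concrete, here is a minimal sketch (not taken from this post's application code) of how you might estimate whether a request will fit, using the same rough "about 4 characters per token" heuristic the assistant code later in this post relies on; the constants and function names are illustrative:

// Illustrative only: the context size and output reservation are example values,
// and the ~4 characters per token ratio is a crude approximation.
const CONTEXT_SIZE = 8192;        // tokens the engine is configured for
const RESERVED_FOR_OUTPUT = 1024; // room left for the generated response

function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// history is a list of [role, content] pairs, as in the rest of this post
function fitsInContext(systemPrompt, history, userMessage) {
  const inputTokens =
    estimateTokens(systemPrompt) +
    history.reduce((acc, [, content]) => acc + estimateTokens(content), 0) +
    estimateTokens(userMessage);
  return inputTokens + RESERVED_FOR_OUTPUT <= CONTEXT_SIZE;
}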

It’s possible to change the default context size (up or down) in the compose.yml file:

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m
    # Increased context size for better handling of larger inputs
    context_size: 16384

You can also do this with Docker with the following command: docker model configure --context-size 8192 ai/qwen2.5-coder

And so we solve the problem, but only part of the problem. Indeed, it’s not guaranteed that your model supports a larger context size (like 16384), and even if it does, it can very quickly degrade the model’s performance.

Thus, with hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m, when the number of tokens in the context approaches 16384 tokens, generation can become (much) slower (at least on my machine). Again, this will depend on the model’s capacity (read its documentation). And remember, the smaller the model, the harder it will be to handle a large context and stay focused.

Tip: always provide an option (a /clear command, for example) in your application to empty or trim the message list, whether automatically or manually. Keep the initial system instructions, though.
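As a hypothetical illustration of that tip (the helper below is not part of this post's index.js; the names are placeholders matching the post's [role, content] message format), the chat loop could intercept a /clear command before calling the model:

// Hypothetical helper: clears the stored history for a session while the
// system instructions, which are kept separately, remain untouched.
function handleUserCommand(userMessage, conversationMemory, sessionId = "default-session-id") {
  if (userMessage.trim() === "/clear") {
    conversationMemory.set(sessionId, []); // drop the message history
    console.log("Conversation history cleared.");
    return true;  // command handled, skip the model call for this turn
  }
  return false;   // not a command, continue with normal generation
}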

So we’re at an impasse. How can we go further with our small models?

Well, there is still a solution, which is called context packing.

Using context packing to fit more information into limited context windows

We can’t indefinitely increase the context size. To still fit more information in the context, we can use a technique called “context packing”, which consists of having the model itself summarize previous messages (or entrusting the task to another model) and replacing the history with this summary, thus freeing up space in the context.

So we decide that from a certain token limit, we’ll have the history of previous messages summarized, and replace this history with the generated summary.

I’ve therefore modified my example to add a context packing step. For the exercise, I decided to use another model to do the summarization.

Modification of the compose.yml file

I added a new model in the compose.yml file: ai/qwen2.5:1.5B-F16

models:
  chat-model:
    model: hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:q4_k_m

  embedding-model:
    model: ai/embeddinggemma:latest

  context-packing-model:
    model: ai/qwen2.5:1.5B-F16

Then:

- I added the model to the models section of the service that runs our program.
- I increased the number of messages in the history to 10 (instead of 2 previously).
- I set a token limit of 5120 before triggering context compression.
- Finally, I defined instructions for the “context packing” model, asking it to summarize previous messages.

excerpt from the service:

golang-expert-v3:
  build:
    context: .
    dockerfile: Dockerfile
  environment:
    HISTORY_MESSAGES: 10
    TOKEN_LIMIT: 5120
    # …
  configs:
    - source: system.instructions.md
      target: /app/system.instructions.md
    - source: context-packing.instructions.md
      target: /app/context-packing.instructions.md
  models:
    chat-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_CHAT
    context-packing-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_CONTEXT_PACKING
    embedding-model:
      endpoint_var: MODEL_RUNNER_BASE_URL
      model_var: MODEL_RUNNER_LLM_EMBEDDING

You’ll find the complete version of the file here: compose.yml

System instructions for the context packing model

Still in the compose.yml file, I added a new system instruction for the “context packing” model, in a context-packing.instructions.md file:

context-packing.instructions.md:
  content: |
    You are a context packing assistant.
    Your task is to condense and summarize provided content to fit within token limits while preserving essential information.
    Always:
    - Retain key facts, figures, and concepts
    - Remove redundant or less important details
    - Ensure clarity and coherence in the condensed output
    - Aim to reduce the token count significantly without losing critical information

    The goal is to help fit more relevant information into a limited context window for downstream processing.

All that’s left is to implement the context packing logic in the assistant’s code.

Applying context packing to the assistant’s code

First, I define the connection with the context packing model in the Setup part of my assistant:

const contextPackingModel = new ChatOpenAI({
  model: process.env.MODEL_RUNNER_LLM_CONTEXT_PACKING || `ai/qwen2.5:1.5B-F16`,
  apiKey: "",
  configuration: {
    baseURL: process.env.MODEL_RUNNER_BASE_URL || "http://localhost:12434/engines/llama.cpp/v1/",
  },
  temperature: 0.0,
  top_p: 0.9,
  presencePenalty: 2.2,
});

I also retrieve the system instructions I defined for this model, as well as the token limit:

let contextPackingInstructions = fs.readFileSync('/app/context-packing.instructions.md', 'utf8');

let tokenLimit = parseInt(process.env.TOKEN_LIMIT) || 7168

Once in the conversation loop, I estimate the number of tokens consumed by previous messages. If this number exceeds the defined limit, I call the context packing model to summarize the history of previous messages and replace that history with the generated summary (stored as an assistant-type message: ["assistant", summary]). Then I continue generating the response using the main model.

excerpt from the conversation loop:

let estimatedTokenCount = messages.reduce((acc, [role, content]) => acc + Math.ceil(content.length / 4), 0);
console.log(` Estimated token count for messages: ${estimatedTokenCount} tokens`);

if (estimatedTokenCount >= tokenLimit) {
  console.log(` Warning: Estimated token count (${estimatedTokenCount}) exceeds the model's context limit (${tokenLimit}). Compressing conversation history…`);

  // Calculate original history size
  const originalHistorySize = history.reduce((acc, [role, content]) => acc + Math.ceil(content.length / 4), 0);

  // Prepare messages for context packing
  const contextPackingMessages = [
    ["system", contextPackingInstructions],
    ...history,
    ["user", "Please summarize the above conversation history to reduce its size while retaining important information."]
  ];

  // Generate summary using context packing model
  console.log(" Generating summary with context packing model…");
  let summary = '';
  const summaryStream = await contextPackingModel.stream(contextPackingMessages);
  for await (const chunk of summaryStream) {
    summary += chunk.content;
    process.stdout.write('\x1b[32m' + chunk.content + '\x1b[0m');
  }
  console.log();

  // Calculate compressed size
  const compressedSize = Math.ceil(summary.length / 4);
  const reductionPercentage = ((originalHistorySize - compressedSize) / originalHistorySize * 100).toFixed(2);

  console.log(` History compressed: ${originalHistorySize} tokens → ${compressedSize} tokens (${reductionPercentage}% reduction)`);

  // Replace all history with the summary
  conversationMemory.set("default-session-id", [["assistant", summary]]);

  estimatedTokenCount = compressedSize;

  // Rebuild messages with compressed history
  messages = [
    ["assistant", summary],
    ["system", systemInstructions],
    ["system", knowledgeBase],
    ["user", userMessage]
  ];
}

You’ll find the complete version of the code here: index.js

All that’s left is to test our assistant and have it hold a long conversation, to see context packing in action.

docker compose up --build -d
docker compose exec golang-expert-v3 node index.js

And after a while in the conversation, you should see the warning message about the token limit, followed by the summary generated by the context packing model, and finally, the reduction in the number of tokens in the history:

Estimated token count for messages: 5984 tokens
Warning: Estimated token count (5984) exceeds the model's context limit (5120). Compressing conversation history…
Generating summary with context packing model…
Sure, here's a summary of the conversation:

1. The user asked for an example in Go of creating an HTTP server.
2. The assistant provided a simple example in Go that creates an HTTP server and handles GET requests to display "Hello, World!".
3. The user requested an equivalent example in Java.
4. The assistant presented a Java implementation that uses the `java.net.http` package to create an HTTP server and handle incoming requests.

The conversation focused on providing examples of creating HTTP servers in both Go and Java, with the goal of reducing the token count while retaining essential information.
History compressed: 4886 tokens → 153 tokens (96.87% reduction)

This way, we ensure that our assistant can handle a long conversation while maintaining good generation performance.

Summary

The context window is an unavoidable constraint when working with local language models, particularly with small models and on machines with limited resources. However, by using techniques like context packing, you can easily work around this limitation. Using Docker Model Runner and Agentic Compose, you can implement this pattern to support long, verbose conversations without overwhelming your model.

All the source code is available on Codeberg: context-packing. Give it a try! 
Quelle: https://blog.docker.com/feed/

AWS Backup adds cross-Region database snapshot copy to logically air-gapped vaults

AWS Backup now supports single-action database snapshot copies to logically air-gapped vaults across AWS Regions. This capability is available for Amazon Aurora, Amazon Neptune, and Amazon DocumentDB snapshots, eliminating the need for an intermediate copying step in target Regions.
You can perform cross-Region and cross-account snapshot copies to protect against incidents like ransomware events and Region outages that might affect your production accounts or primary Regions. Previously, this required a two-step process—first copying snapshots to the target Region in a backup vault, then copying them to the logically air-gapped vault in the same Region. Now, you can complete this in one step, achieving faster recovery point objectives (RPOs) while eliminating costs associated with intermediate copies. This streamlined process also removes the need for custom scripts or AWS Lambda functions to monitor intermediate copy status.
This feature is available for Amazon Aurora, Amazon Neptune and Amazon DocumentDB, in all Regions where AWS Backup supports these databases and logically air-gapped vaults. You can start using this feature today through the AWS Management Console, AWS Command Line Interface (CLI), or AWS SDKs. To get started, refer to the AWS Backup documentation.
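As a rough sketch of what the single-step cross-Region copy might look like with the AWS SDK for JavaScript (v3), assuming the @aws-sdk/client-backup package and placeholder ARNs, vault names, and IAM role (check the AWS Backup documentation for the exact values and permissions your account needs):

import { BackupClient, StartCopyJobCommand } from "@aws-sdk/client-backup";

// Illustrative values only: the recovery point, vaults, account ID, and role are placeholders.
// The destination is a logically air-gapped vault in another Region.
const client = new BackupClient({ region: "us-east-1" }); // source Region

const response = await client.send(new StartCopyJobCommand({
  RecoveryPointArn: "arn:aws:rds:us-east-1:111122223333:cluster-snapshot:example-aurora-snapshot",
  SourceBackupVaultName: "source-vault",
  DestinationBackupVaultArn: "arn:aws:backup:eu-west-1:111122223333:backup-vault:example-airgapped-vault",
  IamRoleArn: "arn:aws:iam::111122223333:role/service-role/AWSBackupDefaultServiceRole",
}));

console.log("Copy job started:", response.CopyJobId);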
Quelle: aws.amazon.com

Announcing new Amazon EC2 general purpose M8azn instances

AWS is announcing the general availability of new Amazon EC2 M8azn instances, general-purpose, high-frequency, high-network instances powered by fifth-generation AMD EPYC (formerly code named Turin) processors, offering the highest maximum CPU frequency in the cloud at 5 GHz. M8azn instances offer up to 2x the compute performance of previous-generation M5zn instances and up to 24% higher performance than M8a instances. They deliver up to 4.3x higher memory bandwidth and a 10x larger L3 cache compared to M5zn instances, allowing latency-sensitive and compute-intensive workloads to achieve results faster. These instances also offer up to 2x the networking throughput and up to 3x the EBS throughput of M5zn instances.

Built on the AWS Nitro System using sixth-generation Nitro Cards, these instances are ideal for applications such as real-time financial analytics, high-performance computing, high-frequency trading (HFT), CI/CD, intensive gaming, and simulation modeling for the automotive, aerospace, energy, and telecommunication industries. M8azn instances feature a 4:1 ratio of memory to vCPU and are available in 9 sizes ranging from 2 to 96 vCPUs with up to 384 GiB of memory, including two bare metal variants.

M8azn instances are available in the following AWS Regions: US East (N. Virginia), US West (Oregon), Asia Pacific (Tokyo), and Europe (Frankfurt). Customers can purchase these instances via Savings Plans, On-Demand Instances, and Spot Instances. To get started, sign in to the AWS Management Console. For more information, visit the Amazon EC2 M8azn instance page.
Quelle: aws.amazon.com

Amazon RDS for PostgreSQL supports minor versions 18.2, 17.8, 16.12, 15.16 and 14.21

Amazon Relational Database Service (RDS) for PostgreSQL now supports the latest minor versions 18.2, 17.8, 16.12, 15.16, and 14.21. We recommend that you upgrade to the latest minor versions to fix known security vulnerabilities in prior versions of PostgreSQL and to benefit from the bug fixes added by the PostgreSQL community. This release also includes the new extension pg_stat_monitor, which enables you to collect performance metrics and evaluate query performance insights in a unified view.

You can upgrade your databases during scheduled maintenance windows using automatic minor version upgrades. To simplify operations at scale, enable automatic minor version upgrades and use the AWS Organizations Upgrade Rollout Policy to orchestrate thousands of upgrades in phases, upgrading development environments before production systems. You can also use Amazon RDS Blue/Green Deployments with physical replication to minimize downtime for minor version upgrades.

Amazon RDS for PostgreSQL makes it simple to set up, operate, and scale PostgreSQL deployments in the cloud. See Amazon RDS for PostgreSQL Pricing for pricing details and regional availability. Create or update a fully managed Amazon RDS database in the Amazon RDS Management Console or by using the AWS Command Line Interface (CLI).
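As a minimal sketch of triggering one of these minor version upgrades with the AWS SDK for JavaScript (v3), assuming the @aws-sdk/client-rds package; the instance identifier and target version below are placeholders:

import { RDSClient, ModifyDBInstanceCommand } from "@aws-sdk/client-rds";

// Illustrative values only: instance identifier and engine version are placeholders.
const client = new RDSClient({ region: "us-east-1" });

const response = await client.send(new ModifyDBInstanceCommand({
  DBInstanceIdentifier: "example-postgres-instance",
  EngineVersion: "17.8",          // one of the minor versions in this release
  AutoMinorVersionUpgrade: true,  // opt in to automatic minor version upgrades
  ApplyImmediately: false,        // defer the upgrade to the next maintenance window
}));

console.log("Pending engine version:",
  response.DBInstance?.PendingModifiedValues?.EngineVersion);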
 
Quelle: aws.amazon.com