22 October 2025
AI · Architecture · 10 min
Agentic runtimes: from orchestration to control
Orchestration frameworks give agents capability. Control planes give operators confidence. The gap between them is where production incidents live. A framework for thinking about runtime control in agentic systems.
The dominant framing of agentic systems focuses on capability: what can the agent do, how many tools does it have, how does it plan. This framing is useful for demos and insufficient for production.
Production agentic systems require a second framing: what can the operator see, interrupt, and constrain at runtime.
Orchestration vs. control
Orchestration is the internal logic of an agent: how it selects tools, sequences steps, manages context, and recovers from failures. Most frameworks (LangGraph, CrewAI, AutoGen) are primarily orchestration frameworks.
Control is the external capacity to observe, constrain, and intervene in agent execution from outside the agent's own logic. This is almost always an afterthought.
The distinction matters because agent failures in production are rarely failures of orchestration (the agent didn't pick the right tool). They are failures of control: the agent did something unexpected and nobody could stop it before it propagated.
Four control plane primitives
1. Execution visibility
At any moment, an operator should be able to answer:
- What is the agent currently doing?
- What has it done in the last N steps?
- What tool calls are pending or in flight?
This requires structured emission of execution events — not free-text logs, but typed events with step identifiers, tool call parameters, and outcome records.
2. Interrupt conditions
Rules evaluated at each step that can pause or terminate execution:
- Budget limits (token spend, API call count, wall time)
- Anomaly conditions (unexpected tool call sequence, out-of-distribution inputs)
- Policy violations (tool call to restricted endpoint)
Interrupts should be declarative — defined in configuration, not hardcoded in agent logic — so they can be updated without deploying new agent code.
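One way to make the rules declarative: keep them as plain data evaluated against run state at each step. The rule schema and `RunState` fields below are illustrative; in practice the rules would be loaded from configuration (YAML, a database) so operators can tighten a budget without redeploying agent code:

```python
# Sketch: declarative interrupt rules evaluated once per agent step.
from dataclasses import dataclass

@dataclass
class RunState:
    tokens_spent: int
    api_calls: int
    elapsed_s: float
    last_tool: str

# Pure data, not code: updatable without touching the agent's logic.
INTERRUPT_RULES = [
    {"name": "token_budget",    "metric": "tokens_spent", "max": 50_000},
    {"name": "call_budget",     "metric": "api_calls",    "max": 100},
    {"name": "wall_time",       "metric": "elapsed_s",    "max": 300},
    {"name": "restricted_tool", "deny_tools": {"drop_table", "send_payment"}},
]

def evaluate_interrupts(state: RunState) -> list[str]:
    """Names of all tripped rules; an empty list means execution continues."""
    tripped = []
    for rule in INTERRUPT_RULES:
        if "max" in rule and getattr(state, rule["metric"]) > rule["max"]:
            tripped.append(rule["name"])
        elif "deny_tools" in rule and state.last_tool in rule["deny_tools"]:
            tripped.append(rule["name"])
    return tripped

state = RunState(tokens_spent=62_000, api_calls=12, elapsed_s=40.0,
                 last_tool="search")
print(evaluate_interrupts(state))  # ['token_budget']
```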
3. Approval gates
For high-stakes actions (writes, external API calls, irreversible operations), require explicit human or automated approval before execution:
- Gate triggers defined per tool or action type
- Approval captured as a signed record (who approved, when, with what context)
- Timeout behavior (auto-reject or auto-approve after N seconds, based on action risk level)
4. Replay and attribution
Every completed execution should be replayable: given the same initial state, inputs, and tool outputs, the agent should produce the same trajectory. This requires:
- Deterministic prompt construction (no random context injection)
- Recorded tool outputs at the time of execution
- Captured model responses before any post-processing
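The second requirement, recorded tool outputs, can be sketched as a record-and-replay pair. The `Recorder`/`Replayer` split is an illustrative design, not a specific framework's API:

```python
# Sketch: record tool outputs during a live run, then serve them back
# during replay so the trajectory can be reproduced without side effects.
class Recorder:
    """Live mode: call the real tool and persist its output keyed by step."""
    def __init__(self, real_tools):
        self.real_tools = real_tools
        self.trace = []  # append-only: (step, tool, params, output)

    def call(self, step, tool, params):
        output = self.real_tools[tool](**params)
        self.trace.append({"step": step, "tool": tool,
                           "params": params, "output": output})
        return output

class Replayer:
    """Replay mode: serve recorded outputs instead of calling tools."""
    def __init__(self, trace):
        self.by_step = {t["step"]: t for t in trace}

    def call(self, step, tool, params):
        rec = self.by_step[step]
        assert rec["tool"] == tool and rec["params"] == params, \
            "trajectory diverged from the recorded run"
        return rec["output"]

# Live run records; replay reproduces the same output without re-calling the tool.
recorder = Recorder({"lookup": lambda q: f"result for {q}"})
live = recorder.call(1, "lookup", {"q": "status"})
replay = Replayer(recorder.trace).call(1, "lookup", {"q": "status"})
assert live == replay
```

The assertion in `Replayer.call` is itself useful: if a code change makes the agent request a different tool or different parameters at the same step, replay fails loudly, flagging exactly where the trajectory diverged.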
Replayability is the foundation of post-incident analysis. Without it, "what happened" is a narrative reconstruction, not a verified trace.
The operational contract
A production-grade agentic runtime should expose a contract to operators: these are the things you can observe, these are the things you can constrain, these are the conditions under which the system will stop itself, and these are the records that prove what happened.
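One way to make that contract explicit is as a typed interface the control plane can depend on. This `Protocol` is a hypothetical sketch mapping the four primitives to methods, not the API of any existing framework:

```python
# Hypothetical operator-facing runtime contract: observe, constrain,
# gate, and prove, as a structural interface.
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class AgentRuntime(Protocol):
    def events(self, run_id: str, last_n: int) -> list[dict[str, Any]]:
        """Observe: typed execution events for a run (visibility)."""
        ...

    def set_interrupt_rules(self, rules: list[dict[str, Any]]) -> None:
        """Constrain: swap declarative interrupt conditions at runtime."""
        ...

    def pending_approvals(self, run_id: str) -> list[dict[str, Any]]:
        """Gate: actions blocked awaiting human or automated approval."""
        ...

    def trace(self, run_id: str) -> list[dict[str, Any]]:
        """Prove: the recorded trajectory for replay and attribution."""
        ...
```

Expressing the contract as an interface means any runtime implementing it, regardless of orchestration framework, can be plugged into the same operator tooling.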
Building this contract after the fact is expensive. Orchestration frameworks make it easy to build capable agents; building the control plane alongside them, from the start, is the discipline that separates deployed systems from demos.