Overview
Single agents are limited by context window, specialization depth, and parallelism. Multi-agent systems overcome these limits by routing subtasks to specialized agents. But multi-agent systems introduce new failure modes: lost context, conflicting decisions, infinite loops, and cascading failures.
This skill provides the architecture and coordination patterns to build multi-agent systems that are reliable, observable, and maintainable.
When to Use
- The task requires more context than a single agent can handle
- Different subtasks require different specializations (research, coding, review, security)
- Subtasks can be parallelized for speed
- The workflow is long-running and requires checkpointing
- Different tasks require different levels of human oversight
Process
Step 1: Design the Agent Network
- Define agent responsibilities: Each agent should have a single, well-defined job. Name them by role: researcher, coder, reviewer, security-auditor, tester.
- Define communication topology: Who can talk to whom?
  - Pipeline: Agent A → Agent B → Agent C (sequential)
  - Supervisor: Orchestrator dispatches to specialists (hub-and-spoke)
  - Peer: Agents collaborate as equals (mesh)
- Define data contracts: What does each agent receive? What does it output? Use structured formats (JSON schemas) for inter-agent communication.
- Define the orchestration logic: Who decides which agent acts next?
Verify: You can draw the agent network on a whiteboard with clear roles and data flow.
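
As one way to picture the supervisor topology with explicit data contracts, here is a minimal sketch. The `TaskMessage` and `TaskResult` envelopes, their field names, and the `Supervisor` class are hypothetical names chosen for illustration; a real system might enforce the contract with a JSON Schema validator instead of dataclasses.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable


@dataclass(frozen=True)
class TaskMessage:
    """Hypothetical input contract: what every agent receives."""
    task_id: str
    role: str      # which specialist should handle the message
    payload: dict


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical output contract: what every agent returns."""
    task_id: str
    role: str
    output: dict


Agent = Callable[[TaskMessage], TaskResult]


def make_stub_agent(role: str) -> Agent:
    """Stand-in for a real specialist; it just echoes its payload."""
    def agent(msg: TaskMessage) -> TaskResult:
        return TaskResult(task_id=msg.task_id, role=role,
                          output={"handled": msg.payload})
    return agent


class Supervisor:
    """Hub-and-spoke orchestrator: it alone decides which agent acts."""

    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents

    def dispatch(self, msg: TaskMessage) -> TaskResult:
        if msg.role not in self.agents:
            raise KeyError(f"no agent registered for role {msg.role!r}")
        result = self.agents[msg.role](msg)
        # Serializing confirms the result is plain structured data.
        json.dumps(asdict(result))
        return result
```

Calling `Supervisor({"researcher": make_stub_agent("researcher")}).dispatch(...)` routes a message by role. Because agents never call each other directly, the whiteboard drawing stays a simple star.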
Step 2: Implement Context Management
- Each agent should receive only the context it needs — not the full conversation history.
- Use a shared state store (database, key-value store) for information that multiple agents need.
- Pass summaries, not full transcripts, when context must traverse agent boundaries.
- Include a task ID in every message for tracing.
Verify: No agent receives more context than it requires for its specific task.
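
The context rules above can be sketched as a small filter in front of each agent. The `REQUIRED_KEYS` mapping, the key names, and `build_input` are assumptions made for this example; the shared state is shown as a plain dict standing in for a database or key-value store.

```python
from dataclasses import dataclass

# Hypothetical mapping from agent role to the shared-state keys it needs.
REQUIRED_KEYS = {
    "coder": {"spec_summary", "target_files"},
    "reviewer": {"spec_summary", "diff"},
}


@dataclass
class AgentInput:
    task_id: str   # carried on every message for tracing
    role: str
    context: dict


def build_input(task_id: str, role: str, shared_state: dict) -> AgentInput:
    """Copy only the keys this role needs out of the shared state."""
    wanted = REQUIRED_KEYS[role]
    missing = wanted - set(shared_state)
    if missing:
        raise ValueError(f"shared state lacks {sorted(missing)} for {role}")
    context = {key: shared_state[key] for key in wanted}
    return AgentInput(task_id=task_id, role=role, context=context)
```

An allow-list per role makes the Verify check mechanical: anything not listed, such as a full transcript, cannot reach the agent.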
Step 3: Design for Failure
- Every agent call can fail, so plan for it:
  - Timeout with a defined maximum duration
  - Retry with exponential backoff (max 3 retries)
  - Fallback behavior when retries are exhausted
- Prevent infinite loops: Track call depth. If depth exceeds a limit N (e.g., 10), stop and escalate to human review.
- Checkpointing: For long workflows, save state after each major step so the workflow can be resumed after failure.
- Dead letter queue: Failed tasks that exhaust retries go to a queue for human inspection.
Verify: Failure scenarios are defined for every agent-to-agent call.
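
A minimal sketch of the failure policy above, combining timeout, retry with exponential backoff, fallback, a dead letter queue, and a depth limit. The function `call_agent`, its parameters, and the in-memory `dead_letters` list are hypothetical. The timeout uses a worker thread from the standard library, which abandons a slow call but cannot kill it, so a production system would likely rely on the timeout support of its own agent client.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_RETRIES = 3   # "max 3 retries" from the step above
MAX_DEPTH = 10    # the example value of N from the step above

# Hypothetical in-memory dead letter queue; a real one would be persistent.
dead_letters = []


class DepthExceeded(RuntimeError):
    """Raised so the caller can escalate the task to human review."""


def call_agent(agent, payload, *, depth=0, timeout_s=30.0,
               base_delay_s=1.0, fallback=None, sleep=time.sleep):
    if depth > MAX_DEPTH:
        raise DepthExceeded(f"call depth {depth} exceeds {MAX_DEPTH}")

    last_error = None
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus retries
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            # result() raises TimeoutError if the agent runs too long.
            # The worker thread is abandoned, not terminated.
            return pool.submit(agent, payload).result(timeout=timeout_s)
        except Exception as exc:
            last_error = exc
            if attempt < MAX_RETRIES:
                # Delays double each time: 1s, 2s, 4s with the defaults.
                sleep(base_delay_s * 2 ** attempt)
        finally:
            pool.shutdown(wait=False)

    # Retries exhausted: park the task for human inspection.
    dead_letters.append({"payload": payload, "error": repr(last_error)})
    return fallback
```

Callers pass `depth=depth + 1` whenever one agent invokes another, which is what makes the depth limit effective. Injecting `sleep` keeps the backoff testable without real waiting.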
Step 4: Human-in-the-Loop Checkpoints
- Define which decisions require human approval:
  - Irreversible actions (data deletion, financial transactions, external communications)
  - High-uncertainty states (agents disagree, confidence below threshold)
  - Sensitive operations (PII access, privileged system access)
- Design the human review interface: What information does the reviewer need? What actions can they take?
Verify: At least one human-in-the-loop checkpoint exists for high-risk operations.
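
The approval criteria above could be encoded as a gate that runs before any action executes. Everything named here is an assumption for illustration: the action kinds, the 0.8 confidence threshold, and the `request_approval` callback that stands in for whatever review interface you build.

```python
from dataclasses import dataclass

# Hypothetical action kinds; replace with your system's own vocabulary.
IRREVERSIBLE = {"delete_data", "send_payment", "send_external_email"}
SENSITIVE = {"read_pii", "privileged_access"}
CONFIDENCE_THRESHOLD = 0.8  # assumed value, tune per workflow


@dataclass
class ProposedAction:
    task_id: str
    kind: str
    confidence: float
    agents_agree: bool = True


def review_reasons(action: ProposedAction) -> list[str]:
    """Return why a human must approve; an empty list means none needed."""
    reasons = []
    if action.kind in IRREVERSIBLE:
        reasons.append("irreversible action")
    if action.kind in SENSITIVE:
        reasons.append("sensitive operation")
    if action.confidence < CONFIDENCE_THRESHOLD:
        reasons.append("low confidence")
    if not action.agents_agree:
        reasons.append("agents disagree")
    return reasons


def execute(action, run, request_approval) -> str:
    """Run the action only if no review is needed or a human approves."""
    reasons = review_reasons(action)
    if reasons and not request_approval(action, reasons):
        return "rejected"
    run(action)
    return "executed"
```

Passing the list of reasons to the reviewer addresses the interface question above: the human sees why the action was flagged, not just that it was.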
Step 5: Observability
- Log every agent invocation: inputs, outputs, duration, token usage, errors.
- Implement distributed tracing across the agent network (trace ID propagated through all calls).
- Dashboard: agent activity, success/failure rates, latency, token consumption.
- Alerts: agent down, retry rate spike, context overflow, unexpected output patterns.
Verify: You can trace any specific task's full execution path across all agents from logs alone.
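
One possible shape for invocation logging with a propagated trace ID is a decorator around each agent function. The `traced` decorator, the logger name `agents`, and the record fields are hypothetical. Token usage is omitted because it depends on what your model client reports.

```python
import json
import logging
import time
import uuid
from functools import wraps

logger = logging.getLogger("agents")


def traced(agent_name):
    """Log one structured record per invocation of the wrapped agent."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(payload, *, trace_id=None):
            # Reuse the caller's trace ID, or start a new trace at the root.
            trace_id = trace_id or uuid.uuid4().hex
            start = time.perf_counter()
            record = {"trace_id": trace_id, "agent": agent_name,
                      "input": payload}
            try:
                output = fn(payload, trace_id=trace_id)
                record["output"] = output
                return output
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                elapsed = time.perf_counter() - start
                record["duration_ms"] = round(elapsed * 1000, 3)
                logger.info(json.dumps(record, default=str))
        return wrapper
    return decorator
```

An agent that calls another agent forwards its own `trace_id`, so filtering the logs by that one ID reconstructs the full execution path of a task.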
Common Rationalizations (and Rebuttals)
| Excuse | Rebuttal |
|---|---|
| "One agent is simpler" | Until it hits context limits, fails silently, or produces wrong results. Multi-agent is the right tool for complex tasks. |
| "We'll add observability later" | Multi-agent systems without observability are black boxes. Debug them in production — I dare you. |
| "Agents are smart, they'll figure it out" | Agents are tools. They need clear roles, contracts, and failure boundaries. |
| "The happy path works fine" | Multi-agent systems fail in complex ways. Design for failure from day one. |
Red Flags
- Agents pass full conversation history to other agents (context bloat)
- No timeout defined for any agent call
- Agents can call each other recursively without depth limits
- No human approval required for irreversible actions
- No distributed tracing across agent boundaries
- Agents making conflicting state changes with no conflict resolution
Verification
- Agent network designed with clear roles and data contracts
- Context is minimized at each agent boundary
- Failure handling (timeout, retry, fallback) defined for every agent call
- Infinite loop prevention via call depth limits
- Human-in-the-loop checkpoints for high-risk operations
- Distributed tracing implemented across agents
- End-to-end test of failure scenarios