Cortivex Multi-Agent Orchestration
You are an orchestration controller that manages multi-agent Cortivex pipelines. This skill covers how to use SwarmCoordinator and AgentMonitor nodes to run parallel agent workloads with leader election, health monitoring, and automatic recovery.
Overview
Standard Cortivex pipelines execute nodes sequentially or in simple parallel DAGs. Orchestration extends this by introducing coordination primitives that allow agents to self-organize, monitor each other, and recover from failures without manual intervention.
The two primary orchestration node types are:
- SwarmCoordinator -- Manages a pool of worker agents, assigns tasks from a shared queue, elects a leader for coordination, and handles scaling.
- AgentMonitor -- Continuously tracks agent health via heartbeats, token consumption, and progress signals. Triggers recovery actions when agents stall or die.
When to Use
- Running more than 3 agents in parallel against the same repository
- Long-running pipelines (over 10 minutes) where agent failures are likely
- Workloads that require dynamic task assignment rather than fixed DAG ordering
- Pipelines where agents share state and need a coordinator to prevent conflicts
- Situations requiring automatic recovery when agents crash or exhaust their context window
How It Works
Orchestration Lifecycle
Phase 1: Bootstrap
- SwarmCoordinator starts and either elects itself leader (single-node) or runs an election (multi-node)
- AgentMonitor begins heartbeat tracking

Phase 2: Agent Pool
- SwarmCoordinator spawns the configured number of worker agents
- Each agent registers with the monitor and begins accepting tasks

Phase 3: Task Distribution
- SwarmCoordinator pulls tasks from the pipeline queue
- Tasks are assigned to idle agents based on priority and agent capability

Phase 4: Monitoring
- AgentMonitor checks heartbeats every 15 seconds
- Token consumption is tracked per agent
- Stalled agents are flagged after 2 missed heartbeats (30s)

Phase 5: Recovery
- Dead agents are removed from the pool
- Their in-progress tasks are requeued with status "ready"
- A replacement agent is spawned if the pool drops below the configured minimum

Phase 6: Completion
- When all tasks reach "done" status, the coordinator collects results
- AgentMonitor produces a health report
- Pipeline returns aggregated output
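
Phase 3 can be pictured as a priority queue drained into idle agents. The following is a minimal Python sketch of that idea, not Cortivex's actual scheduler: the `assign_tasks` function, the tuple layout for tasks, the capability sets, and the rule that a higher priority number is served first are all assumptions made for illustration.

```python
import heapq
from itertools import count


def assign_tasks(tasks, idle_agents):
    """Assign queued tasks to idle agents, highest priority first.

    tasks:       list of (priority, task_name, required_capability) tuples
    idle_agents: dict mapping agent id -> set of capabilities
    Returns (assignments, leftover): assignments maps agent id -> task name,
    leftover lists tasks that found no capable idle agent.
    """
    order = count()  # tie-breaker: equal priorities keep insertion order
    heap = [(-prio, next(order), name, needs) for prio, name, needs in tasks]
    heapq.heapify(heap)  # min-heap on negated priority == max-priority first

    free = dict(idle_agents)
    assignments, leftover = {}, []
    while heap:
        _, _, name, needs = heapq.heappop(heap)
        agent = next((a for a, caps in free.items() if needs in caps), None)
        if agent is None:
            leftover.append(name)  # stays queued until a capable agent is idle
            continue
        assignments[agent] = name
        del free[agent]  # each idle agent takes at most one task
    return assignments, leftover


# Hypothetical tasks and agents, named after the examples in this document.
tasks = [(5, "code_review", "review"),
         (8, "security_scan", "security"),
         (3, "bug_hunt", "review")]
agents = {"agent-worker-1": {"security"}, "agent-worker-2": {"review"}}
assignments, leftover = assign_tasks(tasks, agents)
```

With two idle agents and three tasks, the lowest-priority task remains queued until an agent with the matching capability becomes idle again.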
Agent States
| State | Meaning | Monitor Action |
|---|---|---|
| `idle` | Agent is registered and waiting for work | None -- agent is healthy |
| `working` | Agent is actively processing a task | Track progress and token usage |
| `stalled` | Agent missed 2+ heartbeats while working | Send ping; if no response in 15s, mark dead |
| `dead` | Agent process terminated or unresponsive | Requeue tasks, spawn replacement |
| `rotating` | Agent is near context limit, finishing current task | Do not assign new tasks; replace after completion |
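
The `working` to `stalled` transition follows from the heartbeat interval. As a rough sketch (the function names are hypothetical; the 15-second interval and the two-heartbeat limit come from the lifecycle description), the check could look like this in Python:

```python
HEARTBEAT_INTERVAL_SECONDS = 15  # monitor check interval from Phase 4
MISSED_BEATS_BEFORE_STALL = 2    # "2 missed heartbeats (30s)"


def missed_heartbeats(seconds_since_last_heartbeat):
    """Number of whole heartbeat intervals that passed without a heartbeat."""
    return int(seconds_since_last_heartbeat // HEARTBEAT_INTERVAL_SECONDS)


def state_after_check(state, seconds_since_last_heartbeat):
    """Return the agent state after one monitor check.

    Only the working -> stalled transition is modelled, because the table
    defines `stalled` for agents that miss heartbeats while working.
    """
    if (state == "working"
            and missed_heartbeats(seconds_since_last_heartbeat)
            >= MISSED_BEATS_BEFORE_STALL):
        return "stalled"
    return state
```

An agent that last reported 29 seconds ago is still `working`; at 30 seconds it has missed two intervals and becomes `stalled`.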
Token Budget Management
| Token Range | Status | Coordinator Action |
|---|---|---|
| 0 -- 50K | Healthy | Normal task assignment |
| 50K -- 80K | Caution | Assign only short tasks |
| 80K -- 95K | Warning | Finish current task, then rotate agent |
| 95K+ | Critical | Kill agent, requeue task, spawn replacement |
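
The table above translates directly into a lookup function. This Python sketch is illustrative only: `token_budget_action` is a hypothetical name, and because the ranges in the table share their endpoints, it assumes each boundary value belongs to the more severe tier (so 80,000 tokens is already "Warning", matching `token_rotation_threshold: 80000` in the example configuration).

```python
def token_budget_action(tokens_used):
    """Map an agent's token consumption to (status, coordinator action)."""
    if tokens_used < 50_000:
        return ("Healthy", "normal task assignment")
    if tokens_used < 80_000:
        return ("Caution", "assign only short tasks")
    if tokens_used < 95_000:
        return ("Warning", "finish current task, then rotate agent")
    return ("Critical", "kill agent, requeue task, spawn replacement")


status, action = token_budget_action(34_200)
```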
Pipeline Configuration
Basic Orchestrated Pipeline
name: orchestrated-review
version: "1.0"
description: Multi-agent code review with health monitoring
orchestration:
  mode: swarm
  min_agents: 2
  max_agents: 5
  auto_scale: true
nodes:
  - id: coordinator
    type: SwarmCoordinator
    config:
      pool_size: 3
      runtime: auto
      task_strategy: priority-queue
      heartbeat_interval_seconds: 15
      heartbeat_timeout_seconds: 30
      token_rotation_threshold: 80000
      on_agent_death: respawn
      on_all_complete: collect_results
  - id: monitor
    type: AgentMonitor
    depends_on: [coordinator]
    config:
      check_interval_seconds: 15
      stall_threshold_seconds: 60
      token_alert_threshold: 80000
      auto_recovery: true
      report_on_complete: true
  - id: security_scan
    type: SecurityScanner
    depends_on: [coordinator]
    config:
      scan_depth: deep
      managed_by: coordinator
  - id: code_review
    type: CodeReviewer
    depends_on: [coordinator]
    config:
      review_scope: changed_files
      managed_by: coordinator
  - id: bug_hunt
    type: BugHunter
    depends_on: [coordinator]
    config:
      hunt_scope: changed_files
      managed_by: coordinator
  - id: collect
    type: Orchestrator
    depends_on: [security_scan, code_review, bug_hunt]
    config:
      strategy: fan-in
      collect_results: true
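
Before running a pipeline like this, it can help to check that the configuration is internally consistent. The Python sketch below is not part of Cortivex; `check_pipeline` is a hypothetical helper, and the rules it applies (every `depends_on` and `managed_by` must name an existing node, and `pool_size` must lie between `min_agents` and `max_agents`) are assumptions about what a sensible configuration looks like. The pipeline is written as a plain dict mirroring a subset of the YAML above.

```python
def check_pipeline(pipeline):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    orch = pipeline.get("orchestration", {})
    nodes = {node["id"]: node for node in pipeline.get("nodes", [])}

    for node_id, node in nodes.items():
        config = node.get("config", {})
        # Every dependency must refer to a node defined in this pipeline.
        for dep in node.get("depends_on", []):
            if dep not in nodes:
                problems.append(f"{node_id}: depends_on unknown node {dep!r}")
        # A managed node must point at an existing coordinator node.
        manager = config.get("managed_by")
        if manager is not None and manager not in nodes:
            problems.append(f"{node_id}: managed_by unknown node {manager!r}")
        # Assumed rule: the initial pool fits inside the scaling bounds.
        if node.get("type") == "SwarmCoordinator" and "pool_size" in config:
            pool = config["pool_size"]
            low = orch.get("min_agents", 1)
            high = orch.get("max_agents", pool)
            if not low <= pool <= high:
                problems.append(
                    f"{node_id}: pool_size {pool} outside [{low}, {high}]")
    return problems


pipeline = {
    "orchestration": {"min_agents": 2, "max_agents": 5},
    "nodes": [
        {"id": "coordinator", "type": "SwarmCoordinator",
         "config": {"pool_size": 3}},
        {"id": "code_review", "type": "CodeReviewer",
         "depends_on": ["coordinator"],
         "config": {"managed_by": "coordinator"}},
    ],
}
problems = check_pipeline(pipeline)
```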
Long-Running Server Mode
For pipelines that run continuously or handle streaming workloads, use cortivex serve:
name: continuous-review-server
version: "1.0"
description: Persistent review server that processes incoming tasks
orchestration:
  mode: server
  port: 9100
  min_agents: 2
  max_agents: 8
  idle_timeout_minutes: 30
nodes:
  - id: coordinator
    type: SwarmCoordinator
    config:
      pool_size: 3
      runtime: auto
      task_strategy: priority-queue
      accept_external_tasks: true
      api_endpoint: /api/tasks
  - id: monitor
    type: AgentMonitor
    depends_on: [coordinator]
    config:
      check_interval_seconds: 10
      auto_recovery: true
      metrics_endpoint: /api/metrics
Start the server:
cortivex serve --port 9100 --agents 3 --runtime auto
Submit tasks to the running server:
cortivex_run({
  pipeline: "continuous-review-server",
  params: {
    task: "Review PR #42",
    priority: 8
  }
})
Running Orchestrated Pipelines
Using MCP Tools
cortivex_run({
  pipeline: "orchestrated-review",
  repo: "/path/to/repo",
  options: {
    verbose: true,
    max_parallel: 5,
    timeout_minutes: 30,
    on_failure: "retry"
  }
})
Using CLI
/cortivex run orchestrated-review --repo /path/to/repo --agents 3 --monitor --verbose
Monitoring an Orchestrated Run
/cortivex status ctx-a1b2c3 --agents --health
Output:
Pipeline: orchestrated-review (run_id: ctx-a1b2c3)
Mode: swarm | Leader: agent-coordinator-1
============================================
Agent Pool (3/5):
  agent-worker-1  [WORKING]  task: security_scan  tokens: 12,400  health: OK
  agent-worker-2  [WORKING]  task: code_review    tokens: 34,200  health: OK
  agent-worker-3  [IDLE]     waiting for task     tokens: 8,100   health: OK
Tasks:
  [1/4] SecurityScanner  [RUNNING]  assigned to: agent-worker-1
  [2/4] CodeReviewer     [RUNNING]  assigned to: agent-worker-2
  [3/4] BugHunter        [READY]    next assignment: agent-worker-3
  [4/4] Orchestrator     [WAITING]  depends on: 1, 2, 3
Monitor:
  Heartbeats: all responding (last check: 3s ago)
  Deaths: 0 | Recoveries: 0 | Rotations: 0
Total tokens: 54,700 | Estimated cost: $0.014
Recovery Scenarios
Agent Death
When an agent process terminates unexpectedly:
- AgentMonitor detects missed heartbeats (30s timeout)
- Agent is marked `dead` in the registry
- In-progress task is set back to `ready` status
- SwarmCoordinator spawns a replacement agent
- Replacement agent picks up the requeued task
- A `recovery` event is broadcast to all nodes
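
The recovery steps can be sketched as one function over an in-memory pool and task table. This Python example is a simplified illustration under stated assumptions: `recover_dead_agent`, the dict layouts, and the `spawn` callback are hypothetical, the `running` status name is taken from the status output shown earlier, and the replacement is spawned only when the pool falls below the minimum, following Phase 5 of the lifecycle.

```python
def recover_dead_agent(pool, tasks, dead_id, min_agents, spawn):
    """Remove a dead agent, requeue its work, and refill the pool if needed.

    pool:  dict mapping agent id -> state
    tasks: dict mapping task name -> {"status": str, "agent": str or None}
    spawn: callable returning the id of a newly started agent
    Returns a dict describing the recovery event to broadcast.
    """
    pool.pop(dead_id, None)  # drop the dead agent from the registry

    requeued = []
    for name, task in tasks.items():
        if task["agent"] == dead_id and task["status"] == "running":
            task["status"] = "ready"  # available for reassignment
            task["agent"] = None
            requeued.append(name)

    spawned = None
    if len(pool) < min_agents:  # Phase 5: refill only below the minimum
        spawned = spawn()
        pool[spawned] = "idle"

    return {"event": "recovery", "dead": dead_id,
            "requeued": requeued, "spawned": spawned}


pool = {"agent-worker-1": "working", "agent-worker-2": "idle"}
tasks = {
    "security_scan": {"status": "running", "agent": "agent-worker-1"},
    "code_review": {"status": "ready", "agent": None},
}
event = recover_dead_agent(pool, tasks, "agent-worker-1",
                           min_agents=2, spawn=lambda: "agent-worker-3")
```

After the call, `security_scan` is back in `ready` status and the pool again holds two agents, one of them freshly spawned.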
Context Window Exhaustion
When an agent approaches its token limit:
- AgentMonitor detects token usage above the rotation threshold
- Agent is marked `rotating` -- no new tasks are assigned
- Agent completes its current task
- Agent is gracefully terminated
- A fresh agent is spawned in its place
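
A minimal Python sketch of this rotation flow follows. The `Agent` class and the `record_tokens` and `finish_task` helpers are hypothetical names introduced for illustration; the 80,000-token threshold is the `token_rotation_threshold` from the example configuration. The sketch only covers agents that cross the threshold while working.

```python
from dataclasses import dataclass

ROTATION_THRESHOLD = 80_000  # token_rotation_threshold in the example config


@dataclass
class Agent:
    agent_id: str
    state: str = "idle"
    tokens: int = 0


def record_tokens(agent, tokens):
    """Update token usage and mark a working agent for rotation if needed."""
    agent.tokens = tokens
    if agent.state == "working" and tokens >= ROTATION_THRESHOLD:
        agent.state = "rotating"  # coordinator stops assigning new tasks


def finish_task(agent, spawn):
    """Return the agent that occupies this pool slot after the task ends."""
    if agent.state == "rotating":
        return spawn()  # old agent is retired, a fresh one takes its place
    agent.state = "idle"
    return agent


worker = Agent("agent-worker-2", state="working", tokens=34_200)
record_tokens(worker, 82_000)
replacement = finish_task(worker, spawn=lambda: Agent("agent-worker-4"))
```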
Stalled Agent
When an agent stops making progress but has not crashed:
- AgentMonitor detects no progress updates for the stall threshold period
- A ping is sent to the agent
- If the agent responds, monitoring continues
- If no response within 15 seconds, the agent is treated as dead
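
The stall check differs from heartbeat tracking in that it looks at progress, then gives the agent one chance to answer a ping. The Python sketch below assumes a hypothetical `ping` callable that returns `True` if the agent answers within the given timeout; the 60-second default mirrors `stall_threshold_seconds` in the example configuration, and the 15-second ping timeout comes from the steps above.

```python
def check_stall(seconds_since_progress, ping,
                stall_threshold=60, ping_timeout=15):
    """Decide whether an agent that may have stalled is still usable.

    ping: callable taking a timeout in seconds and returning True if the
          agent responded in time (hypothetical interface).
    Returns "working" if monitoring should continue, "dead" otherwise.
    """
    if seconds_since_progress < stall_threshold:
        return "working"  # recent progress, nothing to do
    if ping(ping_timeout):
        return "working"  # slow but responsive, keep monitoring
    return "dead"         # hand over to the agent-death recovery path


# Simulated agents: one answers the ping, one does not.
responsive = check_stall(75, ping=lambda timeout: True)
unresponsive = check_stall(75, ping=lambda timeout: False)
```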