Cortivex Multi-Agent Orchestration

You are an orchestration controller that manages multi-agent Cortivex pipelines. This skill covers how to use SwarmCoordinator and AgentMonitor nodes to run parallel agent workloads with leader election, health monitoring, and automatic recovery.

Overview

Standard Cortivex pipelines execute nodes sequentially or in simple parallel DAGs. Orchestration extends this by introducing coordination primitives that allow agents to self-organize, monitor each other, and recover from failures without manual intervention.

The two primary orchestration node types are:

SwarmCoordinator -- Manages a pool of worker agents, assigns tasks from a shared queue, elects a leader for coordination, and handles scaling.
AgentMonitor -- Continuously tracks agent health via heartbeats, token consumption, and progress signals. Triggers recovery actions when agents stall or die.

When to Use

Running more than 3 agents in parallel against the same repository
Long-running pipelines (over 10 minutes) where agent failures are likely
Workloads that require dynamic task assignment rather than fixed DAG ordering
Pipelines where agents share state and need a coordinator to prevent conflicts
Situations requiring automatic recovery when agents crash or exhaust their context window

How It Works

Orchestration Lifecycle

Phase 1: Bootstrap
  SwarmCoordinator starts, elects itself leader (single-node) or runs an election (multi-node)
  AgentMonitor begins heartbeat tracking

Phase 2: Agent Pool
  SwarmCoordinator spawns the configured number of worker agents
  Each agent registers with the monitor and begins accepting tasks

Phase 3: Task Distribution
  SwarmCoordinator pulls tasks from the pipeline queue
  Tasks are assigned to idle agents based on priority and agent capability

Phase 4: Monitoring
  AgentMonitor checks heartbeats every 15 seconds
  Token consumption is tracked per agent
  Stalled agents are flagged after 2 missed heartbeats (30s)

Phase 5: Recovery
  Dead agents are removed from the pool
  Their in-progress tasks are requeued with status "ready"
  A replacement agent is spawned if the pool drops below the configured minimum

Phase 6: Completion
  When all tasks reach "done" status, the coordinator collects results
  AgentMonitor produces a health report
  Pipeline returns aggregated output
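
As a rough illustration of Phase 3, the sketch below hands the highest-priority task to the first idle agent that can handle it. It is not Cortivex's implementation: the `assign_tasks` function, the `capabilities` field, and the rule that a larger number means higher priority are all assumptions made for this example.

```python
import heapq
from itertools import count


def assign_tasks(tasks, agents):
    """Match queued tasks to idle agents, highest priority first.

    tasks:  list of (priority, task_name, required_capability) tuples.
    agents: dict of agent_id -> {"state": str, "capabilities": set}.
    Both shapes are hypothetical; Cortivex's real data model may differ.
    """
    seq = count()  # tie-breaker so equal priorities keep submission order
    heap = [(-priority, next(seq), name, cap) for priority, name, cap in tasks]
    heapq.heapify(heap)

    idle = [agent_id for agent_id, info in agents.items() if info["state"] == "idle"]
    assignments = {}
    deferred = []  # tasks that no idle agent is capable of running right now

    while heap and idle:
        _, _, name, cap = heapq.heappop(heap)
        match = next((a for a in idle if cap in agents[a]["capabilities"]), None)
        if match is None:
            deferred.append(name)
            continue
        idle.remove(match)
        assignments[name] = match

    # Whatever is left stays queued for the next assignment round.
    remaining = deferred + [name for _, _, name, _ in sorted(heap)]
    return assignments, remaining
```

Tasks that cannot be placed are returned rather than dropped, so the caller can retry them once an agent becomes idle.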

Agent States

State	Meaning	Monitor Action
`idle`	Agent is registered and waiting for work	None -- agent is healthy
`working`	Agent is actively processing a task	Track progress and token usage
`stalled`	Agent missed 2+ heartbeats while working	Send ping; if no response in 15s, mark dead
`dead`	Agent process terminated or unresponsive	Requeue tasks, spawn replacement
`rotating`	Agent is near context limit, finishing current task	Do not assign new tasks; replace after completion
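
The state table can be read as a small lookup plus one heartbeat rule. The sketch below is an illustration under assumptions, not the AgentMonitor source: timestamps are plain seconds, the function names are invented, and agents that are not `working` simply keep their state because the table only defines `stalled` for working agents.

```python
HEARTBEAT_INTERVAL_SECONDS = 15   # check interval from the lifecycle description
MISSED_BEATS_BEFORE_STALL = 2     # "missed 2+ heartbeats while working"

# Monitor action per state, copied from the table above.
MONITOR_ACTION = {
    "idle": "none",
    "working": "track progress and token usage",
    "stalled": "send ping; mark dead if no response in 15s",
    "dead": "requeue tasks, spawn replacement",
    "rotating": "do not assign new tasks; replace after completion",
}


def missed_heartbeats(last_heartbeat, now, interval=HEARTBEAT_INTERVAL_SECONDS):
    """Number of whole heartbeat intervals elapsed since the last beat."""
    return max(0, int((now - last_heartbeat) // interval))


def classify(state, last_heartbeat, now):
    """Flag a working agent as stalled once it has missed enough heartbeats."""
    if state == "working" and missed_heartbeats(last_heartbeat, now) >= MISSED_BEATS_BEFORE_STALL:
        return "stalled"
    return state
```

With a 15-second interval, a working agent whose last heartbeat is 30 seconds old is classified as `stalled`, which matches the 30s figure given for Phase 4.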

Token Budget Management

Token Range	Status	Coordinator Action
0 -- 50K	Healthy	Normal task assignment
50K -- 80K	Caution	Assign only short tasks
80K -- 95K	Warning	Finish current task, then rotate agent
95K+	Critical	Kill agent, requeue task, spawn replacement
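
A minimal sketch of the budget table as a lookup function. The table's ranges share their endpoints (50K, 80K, 95K), so this example assumes each boundary value belongs to the higher tier; that choice, like the function name `token_status`, is an assumption rather than documented behavior.

```python
def token_status(tokens_used):
    """Map an agent's token count to (status, coordinator action)."""
    if tokens_used < 0:
        raise ValueError("tokens_used must be non-negative")
    if tokens_used < 50_000:
        return ("healthy", "normal task assignment")
    if tokens_used < 80_000:
        return ("caution", "assign only short tasks")
    if tokens_used < 95_000:
        return ("warning", "finish current task, then rotate agent")
    return ("critical", "kill agent, requeue task, spawn replacement")
```

Under this reading, an agent at exactly 80,000 tokens is already in the Warning tier, which lines up with a `token_rotation_threshold` of 80000.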

Pipeline Configuration

Basic Orchestrated Pipeline

name: orchestrated-review
version: "1.0"
description: Multi-agent code review with health monitoring
orchestration:
  mode: swarm
  min_agents: 2
  max_agents: 5
  auto_scale: true

nodes:
  - id: coordinator
    type: SwarmCoordinator
    config:
      pool_size: 3
      runtime: auto
      task_strategy: priority-queue
      heartbeat_interval_seconds: 15
      heartbeat_timeout_seconds: 30
      token_rotation_threshold: 80000
      on_agent_death: respawn
      on_all_complete: collect_results

  - id: monitor
    type: AgentMonitor
    depends_on: [coordinator]
    config:
      check_interval_seconds: 15
      stall_threshold_seconds: 60
      token_alert_threshold: 80000
      auto_recovery: true
      report_on_complete: true

  - id: security_scan
    type: SecurityScanner
    depends_on: [coordinator]
    config:
      scan_depth: deep
      managed_by: coordinator

  - id: code_review
    type: CodeReviewer
    depends_on: [coordinator]
    config:
      review_scope: changed_files
      managed_by: coordinator

  - id: bug_hunt
    type: BugHunter
    depends_on: [coordinator]
    config:
      hunt_scope: changed_files
      managed_by: coordinator

  - id: collect
    type: Orchestrator
    depends_on: [security_scan, code_review, bug_hunt]
    config:
      strategy: fan-in
      collect_results: true
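
Before running a pipeline like the one above, it can help to sanity-check the definition. The sketch below works on a plain dictionary shaped like the YAML (parsing is left out). The checks themselves are assumptions for illustration: nothing in this document says Cortivex enforces them, and in particular the rule that `pool_size` should fall between `min_agents` and `max_agents` is a guess.

```python
def validate_pipeline(pipeline):
    """Return a list of problems found; an empty list means none were found."""
    errors = []
    orch = pipeline.get("orchestration", {})
    lo, hi = orch.get("min_agents"), orch.get("max_agents")
    if lo is not None and hi is not None and lo > hi:
        errors.append(f"min_agents ({lo}) exceeds max_agents ({hi})")

    nodes = pipeline.get("nodes", [])
    ids = [node["id"] for node in nodes]
    for dup in sorted({i for i in ids if ids.count(i) > 1}):
        errors.append(f"duplicate node id: {dup}")

    for node in nodes:
        config = node.get("config", {})
        # Every depends_on and managed_by entry should name an existing node.
        for dep in node.get("depends_on", []):
            if dep not in ids:
                errors.append(f"{node['id']}: unknown dependency {dep}")
        manager = config.get("managed_by")
        if manager is not None and manager not in ids:
            errors.append(f"{node['id']}: unknown manager {manager}")
        # Assumed rule: the initial pool should sit inside the scaling bounds.
        pool = config.get("pool_size")
        if pool is not None and lo is not None and hi is not None and not lo <= pool <= hi:
            errors.append(f"{node['id']}: pool_size {pool} outside {lo}..{hi}")
    return errors


example = {
    "orchestration": {"min_agents": 2, "max_agents": 5},
    "nodes": [
        {"id": "coordinator", "type": "SwarmCoordinator", "config": {"pool_size": 3}},
        {"id": "monitor", "type": "AgentMonitor", "depends_on": ["coordinator"]},
        {"id": "security_scan", "type": "SecurityScanner",
         "depends_on": ["coordinator"], "config": {"managed_by": "coordinator"}},
    ],
}
print(validate_pipeline(example))  # prints []
```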

Long-Running Server Mode

For pipelines that run continuously or handle streaming workloads, define the pipeline in server mode and start it with cortivex serve:

name: continuous-review-server
version: "1.0"
description: Persistent review server that processes incoming tasks
orchestration:
  mode: server
  port: 9100
  min_agents: 2
  max_agents: 8
  idle_timeout_minutes: 30

nodes:
  - id: coordinator
    type: SwarmCoordinator
    config:
      pool_size: 3
      runtime: auto
      task_strategy: priority-queue
      accept_external_tasks: true
      api_endpoint: /api/tasks

  - id: monitor
    type: AgentMonitor
    depends_on: [coordinator]
    config:
      check_interval_seconds: 10
      auto_recovery: true
      metrics_endpoint: /api/metrics

Start the server:

cortivex serve --port 9100 --agents 3 --runtime auto

Submit tasks to the running server:

cortivex_run({
  pipeline: "continuous-review-server",
  params: {
    task: "Review PR #42",
    priority: 8
  }
})

Running Orchestrated Pipelines

Using MCP Tools

cortivex_run({
  pipeline: "orchestrated-review",
  repo: "/path/to/repo",
  options: {
    verbose: true,
    max_parallel: 5,
    timeout_minutes: 30,
    on_failure: "retry"
  }
})

Using CLI

/cortivex run orchestrated-review --repo /path/to/repo --agents 3 --monitor --verbose

Monitoring an Orchestrated Run

/cortivex status ctx-a1b2c3 --agents --health

Example output:

Pipeline: orchestrated-review (run_id: ctx-a1b2c3)
Mode: swarm | Leader: agent-coordinator-1
============================================
Agent Pool (3/5):
  agent-worker-1   [WORKING]  task: security_scan   tokens: 12,400  health: OK
  agent-worker-2   [WORKING]  task: code_review      tokens: 34,200  health: OK
  agent-worker-3   [IDLE]     waiting for task        tokens: 8,100   health: OK

Tasks:
  [1/4] SecurityScanner    [RUNNING]   assigned to: agent-worker-1
  [2/4] CodeReviewer       [RUNNING]   assigned to: agent-worker-2
  [3/4] BugHunter          [READY]     next assignment: agent-worker-3
  [4/4] Orchestrator       [WAITING]   depends on: 1, 2, 3

Monitor:
  Heartbeats: all responding (last check: 3s ago)
  Deaths: 0 | Recoveries: 0 | Rotations: 0
  Total tokens: 54,700 | Estimated cost: $0.014

Recovery Scenarios

Agent Death

When an agent process terminates unexpectedly:

AgentMonitor detects missed heartbeats (30s timeout)
Agent is marked dead in the registry
In-progress task is set back to ready status
SwarmCoordinator spawns a replacement agent
Replacement agent picks up the requeued task
A recovery event is broadcast to all nodes
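
The agent-death steps above can be sketched as a single recovery function. This is a hedged illustration, not the coordinator's code: the dictionaries, the lowercase `running` status, and the `spawn` callback are hypothetical stand-ins. It follows the lifecycle rule that a replacement is spawned when the pool drops below the configured minimum, and it omits the broadcast step.

```python
def recover_dead_agent(dead_id, pool, tasks, min_agents, spawn):
    """Remove a dead agent, requeue its work, and refill the pool.

    pool:  dict of agent_id -> state (mutated in place).
    tasks: dict of task_name -> {"status": str, "agent": str or None}.
    spawn: hypothetical callable that starts an agent and returns its id.
    """
    pool.pop(dead_id, None)  # drop the dead agent from the registry

    requeued = []
    for name, task in tasks.items():
        if task["agent"] == dead_id and task["status"] == "running":
            task["status"] = "ready"   # back on the queue for reassignment
            task["agent"] = None
            requeued.append(name)

    spawned = []
    while len(pool) < min_agents:      # refill only up to the minimum
        new_id = spawn()
        pool[new_id] = "idle"
        spawned.append(new_id)
    return requeued, spawned
```

Returning the requeued task names and new agent ids gives the caller what it needs to emit a recovery event afterwards.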

Context Window Exhaustion

When an agent approaches its token limit:

AgentMonitor detects token usage above the rotation threshold
Agent is marked rotating -- no new tasks are assigned
Agent completes its current task
Agent is gracefully terminated
A fresh agent is spawned in its place

Stalled Agent

When an agent stops making progress but has not crashed:

AgentMonitor detects no progress updates for the stall threshold period
A ping is sent to the agent
If the agent responds, monitoring continues
If no response within 15 seconds, the agent is treated as dead
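
A minimal sketch of the stall check, assuming timestamps in seconds and a hypothetical `ping` callable that returns True if the agent answers within the given timeout. The 60-second default mirrors `stall_threshold_seconds` in the basic pipeline configuration; the function name and return values are invented for this example.

```python
def check_stall(last_progress, now, ping,
                stall_threshold_seconds=60, ping_timeout_seconds=15):
    """Decide what to do with an agent that may have stopped making progress."""
    if now - last_progress < stall_threshold_seconds:
        return "working"            # recent progress, nothing to do
    if ping(ping_timeout_seconds):
        return "working"            # agent answered, keep monitoring
    return "dead"                   # no answer in time, treat as dead
```

An agent that answers the ping stays in the pool even though it has reported no progress, so a slow but responsive agent is not killed by this check alone.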
