Overview
Observability is the ability to understand the internal state of a system from its external outputs. For AI systems this is especially critical: agents make decisions that are hard to interpret without detailed telemetry.
The three pillars: Logs (what happened), Traces (how long and where), Metrics (aggregate health).
When to Use
- Before deploying any new service to production
- When adding AI agent capabilities to an existing system
- When debugging production issues
- When designing multi-agent pipelines
Process
Step 1: Structured Logging
- All logs must be structured (JSON, not free text). Fields: timestamp, level, service, traceId, message, context.
- Log levels used correctly:
  - ERROR: Something failed that requires immediate attention
  - WARN: Something unexpected happened but the system recovered
  - INFO: Normal significant events (requests received, jobs completed)
  - DEBUG: Detailed diagnostic information (off in production by default)
- Never log secrets, PII, or auth tokens.
- For AI systems, log: prompt inputs (sanitized), model outputs, token counts, latency, model version.
Verify: Logs are structured JSON. No secrets in logs. AI interactions logged.
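The logging rules in this step can be sketched with Python's standard `logging` module. This is a minimal illustration, not a prescribed implementation: the names `JsonFormatter`, `build_logger`, `log_ai_interaction`, and `sanitize`, the redaction patterns, and the keys inside `context` are all assumptions made for the example.

```python
import json
import logging
import re
import sys
from datetime import datetime, timezone

# Hypothetical redaction patterns; a real service would maintain its own list.
SECRET_PATTERNS = [
    re.compile(r"(?i)bearer\s+[a-z0-9._\-]+"),
    re.compile(r"(?i)(api[_-]?key|password|token)=\S+"),
]


def sanitize(text):
    """Replace anything that matches a secret pattern with a placeholder."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the fields listed in Step 1."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "traceId": getattr(record, "traceId", None),
            "message": sanitize(record.getMessage()),
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry, default=str)


def build_logger(service, stream=None):
    """Create a logger that writes one JSON line per event (stderr by default)."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)  # DEBUG stays off unless explicitly enabled
    logger.propagate = False
    return logger


def log_ai_interaction(logger, trace_id, prompt, output, prompt_tokens,
                       completion_tokens, latency_ms, model_version):
    """Log one model call with sanitized input/output and usage figures."""
    logger.info(
        "model call completed",
        extra={
            "traceId": trace_id,
            "context": {
                "prompt": sanitize(prompt),
                "output": sanitize(output),
                "promptTokens": prompt_tokens,
                "completionTokens": completion_tokens,
                "latencyMs": latency_ms,
                "modelVersion": model_version,
            },
        },
    )


if __name__ == "__main__":
    demo = build_logger("demo-agent-service", stream=sys.stdout)
    log_ai_interaction(
        demo,
        trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
        prompt="Summarize the ticket. api_key=sk-demo",
        output="Summary text",
        prompt_tokens=42,
        completion_tokens=17,
        latency_ms=850.0,
        model_version="example-model-v1",
    )
```

Pattern-based redaction is only a last line of defense; it is safer not to pass secrets or PII to the logger in the first place.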
Step 2: Distributed Tracing
- Every request gets a unique traceId generated at the entry point.
- traceId is propagated through all downstream calls (HTTP headers, message queues, agent calls).
- Each service/agent creates a span for its work, with: start time, end time, parent span ID.
- Use OpenTelemetry as the standard instrumentation library.
Verify: You can trace a single request across all services/agents in a single view.
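To make the mechanics of this step concrete, here is a standard-library-only sketch of trace ID generation, span creation, and header propagation. In a real system the OpenTelemetry SDK would do this work; the names `Span`, `start_span`, `inject`, `extract`, and `FINISHED_SPANS` are hypothetical and exist only for this illustration. The header layout follows the W3C Trace Context `traceparent` format (`version-traceid-parentid-flags`).

```python
import secrets
import time
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    start_time: float
    end_time: Optional[float] = None


# The span that is active in the current execution context, if any.
_current_span: ContextVar[Optional[Span]] = ContextVar("current_span", default=None)

# Stand-in for an exporter: finished spans are simply collected here.
FINISHED_SPANS: List[Span] = []


@contextmanager
def start_span(name, trace_id=None, parent_span_id=None):
    """Open a span. Remote context wins, then the active span, else a new trace."""
    current = _current_span.get()
    if trace_id is None and current is not None:
        trace_id, parent_span_id = current.trace_id, current.span_id
    elif trace_id is None:
        trace_id = secrets.token_hex(16)  # entry point: 32 hex characters
    span = Span(name, trace_id, secrets.token_hex(8), parent_span_id, time.time())
    token = _current_span.set(span)
    try:
        yield span
    finally:
        span.end_time = time.time()
        _current_span.reset(token)
        FINISHED_SPANS.append(span)


def inject(headers):
    """Add a traceparent header describing the active span to outgoing headers."""
    span = _current_span.get()
    if span is not None:
        headers["traceparent"] = f"00-{span.trace_id}-{span.span_id}-01"
    return headers


def extract(headers):
    """Return (trace_id, parent_span_id) from incoming headers, or (None, None)."""
    parts = headers.get("traceparent", "").split("-")
    if len(parts) != 4 or len(parts[1]) != 32 or len(parts[2]) != 16:
        return None, None
    return parts[1], parts[2]


if __name__ == "__main__":
    # Upstream service: create the trace and capture outgoing headers.
    with start_span("gateway"):
        with start_span("planner-agent"):
            outgoing = inject({})

    # Downstream service or agent: continue the same trace from the headers.
    remote_trace_id, remote_parent_id = extract(outgoing)
    with start_span("retrieval-agent", remote_trace_id, remote_parent_id):
        pass

    for finished in FINISHED_SPANS:
        print(finished.name, finished.trace_id, finished.parent_span_id)
```

The same `inject`/`extract` pair applies to message queues and agent calls: the trace context travels in message metadata instead of HTTP headers.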
Step 3: Metrics
- Define and track key metrics:
  - RED metrics: Rate (requests/sec), Errors (error rate %), Duration (latency p50/p95/p99)
  - AI-specific: Token usage, prompt cost, model latency, hallucination rate, retrieval precision
- Dashboards: one dashboard per service with RED metrics, one dashboard for AI system health.
Verify: RED metrics are tracked for every service. AI-specific metrics tracked for AI systems.
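As a rough illustration of what the RED metrics mean numerically, the sketch below computes them from a window of request records, along with a simple prompt-cost figure. The names `RequestRecord`, `red_metrics`, and `prompt_cost`, the nearest-rank percentile method, and the per-1,000-token pricing model are assumptions for this example; a production setup would normally rely on a metrics library and backend rather than in-process lists.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    duration_ms: float
    is_error: bool


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty, ascending list (pct as an integer)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def red_metrics(records, window_seconds):
    """Compute Rate, Errors, and Duration for the requests seen in one window."""
    if not records:
        return {"rate_per_sec": 0.0, "error_rate_pct": 0.0,
                "p50_ms": None, "p95_ms": None, "p99_ms": None}
    durations = sorted(r.duration_ms for r in records)
    errors = sum(1 for r in records if r.is_error)
    return {
        "rate_per_sec": len(records) / window_seconds,
        "error_rate_pct": 100.0 * errors / len(records),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }


def prompt_cost(prompt_tokens, completion_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one model call; the prices are caller-supplied, hypothetical rates."""
    return (prompt_tokens / 1000) * price_in_per_1k + \
           (completion_tokens / 1000) * price_out_per_1k


if __name__ == "__main__":
    # 100 requests in a 10-second window; every 20th request failed.
    window = [RequestRecord(duration_ms=float(i), is_error=(i % 20 == 0))
              for i in range(1, 101)]
    print(red_metrics(window, window_seconds=10))
    print(prompt_cost(1000, 500, price_in_per_1k=0.5, price_out_per_1k=1.5))
```

Percentiles are reported instead of an average because a small number of slow requests can hide behind a healthy-looking mean.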
Step 4: Alerting
- Alerts must be actionable: every alert must have a runbook.
- Alert on symptoms (high error rate, high latency), not just causes.
- AI-specific alerts: token budget exceeded, model error rate spike, retrieval failure rate spike.
- On-call rotation: someone is responsible for every alert at all times.
Verify: Every alert has a runbook. On-call rotation defined.
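One way to enforce the runbook and ownership rules is to make them part of the alert definition itself, so that a rule without them cannot be created. The sketch below assumes a hypothetical `AlertRule` type and `evaluate` function, and the metric names, thresholds, owners, and `runbooks.example.com` URLs are placeholders, not real endpoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str        # symptom-level metric, e.g. error rate or latency
    threshold: float   # the alert fires when the metric exceeds this value
    runbook_url: str   # required: what the responder should do
    owner: str         # required: the on-call rotation responsible

    def __post_init__(self):
        # Reject rules that nobody could act on.
        if not self.runbook_url:
            raise ValueError(f"alert {self.name!r} has no runbook")
        if not self.owner:
            raise ValueError(f"alert {self.name!r} has no on-call owner")


def evaluate(rules, metrics):
    """Return the rules whose metric is present and above its threshold."""
    return [r for r in rules
            if r.metric in metrics and metrics[r.metric] > r.threshold]


# Hypothetical rules covering one service symptom and one AI-specific symptom.
RULES = [
    AlertRule("HighErrorRate", "error_rate_pct", 5.0,
              "https://runbooks.example.com/high-error-rate", "payments-oncall"),
    AlertRule("TokenBudgetExceeded", "tokens_used_today", 1_000_000,
              "https://runbooks.example.com/token-budget", "ai-platform-oncall"),
]

if __name__ == "__main__":
    current = {"error_rate_pct": 7.5, "tokens_used_today": 120_000}
    for rule in evaluate(RULES, current):
        print(f"{rule.name} -> page {rule.owner}, runbook: {rule.runbook_url}")
```

In this sketch a metric that is missing from the snapshot never fires; whether missing data should itself raise an alert is a separate design decision.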
Common Rationalizations (and Rebuttals)
| Excuse | Rebuttal |
|---|---|
| "We'll add monitoring after launch" | You'll be fighting fires blind. Add it before. |
| "console.log is enough" | In production, unstructured console.log output is noise. Structured logs with context are signals. |
| "The AI model handles it internally" | Model internals are a black box. You must observe the inputs and outputs. |
Verification
- Structured JSON logging on all services
- No secrets in logs
- Distributed tracing with trace ID propagation
- RED metrics tracked for all services
- AI-specific metrics tracked (tokens, cost, latency)
- Alerts configured with runbooks
References
- production-deployment skill
- multi-agent-orchestration skill
- OpenTelemetry documentation