Production Observability with OpenTelemetry
Purpose
Implement production-grade observability with OpenTelemetry, the vendor-neutral telemetry standard as of 2025. Covers the three pillars (metrics, logs, traces), LGTM stack deployment, and the log-trace correlation patterns that tie them together.
When to Use
Use when:
- Building production systems requiring visibility into performance and errors
- Debugging distributed systems with multiple services
- Setting up monitoring, logging, or tracing infrastructure
- Implementing structured logging with trace correlation
- Configuring alerting rules for production systems
Skip if:
- Building proof-of-concept without production deployment
- System has < 100 requests/day (console logging may suffice)
The OpenTelemetry Standard (2025)
OpenTelemetry is the CNCF project that unifies observability signals behind one set of APIs, SDKs, and a wire protocol (OTLP):
┌────────────────────────────────────────────────────────┐
│          OpenTelemetry: The Unified Standard           │
├────────────────────────────────────────────────────────┤
│                                                        │
│  ONE SDK for ALL signals:                              │
│  ├── Metrics (Prometheus-compatible)                   │
│  ├── Logs (structured, correlated)                     │
│  ├── Traces (distributed, standardized)                │
│  └── Context (propagates across services)              │
│                                                        │
│  Language SDKs:                                        │
│  ├── Python: opentelemetry-api, opentelemetry-sdk      │
│  ├── Rust: opentelemetry, tracing-opentelemetry        │
│  ├── Go: go.opentelemetry.io/otel                      │
│  └── TypeScript: @opentelemetry/api                    │
│                                                        │
│  Export to ANY backend:                                │
│  ├── LGTM Stack (Loki, Grafana, Tempo, Mimir)         │
│  ├── Prometheus + Jaeger                               │
│  ├── Datadog, New Relic, Honeycomb (SaaS)              │
│  └── Custom backends via OTLP protocol                 │
│                                                        │
└────────────────────────────────────────────────────────┘
Context7 Reference: /websites/opentelemetry_io (Trust: High, Snippets: 5,888, Score: 85.9)
The Three Pillars of Observability
1. Metrics (What is happening?)
Track system health and performance over time.
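Metric Types: Counters (monotonic, only increase), UpDownCounters (increase and decrease), Gauges (current value at a point in time), Histograms (distributions, from which backends derive percentiles). Prometheus-style Summaries (percentiles computed client-side) exist in the OTLP data model only for compatibility; the OpenTelemetry API has no Summary instrument.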
Brief Example (Python):
from opentelemetry import metrics
meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http.server.requests")
http_requests.add(1, {"method": "GET", "status": 200})
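To make the difference between the metric types concrete, here is a stdlib-only sketch that models a counter, a gauge, and an explicit-bucket histogram. The class names and bucket boundaries are illustrative assumptions, not the OpenTelemetry SDK; real instruments come from the meter (`meter.create_counter`, `meter.create_histogram`, and so on).

```python
from bisect import bisect_left


class Counter:
    """Monotonic: the value can only go up."""

    def __init__(self) -> None:
        self.value = 0

    def add(self, amount: int) -> None:
        if amount < 0:
            raise ValueError("counters never decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that may rise or fall."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; boundaries are inclusive upper bounds."""

    def __init__(self, boundaries: list[float]) -> None:
        self.boundaries = sorted(boundaries)
        # One extra bucket catches values above the last boundary.
        self.counts = [0] * (len(self.boundaries) + 1)

    def record(self, value: float) -> None:
        self.counts[bisect_left(self.boundaries, value)] += 1


http_requests = Counter()
in_flight = Gauge()
latency_ms = Histogram([50, 100, 250])  # hypothetical boundaries

for duration in (12, 80, 100, 240, 900):
    http_requests.add(1)
    latency_ms.record(duration)
in_flight.set(3)

print(http_requests.value, in_flight.value, latency_ms.counts)
```

The histogram keeps only per-bucket counts, which is why percentiles read from it in a backend are estimates bounded by the chosen boundaries.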
2. Logs (What happened?)
Record discrete events with context.
CRITICAL: Always inject trace_id/span_id for log-trace correlation.
Brief Example (Python + structlog):
import structlog
from opentelemetry import trace
logger = structlog.get_logger()
user_id = 42  # placeholder; normally taken from the request
# Read the IDs of the span that is active at the time of the log call
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
    "processing_request",
    trace_id=format(ctx.trace_id, '032x'),
    span_id=format(ctx.span_id, '016x'),
    user_id=user_id,
)
See: references/structured-logging.md for complete configuration.
3. Traces (Where did time go?)
Track request flow across distributed services.
Key Concepts: Trace (end-to-end journey), Span (individual operation), Parent-Child (nested operations).
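How a trace, its spans, and parent-child nesting relate can be sketched without the SDK. In this stdlib-only illustration, `Span`, `start_span`, and `finished` are hypothetical names; in a real service the spans come from the OpenTelemetry tracer (for example `tracer.start_as_current_span`).

```python
import contextvars
import secrets
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional

# Tracks the span that is "current" for the running task.
_current = contextvars.ContextVar("current_span", default=None)
finished = []  # stand-in for an exporter


@dataclass
class Span:
    name: str
    trace_id: int
    span_id: int
    parent_id: Optional[int]


@contextmanager
def start_span(name: str):
    parent = _current.get()
    span = Span(
        name=name,
        # Children inherit the trace id; a root span starts a new trace.
        trace_id=parent.trace_id if parent is not None else secrets.randbits(128),
        span_id=secrets.randbits(64),
        parent_id=parent.span_id if parent is not None else None,
    )
    token = _current.set(span)
    try:
        yield span
    finally:
        # Restore the previous current span, then hand this one off.
        _current.reset(token)
        finished.append(span)


with start_span("GET /orders") as root:
    with start_span("db.query") as child:
        pass
```

Both spans share one `trace_id`, and the child's `parent_id` points at the root's `span_id`; those two links are what a backend uses to rebuild the request tree.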
Brief Example (Python + FastAPI):
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-traces all HTTP requests
See: references/opentelemetry-setup.md for SDK installation by language.
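Across service boundaries, the trace context travels in request headers; OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header. The following stdlib-only sketch shows that header's layout. The function names are hypothetical, and in practice the SDK's propagators and auto-instrumentation handle this for you.

```python
import re

# Version "00": 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def build_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render the header a client would send with an outgoing request."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_hex, span_hex, flags = match.groups()
    # All-zero ids are invalid.
    if int(trace_hex, 16) == 0 or int(span_hex, 16) == 0:
        return None
    return int(trace_hex, 16), int(span_hex, 16), bool(int(flags, 16) & 0x01)


header = build_traceparent(0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7)
parsed = parse_traceparent(header)
```

The receiving service starts its own span with the parsed trace id and uses the parsed span id as the parent, which is how one trace spans several services.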
The LGTM Stack (Self-Hosted Observability)
LGTM = Loki (Logs) + Grafana (Visualization) + Tempo (Traces) + Mimir (Metrics)
┌────────────────────────────────────────────────────────┐
│                   LGTM Architecture                    │
├────────────────────────────────────────────────────────┤
│                                                        │
│    ┌──────────────────────────────────────────────┐    │
│    │        Grafana Dashboard (Port 3000)         │    │
│    │     Unified UI for Logs, Metrics, Traces     │    │
│    └──────┬───────────────┬───────────────┬───────┘    │
│           │               │               │            │
│           ▼               ▼               ▼            │
│      ┌──────────┐    ┌──────────┐    ┌──────────┐      │
│      │   Loki   │    │  Tempo   │    │  Mimir   │      │
│      │  (Logs)  │    │ (Traces) │    │(Metrics) │      │
│      │Port 3100 │    │Port 3200 │    │Port 9009 │      │
│      └────▲─────┘    └────▲─────┘    └────▲─────┘      │
│           │               │               │            │
│           └───────────────┼───────────────┘            │
│                           │                            │
│                   ┌───────▼────────┐                   │
│                   │ Grafana Alloy  │                   │
│                   │  (Collector)   │                   │
│                   │  Port 4317/8   │ ← OTLP gRPC/HTTP  │
│                   └───────▲────────┘                   │
│                           │                            │
│            OpenTelemetry Instrumented Apps             │
│                                                        │
└────────────────────────────────────────────────────────┘
Quick Start: Start the complete LGTM stack defined in examples/lgtm-docker-compose/docker-compose.yml.
See: references/lgtm-stack.md for production deployment guide.
Critical Pattern: Log-Trace Correlation
The Problem: Logs and traces live in separate systems. You see an error log but can't find the related trace.
The Solution: Inject trace_id and span_id into every log record.
Python (structlog)
import structlog
from opentelemetry import trace
logger = structlog.get_logger()
user_id = 42  # placeholder; normally taken from the request
# Read the IDs of the span that is active at the time of the log call
span = trace.get_current_span()
ctx = span.get_span_context()
logger.info(
    "request_processed",
    trace_id=format(ctx.trace_id, '032x'),  # 32-char hex
    span_id=format(ctx.span_id, '016x'),    # 16-char hex
    user_id=user_id,
)
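Passing the IDs by hand on every call is easy to forget, so the injection is usually centralized. As a sketch of that idea with the standard `logging` module, a filter can stamp every record. The `current_span_ids` context variable is a hypothetical stand-in for the active OpenTelemetry span context, used here so the example runs without the SDK.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for OpenTelemetry's active span context:
# holds (trace_id, span_id) as integers, or None outside a span.
current_span_ids = contextvars.ContextVar("current_span_ids", default=None)


class TraceContextFilter(logging.Filter):
    """Attach trace_id/span_id to every record passing through a handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        ids = current_span_ids.get()
        # 32 and 16 hex chars; all zeros mark "no active span".
        record.trace_id = format(ids[0] if ids else 0, "032x")
        record.span_id = format(ids[1] if ids else 0, "016x")
        return True


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(
    logging.Formatter(
        "%(levelname)s %(message)s trace_id=%(trace_id)s span_id=%(span_id)s"
    )
)

logger = logging.getLogger("correlation_demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_span_ids.set((0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
logger.info("request_processed")
line = stream.getvalue().strip()
```

With the filter attached to the handler, call sites log as usual and every line still carries the IDs needed to jump from a log entry to its trace.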
Rust (tracing)
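use tracing::{info, instrument};
// #[instrument] opens a span for each call and records user_id as a span field.
#[instrument]
async fn process_request(user_id: u64) -> Result<Response> {
    // This event is emitted inside that span. With a tracing-opentelemetry layer
    // installed, the span (and therefore the event) belongs to the active trace.
    info!("processing request");
    // Result, Response, and result are placeholders for application types and values.
    Ok(result)
}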
See: references/trace-context.md for Go and TypeScript patterns.
Query in Grafana (LogQL, Loki data source)
{job="api-service"} |= "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
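The exact filter depends on how the logs are rendered. As a sketch, the helper below builds the query above for key=value log lines and an alternative for JSON log lines that uses LogQL's `json` parser. The function name and the `json_logs` flag are assumptions for illustration.

```python
import re


def logql_for_trace(job: str, trace_id: str, json_logs: bool = False) -> str:
    """Build a LogQL query that finds every log line for one trace."""
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    selector = f'{{job="{job}"}}'
    if json_logs:
        # Parse JSON log lines, then filter on the extracted field.
        return f'{selector} | json | trace_id="{trace_id}"'
    # Plain substring match for key=value style log lines.
    return f'{selector} |= "trace_id={trace_id}"'


plain_query = logql_for_trace("api-service", "4bf92f3577b34da6a3ce929d0e0e4736")
json_query = logql_for_trace(
    "api-service", "4bf92f3577b34da6a3ce929d0e0e4736", json_logs=True
)
```

Validating the ID before building the query keeps a malformed or truncated trace ID from silently matching nothing.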
Quick Setup Guide
1. Choose Your Stack
Decision Tree:
- Greenfield: OpenTelemetry SDK + LGTM Stack (self-hosted) or Grafana Cloud (managed)
- Existing Prometheus: Add Loki (logs) + Tempo (traces)
- Kubernetes: LGTM via Helm, Alloy DaemonSet
- Zero-ops: Managed SaaS (Grafana Cloud, Datadog, New Relic)
2. Install OpenTelemetry SDK
Bootstrap Script:
python scripts/setup_otel.py --language python --framework fastapi
Manual (Python):
pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-exporter-otlp
See: references/opentelemetry-setup.md for Rust, Go, TypeScript installation.
3. Deploy LGTM Stack
Docker Compose (development):
cd examples/lgtm-docker-compose
docker compose up -d  # on legacy Compose V1 installs: docker-compose up -d
# Grafana: http://localhost:3000 (admin/admin)
# OTLP: localhost:4317 (gRPC), localhost:4318 (HTTP)
See: references/lgtm-stack.md for production Kubernetes deployment.
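Once the collector is listening, an instrumented app only needs to know where to send OTLP data. One way to do that is through the standard OpenTelemetry environment variables; the sketch below assumes the local ports listed above, and the helper name `otlp_env` is hypothetical.

```python
def otlp_env(
    service_name: str, protocol: str = "grpc", host: str = "localhost"
) -> dict[str, str]:
    """Return the environment variables that point an OTel SDK at a collector."""
    # Default OTLP ports: 4317 for gRPC, 4318 for HTTP.
    ports = {"grpc": 4317, "http/protobuf": 4318}
    if protocol not in ports:
        raise ValueError(f"unsupported protocol: {protocol!r}")
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{ports[protocol]}",
    }


grpc_env = otlp_env("api-service")
http_env = otlp_env("api-service", protocol="http/protobuf")
```

Applying the result with `os.environ.update(...)` before the SDK is initialized, or exporting the same variables in the container definition, keeps endpoints out of application code.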
4. Configure Structured Logging
See: references/structured-logging.md for complete setup (Python, Rust, Go, TypeScript).
5. Set Up Alerting
See: references/alerting-rules.md