Production Observability with OpenTelemetry

Purpose

Implement production-grade observability using OpenTelemetry, the de facto industry standard as of 2025. Covers the three pillars (metrics, logs, traces), LGTM stack deployment, and the log-trace correlation patterns that tie them together.

When to Use

Use when:

Building production systems requiring visibility into performance and errors
Debugging distributed systems with multiple services
Setting up monitoring, logging, or tracing infrastructure
Implementing structured logging with trace correlation
Configuring alerting rules for production systems

Skip if:

Building proof-of-concept without production deployment
System has < 100 requests/day (console logging may suffice)

The OpenTelemetry Standard (2025)

OpenTelemetry is the CNCF project that unifies observability signals under one specification:

┌────────────────────────────────────────────────────────┐
│          OpenTelemetry: The Unified Standard           │
├────────────────────────────────────────────────────────┤
│                                                         │
│  ONE SDK for ALL signals:                              │
│  ├── Metrics (Prometheus-compatible)                   │
│  ├── Logs (structured, correlated)                     │
│  ├── Traces (distributed, standardized)                │
│  └── Context (propagates across services)              │
│                                                         │
│  Language SDKs:                                         │
│  ├── Python: opentelemetry-api, opentelemetry-sdk      │
│  ├── Rust: opentelemetry, tracing-opentelemetry        │
│  ├── Go: go.opentelemetry.io/otel                      │
│  └── TypeScript: @opentelemetry/api                    │
│                                                         │
│  Export to ANY backend:                                │
│  ├── LGTM Stack (Loki, Grafana, Tempo, Mimir)          │
│  ├── Prometheus + Jaeger                               │
│  ├── Datadog, New Relic, Honeycomb (SaaS)              │
│  └── Custom backends via OTLP protocol                 │
│                                                         │
└────────────────────────────────────────────────────────┘

Context7 Reference: /websites/opentelemetry_io (Trust: High, Snippets: 5,888, Score: 85.9)

The Three Pillars of Observability

1. Metrics (What is happening?)

Track system health and performance over time.

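Metric Types: Counters (monotonic, only increase), Gauges (go up and down), Histograms (bucketed distributions, used to derive percentiles). Summaries (client-side percentiles) are a Prometheus type; OpenTelemetry supports them only for compatibility and has no Summary instrument.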

Brief Example (Python):

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http.server.requests")
http_requests.add(1, {"method": "GET", "status": 200})
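To make the difference between the metric types concrete, here is a minimal stdlib-only sketch of their semantics. The `Counter`, `Gauge`, and `Histogram` classes and the bucket bounds below are illustrative assumptions, not the OpenTelemetry SDK's own classes.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class Counter:
    """Monotonic total: it only ever increases."""
    value: float = 0.0

    def add(self, amount: float) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


@dataclass
class Gauge:
    """Point-in-time value that can go up or down."""
    value: float = 0.0

    def set(self, value: float) -> None:
        self.value = value


@dataclass
class Histogram:
    """Counts observations per bucket; the last bucket is open-ended."""
    bounds: tuple = (0.1, 0.5, 1.0)  # upper bounds in seconds (assumed values)
    counts: list = field(default_factory=list)

    def __post_init__(self) -> None:
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, value: float) -> None:
        # bisect_left finds the first bucket whose upper bound is >= value
        self.counts[bisect.bisect_left(self.bounds, value)] += 1


requests_total = Counter()
in_flight = Gauge()
latency = Histogram()

for seconds in (0.05, 0.3, 0.3, 2.0):
    requests_total.add(1)
    latency.record(seconds)
in_flight.set(3)
```

A backend derives rates from the counter, shows the gauge as-is, and estimates percentiles from the histogram's bucket counts.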

2. Logs (What happened?)

Record discrete events with context.

CRITICAL: Always inject trace_id/span_id for log-trace correlation.

Brief Example (Python + structlog):

import structlog
from opentelemetry import trace

logger = structlog.get_logger()
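# Call this inside an active span; outside one, both IDs are all zeros.
span = trace.get_current_span()
ctx = span.get_span_context()

logger.info(
    "processing_request",
    trace_id=format(ctx.trace_id, '032x'),
    span_id=format(ctx.span_id, '016x'),
    user_id=user_id,  # supplied by the surrounding request handler
)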

See: references/structured-logging.md for complete configuration.

3. Traces (Where did time go?)

Track request flow across distributed services.

Key Concepts: Trace (end-to-end journey), Span (individual operation), Parent-Child (nested operations).

Brief Example (Python + FastAPI):

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-traces all HTTP requests

See: references/opentelemetry-setup.md for SDK installation by language.
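A request keeps the same trace across services because each outgoing call carries the caller's trace context, typically in the W3C `traceparent` header. The sketch below parses and formats that header with the standard library only. `SpanContext`, `parse_traceparent`, and `format_traceparent` are hypothetical helper names, and the parser handles only the four-field layout of version `00`; in practice the OpenTelemetry propagators do this for you.

```python
import re
from typing import NamedTuple, Optional

# version - trace_id (32 hex) - parent span_id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str   # shared by every span in the trace
    span_id: str    # identifies the caller's span (the parent of ours)
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's span context, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the W3C Trace Context spec.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: SpanContext) -> str:
    """Build the header a service attaches to its outgoing requests."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parent = parse_traceparent(incoming)
```

The downstream service creates its own span with a new `span_id`, keeps the `trace_id`, and records `parent.span_id` as the parent; that is what produces the parent-child tree in the trace view.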

The LGTM Stack (Self-Hosted Observability)

LGTM = Loki (Logs) + Grafana (Visualization) + Tempo (Traces) + Mimir (Metrics)

┌────────────────────────────────────────────────────────┐
│                  LGTM Architecture                      │
├────────────────────────────────────────────────────────┤
│                                                         │
│  ┌──────────────────────────────────────────────┐      │
│  │           Grafana Dashboard (Port 3000)      │      │
│  │  Unified UI for Logs, Metrics, Traces       │      │
│  └──────┬──────────────┬─────────────┬─────────┘      │
│         │              │             │                 │
│         ▼              ▼             ▼                 │
│  ┌──────────┐   ┌──────────┐  ┌──────────┐            │
│  │   Loki   │   │  Tempo   │  │  Mimir   │            │
│  │  (Logs)  │   │ (Traces) │  │(Metrics) │            │
│  │Port 3100 │   │Port 3200 │  │Port 9009 │            │
│  └────▲─────┘   └────▲─────┘  └────▲─────┘            │
│       │              │             │                   │
│       └──────────────┴─────────────┘                   │
│                      │                                 │
│              ┌───────▼────────┐                        │
│              │ Grafana Alloy  │                        │
│              │  (Collector)   │                        │
│              │  Port 4317/8   │ ← OTLP gRPC/HTTP       │
│              └───────▲────────┘                        │
│                      │                                 │
│         OpenTelemetry Instrumented Apps                │
│                                                         │
└────────────────────────────────────────────────────────┘

Quick Start: Start the stack defined in examples/lgtm-docker-compose/docker-compose.yml for a complete local LGTM stack.

See: references/lgtm-stack.md for production deployment guide.

Critical Pattern: Log-Trace Correlation

The Problem: Logs and traces live in separate systems. You see an error log but can't find the related trace.

The Solution: Inject trace_id and span_id into every log record.

Python (structlog)

import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()  # all-zero IDs if no span is active

logger.info(
    "request_processed",
    trace_id=format(ctx.trace_id, '032x'),  # 32-char hex
    span_id=format(ctx.span_id, '016x'),    # 16-char hex
    user_id=user_id,                        # supplied by the surrounding request handler
)
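Passing the IDs by hand on every call is easy to forget. One way to cover every log record is to attach them in a single place. The sketch below does this with the standard library's `logging.Filter`. The `current_span_ids` context variable is a hypothetical stand-in for the active span context (in a real service the filter would read `trace.get_current_span().get_span_context()` instead), and `TraceContextFilter` and `JsonFormatter` are illustrative names.

```python
import contextvars
import io
import json
import logging

# Hypothetical stand-in for the active span context: (trace_id, span_id) as ints.
current_span_ids = contextvars.ContextVar("current_span_ids", default=(0, 0))


class TraceContextFilter(logging.Filter):
    """Attach trace_id/span_id to every record that passes through the handler."""

    def filter(self, record: logging.LogRecord) -> bool:
        trace_id, span_id = current_span_ids.get()
        record.trace_id = format(trace_id, "032x")  # 32-char hex
        record.span_id = format(span_id, "016x")    # 16-char hex
        return True  # never drop the record, only enrich it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "event": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()  # stands in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("api-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

# Simulate being inside a span, then log without mentioning the IDs at all.
current_span_ids.set((0x4BF92F3577B34DA6A3CE929D0E0E4736, 0x00F067AA0BA902B7))
logger.info("request_processed")
```

With structlog, the equivalent place for this logic would be a processor in the processor chain rather than a `logging.Filter`.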

Rust (tracing)

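use tracing::{info, instrument};

// #[instrument] opens a span and records the arguments (here user_id) as span fields.
#[instrument]
async fn process_request(user_id: u64) -> Result<(), Box<dyn std::error::Error>> {
    // With a tracing-opentelemetry layer installed, this event is attached to the
    // active span; the log formatter must still be configured to print trace_id/span_id.
    info!("processing request");
    Ok(())
}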

See: references/trace-context.md for Go and TypeScript patterns.

Query in Grafana

{job="api-service"} |= "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
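The line filter above matches logs rendered as `key=value` text. If the service emits JSON logs instead, the trace ID appears as a JSON field and can be matched after Loki's `json` parser stage. The helper below is a small sketch that builds either query; `trace_query` is a hypothetical name, and the `job` label is taken from the example above.

```python
import re


def trace_query(job: str, trace_id: str, json_logs: bool = True) -> str:
    """Build a LogQL query that finds every log line for one trace."""
    if not re.fullmatch(r"[0-9a-f]{32}", trace_id):
        raise ValueError("trace_id must be 32 lowercase hex characters")
    selector = f'{{job="{job}"}}'
    if json_logs:
        # Parse each line as JSON, then filter on the extracted field.
        return f'{selector} | json | trace_id="{trace_id}"'
    # Plain substring match; works for any log format that prints the ID.
    return f'{selector} |= "{trace_id}"'


query = trace_query("api-service", "4bf92f3577b34da6a3ce929d0e0e4736")
```

Validating the ID first avoids sending Loki a query built from a truncated or mistyped value.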

Quick Setup Guide

1. Choose Your Stack

Decision Tree:

Greenfield: OpenTelemetry SDK + LGTM Stack (self-hosted) or Grafana Cloud (managed)
Existing Prometheus: Add Loki (logs) + Tempo (traces)
Kubernetes: LGTM via Helm, Alloy DaemonSet
Zero-ops: Managed SaaS (Grafana Cloud, Datadog, New Relic)

2. Install OpenTelemetry SDK

Bootstrap Script:

python scripts/setup_otel.py --language python --framework fastapi

Manual (Python):

pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-exporter-otlp

See: references/opentelemetry-setup.md for Rust, Go, TypeScript installation.

3. Deploy LGTM Stack

Docker Compose (development):

cd examples/lgtm-docker-compose
docker compose up -d   # Compose v2; use docker-compose on older installs
# Grafana: http://localhost:3000 (admin/admin)
# OTLP: localhost:4317 (gRPC), localhost:4318 (HTTP)
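Once the stack is up, an instrumented app needs to be pointed at one of the two OTLP ports. A common way is through the standard OpenTelemetry environment variables. The sketch below builds them for either protocol; `otlp_env` is a hypothetical helper, and the host and ports assume the local Docker Compose setup described above.

```python
import os

# OTLP ports exposed by the collector in the local stack (from the comments above).
OTLP_PORTS = {"grpc": 4317, "http/protobuf": 4318}


def otlp_env(service_name: str, protocol: str = "grpc", host: str = "localhost") -> dict:
    """Return the OTEL_* environment variables for the chosen OTLP protocol."""
    if protocol not in OTLP_PORTS:
        raise ValueError(f"protocol must be one of {sorted(OTLP_PORTS)}")
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{OTLP_PORTS[protocol]}",
    }


env = otlp_env("api-service")
# Apply before the OpenTelemetry SDK is initialised so exporters pick the values up.
os.environ.update(env)
```

Keeping the endpoint in the environment rather than in code means the same build can target the local stack, a Kubernetes collector, or a managed backend without changes.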

See: references/lgtm-stack.md for production Kubernetes deployment.

4. Configure Structured Logging

See: references/structured-logging.md for complete setup (Python, Rust, Go, TypeScript).

5. Set Up Alerting

See: references/alerting-rules.md
