AI/ML: Building Production AI Applications

Build, review, and architect applications that use AI models - from single API calls to multi-agent systems with RAG pipelines. The goal is production-grade AI apps that are reliable, cost-effective, and don't hallucinate their way into an incident.

Target versions: May 2026 snapshot. Read references/target-versions.md before pinning model IDs (Claude/OpenAI families), SDKs, runtimes, vector stores, or evaluation tools.

When to use

Integrating LLM APIs (Anthropic, OpenAI, etc.) into applications
Building RAG pipelines (chunking, embedding, retrieval, generation)
Designing agent systems (tool use, loops, state, multi-agent)
Choosing between fine-tuning, RAG, and prompt engineering
Setting up vector stores for semantic search
Implementing structured output and tool use / function calling
Building evaluation and testing harnesses for AI features
Optimizing token costs, latency, and model routing
Setting up local inference with Ollama or vLLM
Adding safety guardrails (content filtering, PII handling, output validation)

When NOT to use

Building MCP servers or tools (use mcp - it handles the protocol layer)
Writing or refining individual prompts (use prompt-generator)
General database configuration, schema design, or migrations (use databases)
Security auditing AI application code (use security-audit)
Reviewing code quality unrelated to AI/ML patterns (use code-review)
Building AI-powered HTTP APIs (use backend-api for the API layer; return here for the LLM integration within it)
Reviewing AI-generated application code for slop, hallucinated APIs, or over-abstraction (use anti-slop)

AI Self-Check

AI tools consistently produce the same mistakes when generating AI application code. Before returning any generated AI/ML code, verify against this list:

Performance

Batch embeddings and eval runs; avoid one request per row when the provider offers batch or bulk APIs.
Cache deterministic retrieval, tool metadata, and prompt templates, but never cache tenant-specific model outputs without a data-retention decision.
Track token, latency, and retry budgets separately for interactive, background, and eval traffic.
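
The batching advice above can be sketched as a small helper. This is a minimal sketch under stated assumptions: embed_batch is a hypothetical callable standing in for whatever bulk embedding call your provider offers, and the default batch size of 64 is an arbitrary placeholder, not a provider limit.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def embed_all(
    texts: Iterable[str],
    embed_batch: Callable[[list[str]], list[list[float]]],  # hypothetical provider call
    batch_size: int = 64,  # assumed value; check your provider's documented limits
) -> list[list[float]]:
    """Embed every text with one request per batch instead of one per row."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_batch(chunk))
    return vectors
```

Passing the provider call in as a parameter keeps the helper testable with a fake and lets interactive, background, and eval traffic use different batch sizes.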

Best Practices

Prefer raw provider SDKs until orchestration complexity justifies LangGraph, LlamaIndex, or LangChain.
Keep model, tool, retrieval, and safety decisions configurable per environment; avoid hardcoding preview model names in application logic.
Treat model output as untrusted input: validate structure, refusal states, tool arguments, and downstream side effects.
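
Treating model output as untrusted input might look like the following. This is a sketch using only the standard library: the ALLOWED_TOOLS registry, the search_docs tool, and the {"name": ..., "args": ...} envelope are all hypothetical, chosen for illustration rather than taken from any provider's wire format.

```python
import json
from dataclasses import dataclass

# Hypothetical registry: tool name -> expected argument names and types.
ALLOWED_TOOLS = {"search_docs": {"query": str, "top_k": int}}


@dataclass
class ToolCall:
    name: str
    args: dict


def parse_tool_call(raw: str) -> ToolCall:
    """Parse a model-produced tool call; raise ValueError on anything unexpected."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    name, args = data.get("name"), data.get("args")
    if not isinstance(name, str) or name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("args must be a JSON object")

    schema = ALLOWED_TOOLS[name]
    if set(args) != set(schema):
        raise ValueError(f"expected arguments {sorted(schema)}, got {sorted(args)}")
    for key, expected in schema.items():
        value = args[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise ValueError(f"argument {key!r} must be {expected.__name__}")
    return ToolCall(name, args)
```

Only a ToolCall that passes every check should reach code with side effects; a ValueError is a signal to retry, fall back, or surface an error rather than execute.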

Workflow

Step 1: Determine the architecture pattern

Need	Pattern	Start with
Single model call	Direct API integration	Provider SDK
Knowledge-grounded answers	RAG pipeline	Vector store + retrieval
Multi-step reasoning	Agent with tools	LangGraph, OpenAI Agents SDK, or custom loop
Multiple specialized models	Model routing / chain	Custom router or Vercel AI SDK
Offline / air-gapped	Local inference	Ollama or vLLM
Existing data enrichment	Batch processing	Provider batch APIs
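
For the "Multiple specialized models" row, a custom router can start as a lookup table. The sketch below is one possible shape, not a prescribed design: the task names, the MODEL_CLASSIFY and MODEL_DRAFT environment variables, and the placeholder model IDs are all hypothetical. Reading model IDs from the environment keeps them out of application logic.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    max_tokens: int


def load_routes(env: Mapping[str, str] = os.environ) -> dict[str, Route]:
    """Build the routing table from environment variables (hypothetical names)."""
    return {
        # Cheap, fast model for short classification-style tasks.
        "classify": Route(env.get("MODEL_CLASSIFY", "small-model-placeholder"), 256),
        # Larger model for long-form generation.
        "draft": Route(env.get("MODEL_DRAFT", "large-model-placeholder"), 2048),
    }


def pick_route(task: str, routes: Mapping[str, Route]) -> Route:
    """Return the route for a task, failing loudly on unconfigured tasks."""
    try:
        return routes[task]
    except KeyError:
        raise ValueError(f"no route configured for task {task!r}") from None
```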

Step 2: Choose the right abstraction level

Pick the lightest tool that solves the problem:

Raw SDK - direct Anthropic/OpenAI SDK calls. Best for simple integrations, maximum control, minimum dependencies. Start here unless you have a specific reason not to.
Vercel AI SDK - unified provider interface with streaming primitives. Good for TypeScript apps that need provider-agnostic code or React/Next.js streaming UI.
LangChain / LlamaIndex - orchestration frameworks. Use when you need complex chains, built-in document loaders, or 300+ pre-built integrations. Don't use for simple API calls - the abstraction overhead isn't worth it.
LangGraph / OpenAI Agents SDK - stateful agent frameworks. Use when you need cycles, persistence, human-in-the-loop, or multi-agent coordination.

The anti-pattern: importing LangChain to make a single API call. That's like importing Django to serve a static HTML file.

Step 3: Implement

Follow the domain-specific sections below. Read the appropriate reference file for detailed patterns and code examples.

Step 4: Evaluate and validate

Every AI feature needs evaluation. Not "run it once and eyeball the output" - structured evals with datasets, metrics, and regression detection.

Minimum viable eval: create a promptfooconfig.yaml with 20+ test cases; use contains, llm-rubric, and cost assertions; and run npx promptfoo eval in CI on every PR that touches prompts. Track pass rate over time - any regression blocks the merge.
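
The "any regression blocks the merge" rule can be sketched as a small gate. This is not promptfoo functionality: EvalResult, pass_rate, and gate are hypothetical names, and how results are collected from the eval run and where the baseline is stored are left as assumptions.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    case_id: str
    passed: bool


def pass_rate(results: list[EvalResult]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results")
    return sum(r.passed for r in results) / len(results)


def gate(results: list[EvalResult], baseline: float, tolerance: float = 0.0) -> bool:
    """Return True when the merge may proceed.

    `baseline` is the pass rate recorded on the main branch; `tolerance`
    is an optional allowance for noisy LLM-graded assertions.
    """
    return pass_rate(results) + tolerance >= baseline
```

In CI, a script would exit non-zero when gate(...) returns False, which is what actually blocks the PR.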

Read references/evaluation.md for promptfoo setup, assertion types, CI integration (GitHub Actions example), RAG-specific evals, agent evals, and red teaming patterns.

LLM Integration Patterns

Streaming

Always stream for user-facing responses. Buffer for background processing.

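# Anthropic streaming (Python)
from collections.abc import Iterator

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def stream_reply(prompt: str) -> Iterator[str]:
    """Yield text chunks as the model produces them."""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text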

Structured output

Use native provider mechanisms, not regex parsing of free-text responses.

Anthropic: tool_use with JSON schema, or response_format with json_schema
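
The tool_use route can be sketched as follows. This is a minimal sketch, assuming the Messages API tool format (name, description, input_schema) and a forced tool_choice; the record_invoice tool, its fields, and both helper functions are hypothetical. The code only builds the request and reads the response, so it runs without a network call.

```python
from typing import Any, Iterable

# Hypothetical tool; the name and fields are illustrative only.
INVOICE_TOOL: dict[str, Any] = {
    "name": "record_invoice",
    "description": "Record the fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
        },
        "required": ["vendor", "total"],
    },
}


def build_request(model: str, text: str) -> dict[str, Any]:
    """Build keyword arguments that force the model to answer via the tool."""
    return {
        "model": model,  # pass the model ID in from configuration
        "max_tokens": 1024,
        "tools": [INVOICE_TOOL],
        "tool_choice": {"type": "tool", "name": INVOICE_TOOL["name"]},
        "messages": [{"role": "user", "content": text}],
    }


def extract_tool_input(content_blocks: Iterable[Any], tool_name: str) -> dict[str, Any]:
    """Return the input of the first matching tool_use block in a response."""
    for block in content_blocks:
        if getattr(block, "type", None) == "tool_use" and getattr(block, "name", None) == tool_name:
            return dict(block.input)
    raise ValueError(f"no tool_use block for {tool_name!r} in response")
```

With the SDK installed, usage would look roughly like response = client.messages.create(**build_request(model_id, text)) followed by extract_tool_input(response.content, "record_invoice"). The extracted dict is still model output and should be validated before use.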
