AI/ML: Building Production AI Applications
Build, review, and architect applications that use AI models - from single API calls to multi-agent systems with RAG pipelines. The goal is production-grade AI apps that are reliable, cost-effective, and don't hallucinate their way into an incident.
Target versions: May 2026 snapshot. Read references/target-versions.md before
pinning model IDs (Claude/OpenAI families), SDKs, runtimes, vector stores, or evaluation tools.
When to use
- Integrating LLM APIs (Anthropic, OpenAI, etc.) into applications
- Building RAG pipelines (chunking, embedding, retrieval, generation)
- Designing agent systems (tool use, loops, state, multi-agent)
- Choosing between fine-tuning, RAG, and prompt engineering
- Setting up vector stores for semantic search
- Implementing structured output and tool use / function calling
- Building evaluation and testing harnesses for AI features
- Optimizing token costs, latency, and model routing
- Setting up local inference with Ollama or vLLM
- Adding safety guardrails (content filtering, PII handling, output validation)
When NOT to use
- Building MCP servers or tools (use mcp - it handles the protocol layer)
- Writing or refining individual prompts (use prompt-generator)
- General database configuration, schema design, or migrations (use databases)
- Security auditing AI application code (use security-audit)
- Reviewing code quality unrelated to AI/ML patterns (use code-review)
- Building AI-powered HTTP APIs (use backend-api for the API layer; return here for the LLM integration within it)
- Reviewing AI-generated application code for slop, hallucinated APIs, or over-abstraction (use anti-slop)
AI Self-Check
AI tools consistently produce the same mistakes when generating AI application code. Before returning any generated AI/ML code, verify against this list:
- API keys loaded from environment variables, never hardcoded
- Streaming responses handled with proper error boundaries and cleanup
- Token limits respected - input truncation or chunking for long contexts
- Structured output uses the provider's native schema enforcement (Anthropic tool_use, OpenAI response_format), not post-hoc parsing with regex
- Tool use / function calling validates tool results before passing back to the model
- Retry logic uses exponential backoff with jitter, not fixed delays
- Rate limit errors (429) handled distinctly from server errors (5xx)
- Vector store queries include a relevance threshold - don't blindly pass low-similarity results to the model
- Embedding model matches between indexing and querying (mixing models = garbage results)
- Prompt templates use parameterized injection, not string concatenation
- Model responses validated before use (check for refusals, empty content, malformed JSON)
- Cost estimation done before batch operations (token count * price * volume)
- No synchronous LLM calls in request handlers - always async with timeouts
- PII stripped or masked before sending to external model APIs
- Temperature set intentionally (0 for deterministic tasks, higher for creative)
- Current source checked: dated versions, CLI flags, API names, and support windows are verified against primary docs before repeating them
- Hidden state identified: local config, credentials, caches, contexts, branches, cluster targets, or previous runs are made explicit before acting
- Verification is real: final checks exercise the actual runtime, parser, service, or integration point instead of only linting prose or happy paths
- Routing overlap checked: overlapping skills, trigger terms, and "When NOT to use" boundaries are checked before returning guidance
- Spec claims verified: claims about tool behavior, output contracts, or repo conventions are checked against current docs, scripts, or skill files
- Provider drift checked: Responses/Agents/SDK examples use current provider surfaces, not deprecated patterns - specifically verify no use of openai.beta.assistants.create (Assistants API, superseded by Responses/Agents API) or other Assistants-era surfaces
- RAG evidence bounded: retrieval thresholds, citations, and empty-result behavior are defined before generation
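Two of the retry items in the checklist above (backoff with jitter, and 429s handled apart from 5xx errors) can be sketched without any provider SDK. This is a minimal illustration, not a provider API: `RateLimitError`, `ServerError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and real SDKs raise their own exception types that you would map onto these two cases.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""

    def __init__(self, retry_after=None):
        super().__init__("rate limited")
        self.retry_after = retry_after  # seconds, if the server sent a hint


class ServerError(Exception):
    """Hypothetical stand-in for a provider's HTTP 5xx error."""


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, max_attempts=5, sleep=time.sleep):
    """Call fn(), retrying 429 and 5xx failures under separate policies."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError as exc:
            if attempt == max_attempts - 1:
                raise
            # 429: honor the server's hint when present, otherwise back off.
            if exc.retry_after is not None:
                sleep(exc.retry_after)
            else:
                sleep(backoff_delay(attempt))
        except ServerError:
            if attempt == max_attempts - 1:
                raise
            # 5xx: exponential backoff with jitter, never a fixed delay.
            sleep(backoff_delay(attempt))
```

Injecting `sleep` and `rng` keeps the policy testable without real waits; any other exception type propagates immediately rather than being retried.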
Performance
- Batch embeddings and eval runs; avoid one request per row when the provider offers batch or bulk APIs.
- Cache deterministic retrieval, tool metadata, and prompt templates, but never cache tenant-specific model outputs without a data-retention decision.
- Track token, latency, and retry budgets separately for interactive, background, and eval traffic.
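The batching advice above can be illustrated with a small helper that sends inputs in groups instead of one request per row. This is a sketch under assumptions: `embed_batch` is a hypothetical callable wrapping whatever bulk embedding endpoint your provider offers, and the default batch size of 64 is arbitrary, not a provider limit.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts with one call per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        result = embed_batch(batch)  # hypothetical provider wrapper
        if len(result) != len(batch):
            raise ValueError("provider returned a different number of vectors")
        vectors.extend(result)
    return vectors
```

The length check guards against silently misaligned vectors, which would otherwise corrupt the index without raising an error.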
Best Practices
- Prefer raw provider SDKs until orchestration complexity justifies LangGraph, LlamaIndex, or LangChain.
- Keep model, tool, retrieval, and safety decisions configurable per environment; avoid hardcoding preview model names in application logic.
- Treat model output as untrusted input: validate structure, refusal states, tool arguments, and downstream side effects.
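Treating model output as untrusted input might look like the following. It is a minimal sketch: the `action`/`amount` shape, the `ALLOWED_ACTIONS` set, and the 500 limit are all hypothetical, chosen only to show structure checks and side-effect bounds being applied before anything downstream runs.

```python
import json

# Hypothetical allowlist and bound for this example only.
ALLOWED_ACTIONS = {"refund", "escalate", "reply"}
MAX_AMOUNT = 500


def parse_model_action(raw: str) -> dict:
    """Validate a model's JSON reply before any downstream side effect."""
    if not raw or not raw.strip():
        raise ValueError("empty model response")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    action = data.get("action")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ValueError(f"action not allowed: {action!r}")

    amount = data.get("amount", 0)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, (int, float)):
        raise ValueError("amount must be a number")
    if not 0 <= amount <= MAX_AMOUNT:
        raise ValueError("amount out of bounds")

    # Return only the validated fields; unexpected keys are dropped.
    return {"action": action, "amount": float(amount)}
```

Returning a fresh dictionary rather than the parsed object means extra keys the model invented never reach the code that performs the action.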
Workflow
Step 1: Determine the architecture pattern
| Need | Pattern | Start with |
|---|---|---|
| Single model call | Direct API integration | Provider SDK |
| Knowledge-grounded answers | RAG pipeline | Vector store + retrieval |
| Multi-step reasoning | Agent with tools | LangGraph, OpenAI Agents SDK, or custom loop |
| Multiple specialized models | Model routing / chain | Custom router or Vercel AI SDK |
| Offline / air-gapped | Local inference | Ollama or vLLM |
| Existing data enrichment | Batch processing | Provider batch APIs |
Step 2: Choose the right abstraction level
Pick the lightest tool that solves the problem:
- Raw SDK - direct Anthropic/OpenAI SDK calls. Best for simple integrations, maximum control, minimum dependencies. Start here unless you have a specific reason not to.
- Vercel AI SDK - unified provider interface with streaming primitives. Good for TypeScript apps that need provider-agnostic code or React/Next.js streaming UI.
- LangChain / LlamaIndex - orchestration frameworks. Use when you need complex chains, built-in document loaders, or 300+ pre-built integrations. Don't use for simple API calls - the abstraction overhead isn't worth it.
- LangGraph / OpenAI Agents SDK - stateful agent frameworks. Use when you need cycles, persistence, human-in-the-loop, or multi-agent coordination.
The anti-pattern: importing LangChain to make a single API call. That's like importing Django to serve a static HTML file.
Step 3: Implement
Follow the domain-specific sections below. Read the appropriate reference file for detailed patterns and code examples.
Step 4: Evaluate and validate
Every AI feature needs evaluation. Not "run it once and eyeball the output" - structured evals with datasets, metrics, and regression detection.
Minimum viable eval: create a promptfooconfig.yaml with 20+ test cases; use contains,
llm-rubric, and cost assertions; and run npx promptfoo eval in CI on every PR that touches
prompts. Track pass rate over time - any regression blocks the merge.
Read references/evaluation.md for promptfoo setup, assertion types, CI integration (GitHub
Actions example), RAG-specific evals, agent evals, and red teaming patterns.
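The "any regression blocks the merge" rule can be sketched as a small gate that a CI step calls after the eval run. This is not promptfoo's API: `pass_rate` and `regression_detected` are hypothetical helpers, and the list of booleans is assumed to be extracted from whatever results file your eval tool writes.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of eval cases that passed."""
    if not results:
        raise ValueError("no eval results to score")
    return sum(1 for passed in results if passed) / len(results)


def regression_detected(
    current: Sequence[bool],
    baseline_rate: float,
    tolerance: float = 0.0,
) -> bool:
    """True when the current pass rate falls below the stored baseline."""
    return pass_rate(current) < baseline_rate - tolerance
```

A CI script would exit non-zero when `regression_detected` returns True. The `tolerance` parameter is an assumption for absorbing noise from non-deterministic graders; leave it at 0 for a strict gate.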
LLM Integration Patterns
Streaming
Always stream for user-facing responses. Buffer for background processing.
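# Anthropic streaming (Python)
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def stream_reply(prompt: str):
    """Yield text chunks as the model produces them."""
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            yield text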
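The self-check item about error boundaries and cleanup for streams can be sketched as a provider-agnostic wrapper around any chunk iterator. Everything here is an assumption for illustration: `safe_stream`, the `on_close` callback, and the fallback marker are hypothetical names, and a real handler would likely also log the exception.

```python
from typing import Callable, Iterable, Iterator


def safe_stream(
    chunks: Iterable[str],
    on_close: Callable[[], None],
    fallback: str = "\n[response interrupted]",
) -> Iterator[str]:
    """Relay chunks, surface mid-stream failures, and always clean up."""
    try:
        for chunk in chunks:
            yield chunk
    except Exception:
        # Error boundary: tell the user the reply was cut short instead of
        # letting the response end silently mid-sentence.
        yield fallback
    finally:
        # Runs on normal completion, on failure, and when the consumer
        # stops early (for example, a client disconnect closes the generator).
        on_close()
```

Because `GeneratorExit` is not a subclass of `Exception`, an early close by the consumer skips the fallback branch but still reaches `finally`, so `on_close` can release connections or record partial token usage in every case.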
Structured output
Use native provider mechanisms, not regex parsing of free-text responses.
- Anthropic: tool_use with JSON schema, or response_format with json_schema