When this skill is activated, always start your first response with the 🧢 emoji.

LLM App Development

Building production LLM applications requires more than prompt engineering - it demands the same reliability, observability, and safety thinking applied to any critical system. This skill covers the full stack: architecture, guardrails, evaluation pipelines, RAG, function calling, streaming, and cost optimization. It emphasizes when patterns apply and what to do when they fail, not just happy-path implementation.

When to use this skill

Trigger this skill when the user:

Designs the architecture for a new LLM-powered application or feature
Implements content filtering, PII detection, or schema validation on model I/O
Builds or improves an evaluation pipeline (automated evals, human review, A/B tests)
Sets up a RAG pipeline (chunking, embedding, retrieval, reranking)
Adds function calling or tool use to an agent or chat interface
Streams LLM responses to a client (SSE, token-by-token rendering)
Optimizes inference cost or latency (caching, model routing, prompt compression)
Decides whether to fine-tune a model or improve prompting instead

Do NOT trigger this skill for:

Pure ML research, model training from scratch, or academic benchmarking
Questions about a specific AI framework API (use the framework's own skill, e.g., mastra)

Key principles

Evaluate before you ship - A feature without evals is a feature you cannot safely iterate on. Define success metrics and build automated checks before the first production deployment.
Guardrails are non-negotiable - Validate both input and output on every production request. Content filtering, PII scrubbing, and schema validation belong in your request path, not as optional post-processing.
Start with prompting before fine-tuning - Fine-tuning is expensive, slow to iterate, and hard to maintain. Exhaust systematic prompt engineering, few-shot examples, and RAG before considering fine-tuning.
Design for failure and fallback - LLM calls fail: timeouts, rate limits, malformed outputs, hallucinations. Every integration needs retry logic, output validation, and a fallback response.
Cost-optimize from day one - Track token usage per feature. Cache deterministic outputs. Route cheap queries to smaller models. Set hard budget limits.
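
The "design for failure and fallback" principle can be sketched as a small wrapper around any model call. This is a minimal sketch under stated assumptions, not a definitive implementation: `withRetry`, `RetryOptions`, and the delay values are hypothetical names and defaults, and the sketch treats every thrown error as retryable, which a real integration would likely narrow to timeouts and rate limits.

```typescript
// Hypothetical helper: retry an async call with exponential backoff, then fall back.
interface RetryOptions {
  maxAttempts: number // total attempts, including the first one
  baseDelayMs: number // wait before the second attempt; doubles after each failure
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

async function withRetry<T>(
  fn: () => Promise<T>,
  fallback: T,
  opts: RetryOptions = { maxAttempts: 3, baseDelayMs: 200 },
): Promise<T> {
  for (let attempt = 1; attempt <= opts.maxAttempts; attempt++) {
    try {
      return await fn()
    } catch {
      // Assumption: every error is retryable. Skip the wait after the final attempt.
      if (attempt < opts.maxAttempts) await sleep(opts.baseDelayMs * 2 ** (attempt - 1))
    }
  }
  // All attempts failed: return the caller-supplied fallback response.
  return fallback
}
```

A caller might wrap a model call as `withRetry(() => callModel(prompt), 'Sorry, please try again later.')`, where `callModel` stands for whatever function issues the request.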

Core concepts

LLM app stack

User input
    -> Input guardrails (safety, PII, token limits)
    -> Prompt construction (system prompt, context, few-shots, retrieved docs)
    -> Model call (streaming or batch)
    -> Output guardrails (schema validation, content check, hallucination detection)
    -> Post-processing (formatting, citations, structured extraction)
    -> Response to user

Every layer is an independent failure point and must be observable.
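
One way to make each layer observable is to run it through a wrapper that records its name, duration, and outcome. The following is a minimal sketch: `runStage` and `StageRecord` are hypothetical names, and the in-memory `trace` array stands in for whatever logging or tracing backend the application actually uses.

```typescript
// Hypothetical trace record for one layer of the stack.
interface StageRecord {
  stage: string
  ok: boolean
  ms: number
  error?: string
}

// Runs one stage, appends a record to the trace, and rethrows on failure
// so the caller can still apply its own retry or fallback logic.
async function runStage<I, O>(
  trace: StageRecord[],
  stage: string,
  input: I,
  fn: (input: I) => Promise<O> | O,
): Promise<O> {
  const start = Date.now()
  try {
    const out = await fn(input)
    trace.push({ stage, ok: true, ms: Date.now() - start })
    return out
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err)
    trace.push({ stage, ok: false, ms: Date.now() - start, error: message })
    throw err
  }
}
```

Chaining the layers through `runStage` in order (input guardrails, prompt construction, model call, output guardrails, post-processing) leaves a per-request trace that shows which layer failed and how long each one took.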

Embedding / vector DB architecture

Documents are chunked into overlapping segments, embedded into dense vectors, and stored in a vector database. At query time the user message is embedded, similar chunks are retrieved via ANN search, optionally reranked by a cross-encoder, and injected into the context window. Chunk quality determines retrieval quality more than model choice.
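
The retrieval step can be illustrated with a brute-force top-k search over stored vectors. This sketch is an exact linear scan used as a stand-in for ANN search, which a vector database would perform approximately and at much larger scale. `StoredChunk`, `cosineSimilarity`, and `topK` are hypothetical names, and the sketch assumes all vectors are already computed and have the same length.

```typescript
// Hypothetical shape of a chunk stored alongside its embedding.
interface StoredChunk {
  id: string
  text: string
  vector: number[]
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB)
  return denom === 0 ? 0 : dot / denom
}

// Exact top-k by similarity to the query vector, most similar first.
function topK(query: number[], chunks: StoredChunk[], k: number): StoredChunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(x => x.chunk)
}
```

The returned chunks would then be reranked if a cross-encoder is available and concatenated into the prompt context.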

Caching strategies

Layer	What to cache	TTL
Exact cache	Identical prompt+params hash	Hours to days
Semantic cache	Fuzzy-match on embedding similarity	Minutes to hours
Embedding cache	Vectors for known documents	Until doc changes
KV prefix cache	Shared system prompt prefix (provider-side)	Session
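
The exact-cache layer from the table can be sketched as an in-memory map with a TTL. This is a minimal sketch with hypothetical names (`ExactCache`, `CacheEntry`): it uses the canonical JSON of model, parameters, and prompt directly as the key, where a production version would more likely hash that string and store entries in a shared cache rather than process memory.

```typescript
interface CacheEntry {
  value: string
  expiresAt: number // epoch milliseconds
}

class ExactCache {
  private entries = new Map<string, CacheEntry>()

  // `now` is injectable so expiry can be tested without waiting.
  constructor(private ttlMs: number, private now: () => number = () => Date.now()) {}

  // Builds a key that is stable regardless of parameter order.
  static key(model: string, prompt: string, params: Record<string, number> = {}): string {
    const sortedParams = Object.keys(params).sort().map(k => [k, params[k]])
    return JSON.stringify([model, sortedParams, prompt])
  }

  get(key: string): string | undefined {
    const entry = this.entries.get(key)
    if (!entry) return undefined
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key) // expired: evict and report a miss
      return undefined
    }
    return entry.value
  }

  set(key: string, value: string): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }
}
```

An exact cache only makes sense for requests whose output is expected to be deterministic, for example calls made with temperature 0.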

Common tasks

Design LLM app architecture

Key decisions before writing code:

Decision	Options	Guide
Context strategy	Long context vs RAG	RAG if >50% of context is static documents
Output mode	Free text, structured JSON, tool calls	Use structured output for any downstream processing
State	Stateless, session, persistent memory	Default stateless; add memory only when proven necessary

import OpenAI from 'openai'

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

async function callLLM(systemPrompt: string, userMessage: string, model = 'gpt-4o-mini'): Promise<string> {
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), 30_000)
  try {
    const res = await client.chat.completions.create(
      { model, max_tokens: 1024, messages: [{ role: 'system', content: systemPrompt }, { role: 'user', content: userMessage }] },
      { signal: controller.signal },
    )
    return res.choices[0].message.content ?? ''
  } finally {
    clearTimeout(timeout)
  }
}

Implement input/output guardrails

import { z } from 'zod'

const PII_PATTERNS = [
  /\b\d{3}-\d{2}-\d{4}\b/g,                              // SSN
  /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi,        // email
  /\b(?:\d{4}[ -]?){3}\d{4}\b/g,                         // credit card
]

function scrubPII(text: string): string {
  return PII_PATTERNS.reduce((t, re) => t.replace(re, '[REDACTED]'), text)
}

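function validateInput(text: string): { ok: boolean; reason?: string } {
  // Word count is only a rough proxy for tokens; use a tokenizer when the limit must be exact
  if (text.trim().split(/\s+/).length > 4000) return { ok: false, reason: 'Input too long' }
  return { ok: true }
}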

const SummarySchema = z.object({
  summary: z.string().min(10).max(500),
  keyPoints: z.array(z.string()).min(1).max(10),
  confidence: z.number().min(0).max(1),
})

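async function getSummaryWithGuardrails(text: string) {
  const v = validateInput(text)
  if (!v.ok) throw new Error(`Input rejected: ${v.reason}`)
  // Name the expected keys so the model's output can match SummarySchema
  const system = 'Respond only with valid JSON with keys "summary" (string), "keyPoints" (string array), and "confidence" (number from 0 to 1).'
  const raw = await callLLM(system, `Summarize as JSON: ${scrubPII(text)}`)
  // JSON.parse throws SyntaxError on malformed JSON; SummarySchema.parse throws ZodError on a schema mismatch
  return SummarySchema.parse(JSON.parse(raw))
}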
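
Because malformed output is an expected failure mode rather than an exceptional one, it can help to return a result value instead of throwing. The following is a minimal sketch: `parseModelJson` and `ParseResult` are hypothetical names, and the brace-matching heuristic assumes the payload is a single JSON object, possibly surrounded by prose or a Markdown fence.

```typescript
type ParseResult<T> = { ok: true; value: T } | { ok: false; error: string }

// Extracts the outermost {...} span, parses it, and validates it.
// `validate` should throw on invalid data and return the typed value otherwise.
function parseModelJson<T>(raw: string, validate: (value: unknown) => T): ParseResult<T> {
  const start = raw.indexOf('{')
  const end = raw.lastIndexOf('}')
  if (start === -1 || end <= start) {
    return { ok: false, error: 'No JSON object found in model output' }
  }
  try {
    return { ok: true, value: validate(JSON.parse(raw.slice(start, end + 1))) }
  } catch (err) {
    // Covers both JSON syntax errors and validation errors.
    return { ok: false, error: err instanceof Error ? err.message : String(err) }
  }
}
```

With the schema above, the validator could be `v => SummarySchema.parse(v)`, and an `ok: false` result is the point at which to retry the call or return a fallback response.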

Build an evaluation pipeline

interface EvalCase {
  id: string
  input: string
  expectedContains?: string[]
  expectedNotContains?: string[]
  scoreThreshold?: number  // 0-1 for LLM-as-judge
}

async function runEval(ec: EvalCase, modelFn: (input: string) => Promise<string>) {
  const output = await modelFn(ec.input)
  for (const s of ec.expectedContains ?? [])
    if (!output.includes(s)) return { id: ec.id, passed: false, details: `Missing: "${s}"` }
  for (const s of ec.expectedNotContains ?? [])
    if (output.includes(s)) return { id: ec.id, passed: false, details: `Forbidden: "${s}"` }
  if (ec.scoreThreshold !== undefined) {
    const score = await judgeOutput(ec.input, output)
    if (score < ec.scoreThreshold) return { id: ec.id, passed: false, details: `Score ${score} < ${ec.scoreThreshold}` }
  }
  return { id: ec.id, passed: true, details: 'OK' }
}

async function judgeOutput(input: string, output: string): Promise<number> {
  const reply = await callLLM(
    'You are a strict evaluator. Reply with only a number from 0.0 to 1.0.',
    `Input: ${input}\n\nOutput: ${output}\n\nScore quality (0.0=poor, 1.0=excellent):`,
    'gpt-4o',
  )
  const score = parseFloat(reply)
  // A non-numeric reply parses to NaN, and NaN < threshold is false, so it would silently pass the eval
  if (Number.isNaN(score)) return 0
  return Math.min(1, Math.max(0, score))
}
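
Individual eval results become useful once they are aggregated into a pass rate that can gate a deployment. This is a minimal sketch: `summarizeEvals`, `EvalSummary`, and the 0.9 default threshold are hypothetical, and `EvalResult` simply mirrors the object shape returned by `runEval`.

```typescript
interface EvalResult {
  id: string
  passed: boolean
  details: string
}

interface EvalSummary {
  total: number
  passed: number
  passRate: number
  failures: EvalResult[]
  gatePassed: boolean
}

function summarizeEvals(results: EvalResult[], minPassRate = 0.9): EvalSummary {
  const failures = results.filter(r => !r.passed)
  const passed = results.length - failures.length
  // An empty suite yields a pass rate of 0, so it fails the gate: no evidence is not a pass.
  const passRate = results.length === 0 ? 0 : passed / results.length
  return { total: results.length, passed, passRate, failures, gatePassed: passRate >= minPassRate }
}
```

In a CI job, the results could come from `await Promise.all(cases.map(c => runEval(c, modelFn)))`, with the job failing whenever `gatePassed` is false and printing `failures` for triage.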

Load references/evaluation-framework.md for metrics, benchmarks, and human-in-the-loop protocols.

Implement RAG with vector search

import OpenAI from 'openai'

const client = new OpenAI()

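function chunkText(text: string, size = 512, overlap = 64): string[] {
  // The step (size - overlap) must be positive, otherwise the loop never advances
  if (overlap >= size) throw new Error('overlap must be smaller than size')
  const words = text.split(/\s+/).filter(Boolean)  // drop empty tokens from leading/trailing whitespace
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push(words.slice(i, i + size).join(' '))
    if (i + size >= words.length) break  // the last window already reached the end
  }
  return chunks
}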

async function embedTexts(texts: string[]): Promise<number[][]> {
  const res = await client.embeddings.create({ model: 'text-embedding-3-small', input: texts })
  return res.data.map(d => d.embedding)
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0)
  const normA = Math.sqrt(a.reduce((s, v) => s + v * v, 0))
  const normB = Math.sqrt(b.reduce((s, v) => s + v * v, 0))
  return dot / (normA * normB)
}
