When this skill is activated, always start your first response with the 🧢 emoji.
LLM App Development
Building production LLM applications requires more than prompt engineering - it demands the same reliability, observability, and safety thinking applied to any critical system. This skill covers the full stack: architecture, guardrails, evaluation pipelines, RAG, function calling, streaming, and cost optimization. It emphasizes when patterns apply and what to do when they fail, not just happy-path implementation.
When to use this skill
Trigger this skill when the user:
- Designs the architecture for a new LLM-powered application or feature
- Implements content filtering, PII detection, or schema validation on model I/O
- Builds or improves an evaluation pipeline (automated evals, human review, A/B tests)
- Sets up a RAG pipeline (chunking, embedding, retrieval, reranking)
- Adds function calling or tool use to an agent or chat interface
- Streams LLM responses to a client (SSE, token-by-token rendering)
- Optimizes inference cost or latency (caching, model routing, prompt compression)
- Decides whether to fine-tune a model or improve prompting instead
Do NOT trigger this skill for:
- Pure ML research, model training from scratch, or academic benchmarking
- Questions about a specific AI framework API (use the framework's own skill, e.g., mastra)
Key principles
- Evaluate before you ship - A feature without evals is a feature you cannot safely iterate on. Define success metrics and build automated checks before the first production deployment.
- Guardrails are non-negotiable - Validate both input and output on every production request. Content filtering, PII scrubbing, and schema validation belong in your request path, not as optional post-processing.
- Start with prompting before fine-tuning - Fine-tuning is expensive, slow to iterate, and hard to maintain. Exhaust systematic prompt engineering, few-shot examples, and RAG before considering fine-tuning.
- Design for failure and fallback - LLM calls fail: timeouts, rate limits, malformed outputs, hallucinations. Every integration needs retry logic, output validation, and a fallback response.
- Cost-optimize from day one - Track token usage per feature. Cache deterministic outputs. Route cheap queries to smaller models. Set hard budget limits.
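The "design for failure and fallback" principle can be sketched as a small wrapper. This is a minimal illustration, not a prescribed implementation: `backoffDelayMs` and `withRetryAndFallback` are hypothetical names, and the default delays and retry count are assumptions to tune per provider.

```typescript
// Delay before retry number `attempt` (0-based): base, 2x, 4x, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt)
}

// Hypothetical wrapper: retries a failing async call with exponential backoff,
// then returns the fallback value instead of throwing.
async function withRetryAndFallback<T>(fn: () => Promise<T>, fallback: T, retries = 3): Promise<T> {
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn()
    } catch {
      if (attempt === retries) break // out of attempts: fall through to the fallback
      await new Promise(resolve => setTimeout(resolve, backoffDelayMs(attempt)))
    }
  }
  return fallback
}
```

This sketch retries on every error for brevity. A production version would likely retry only transient failures (timeouts, rate limits, 5xx responses) and surface the rest immediately.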
Core concepts
LLM app stack
User input
-> Input guardrails (safety, PII, token limits)
-> Prompt construction (system prompt, context, few-shots, retrieved docs)
-> Model call (streaming or batch)
-> Output guardrails (schema validation, content check, hallucination detection)
-> Post-processing (formatting, citations, structured extraction)
-> Response to user
Every layer is an independent failure point and must be observable.
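One way to make each layer observable is to wrap every stage in a tracing helper that reports its name, outcome, and duration. The sketch below is an assumption about how that could look: `traced`, `StageEvent`, and the `emit` callback are hypothetical names, and `emit` stands in for whatever logger or metrics client the application already uses.

```typescript
interface StageEvent {
  stage: string
  ok: boolean
  ms: number
  error?: string
}

// Hypothetical wrapper: runs one pipeline stage and reports its outcome and duration.
async function traced<T>(
  stage: string,
  fn: () => Promise<T> | T,
  emit: (e: StageEvent) => void,
): Promise<T> {
  const start = Date.now()
  try {
    const result = await fn()
    emit({ stage, ok: true, ms: Date.now() - start })
    return result
  } catch (err) {
    const error = err instanceof Error ? err.message : String(err)
    emit({ stage, ok: false, ms: Date.now() - start, error })
    throw err // re-throw so the caller's retry or fallback logic still runs
  }
}
```

Wrapping each arrow in the diagram above with the same helper gives one event per layer per request, which is enough to see where latency and failures concentrate.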
Embedding / vector DB architecture
Documents are chunked into overlapping segments, embedded into dense vectors, and stored in a vector database. At query time the user message is embedded, similar chunks are retrieved via ANN search, optionally reranked by a cross-encoder, and injected into the context window. Chunk quality determines retrieval quality more than model choice.
Caching strategies
| Layer | What to cache | TTL |
|---|---|---|
| Exact cache | Identical prompt+params hash | Hours to days |
| Semantic cache | Fuzzy-match on embedding similarity | Minutes to hours |
| Embedding cache | Vectors for known documents | Until doc changes |
| KV prefix cache | Shared system prompt prefix (provider-side) | Session |
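The exact-cache row can be sketched as an in-memory map keyed by a hash of everything that affects the output. `ExactCache` is a hypothetical class; the in-memory `Map` and the injectable clock are assumptions for illustration, and a shared store such as Redis would be the more likely choice across multiple instances.

```typescript
import { createHash } from 'node:crypto'

interface CacheEntry {
  value: string
  expiresAt: number
}

// Hypothetical exact-match cache with a fixed TTL.
class ExactCache {
  private store = new Map<string, CacheEntry>()
  private ttlMs: number
  private now: () => number

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs
    this.now = now
  }

  // The key covers model, params, and full prompt. Params are sorted by name
  // so that key order in the object does not change the hash.
  static key(model: string, prompt: string, params: Record<string, unknown> = {}): string {
    const sortedParams = Object.keys(params).sort().map(k => [k, params[k]])
    return createHash('sha256').update(JSON.stringify([model, sortedParams, prompt])).digest('hex')
  }

  get(key: string): string | undefined {
    const hit = this.store.get(key)
    if (!hit) return undefined
    if (hit.expiresAt <= this.now()) {
      this.store.delete(key) // expired: evict and report a miss
      return undefined
    }
    return hit.value
  }

  set(key: string, value: string): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs })
  }
}
```

An exact cache only makes sense for deterministic requests, so it is usually paired with a temperature of 0 or bypassed otherwise.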
Common tasks
Design LLM app architecture
Key decisions before writing code:
| Decision | Options | Guide |
|---|---|---|
| Context strategy | Long context vs RAG | RAG if >50% of context is static documents |
| Output mode | Free text, structured JSON, tool calls | Use structured output for any downstream processing |
| State | Stateless, session, persistent memory | Default stateless; add memory only when proven necessary |
import OpenAI from 'openai'

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY })

async function callLLM(systemPrompt: string, userMessage: string, model = 'gpt-4o-mini'): Promise<string> {
  // Abort the request if the provider has not responded within 30 seconds.
  const controller = new AbortController()
  const timeout = setTimeout(() => controller.abort(), 30_000)
  try {
    const res = await client.chat.completions.create(
      {
        model,
        max_tokens: 1024,
        messages: [
          { role: 'system', content: systemPrompt },
          { role: 'user', content: userMessage },
        ],
      },
      { signal: controller.signal },
    )
    return res.choices[0].message.content ?? ''
  } finally {
    clearTimeout(timeout)
  }
}
Implement input/output guardrails
import { z } from 'zod'

const PII_PATTERNS = [
  /\b\d{3}-\d{2}-\d{4}\b/g, // SSN
  /\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b/gi, // email
  /\b(?:\d{4}[ -]?){3}\d{4}\b/g, // credit card
]

function scrubPII(text: string): string {
  return PII_PATTERNS.reduce((t, re) => t.replace(re, '[REDACTED]'), text)
}

function validateInput(text: string): { ok: boolean; reason?: string } {
  // Word count is a rough stand-in for a token limit, not an exact token count.
  if (text.split(/\s+/).length > 4000) return { ok: false, reason: 'Input too long' }
  return { ok: true }
}

const SummarySchema = z.object({
  summary: z.string().min(10).max(500),
  keyPoints: z.array(z.string()).min(1).max(10),
  confidence: z.number().min(0).max(1),
})

async function getSummaryWithGuardrails(text: string) {
  const v = validateInput(text)
  if (!v.ok) throw new Error(`Input rejected: ${v.reason}`)
  // Name the expected fields in the prompt so the reply can satisfy SummarySchema.
  const raw = await callLLM(
    'Respond only with valid JSON of the form {"summary": string, "keyPoints": string[], "confidence": number between 0 and 1}.',
    `Summarize as JSON: ${scrubPII(text)}`,
  )
  // JSON.parse throws SyntaxError on malformed JSON; parse() throws ZodError on a schema mismatch.
  return SummarySchema.parse(JSON.parse(raw))
}
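Models often wrap JSON in explanatory text or a Markdown code fence even when told not to, which makes a bare `JSON.parse` throw. A tolerant extraction step before schema validation could look like the sketch below. `extractJsonObject` and `JsonResult` are hypothetical names, and taking the span from the first `{` to the last `}` is a simplifying assumption that fits single-object replies only.

```typescript
type JsonResult = { ok: true; value: unknown } | { ok: false; reason: string }

// Hypothetical helper: pulls the outermost {...} span out of a model reply and parses it.
// Returns a result object instead of throwing so the caller can retry or fall back.
function extractJsonObject(raw: string): JsonResult {
  const start = raw.indexOf('{')
  const end = raw.lastIndexOf('}')
  if (start === -1 || end <= start) return { ok: false, reason: 'No JSON object found' }
  try {
    return { ok: true, value: JSON.parse(raw.slice(start, end + 1)) }
  } catch (err) {
    return { ok: false, reason: err instanceof Error ? err.message : 'Invalid JSON' }
  }
}
```

The parsed `value` is still untrusted at this point. It should go through schema validation before anything downstream reads it.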
Build an evaluation pipeline
interface EvalCase {
  id: string
  input: string
  expectedContains?: string[]
  expectedNotContains?: string[]
  scoreThreshold?: number // 0-1 for LLM-as-judge
}

async function runEval(ec: EvalCase, modelFn: (input: string) => Promise<string>) {
  const output = await modelFn(ec.input)
  for (const s of ec.expectedContains ?? [])
    if (!output.includes(s)) return { id: ec.id, passed: false, details: `Missing: "${s}"` }
  for (const s of ec.expectedNotContains ?? [])
    if (output.includes(s)) return { id: ec.id, passed: false, details: `Forbidden: "${s}"` }
  if (ec.scoreThreshold !== undefined) {
    const score = await judgeOutput(ec.input, output)
    if (score < ec.scoreThreshold) return { id: ec.id, passed: false, details: `Score ${score} < ${ec.scoreThreshold}` }
  }
  return { id: ec.id, passed: true, details: 'OK' }
}

async function judgeOutput(input: string, output: string): Promise<number> {
  const reply = await callLLM(
    'You are a strict evaluator. Reply with only a number from 0.0 to 1.0.',
    `Input: ${input}\n\nOutput: ${output}\n\nScore quality (0.0=poor, 1.0=excellent):`,
    'gpt-4o',
  )
  const score = parseFloat(reply)
  // Treat an unparseable reply as 0: NaN would make `score < threshold` false and pass the case.
  if (Number.isNaN(score)) return 0
  return Math.min(1, Math.max(0, score))
}
Load references/evaluation-framework.md for metrics, benchmarks, and human-in-the-loop protocols.
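Per-case results become useful as a release gate once they are aggregated. A minimal sketch, assuming results in the `{ id, passed, details }` shape that `runEval` returns: `summarizeEvals` is a hypothetical name, and the 0.9 default pass rate is an assumed threshold, not a recommendation.

```typescript
interface EvalResult {
  id: string
  passed: boolean
  details: string
}

// Hypothetical aggregate: pass rate plus the failing cases, for use as a CI gate.
function summarizeEvals(results: EvalResult[], minPassRate = 0.9) {
  const failures = results.filter(r => !r.passed)
  // An empty suite counts as a 0 pass rate so that a missing eval set cannot pass the gate.
  const passRate = results.length === 0 ? 0 : (results.length - failures.length) / results.length
  return { passRate, failures, gatePassed: passRate >= minPassRate }
}
```

In CI, a failed gate would typically print `failures` and exit non-zero so that a prompt or model change cannot ship with a regression.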
Implement RAG with vector search
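import OpenAI from 'openai'

const client = new OpenAI()

// `size` and `overlap` are measured in words, not tokens.
function chunkText(text: string, size = 512, overlap = 64): string[] {
  // A non-positive step (overlap >= size) would never advance the window.
  if (overlap >= size) throw new RangeError('overlap must be smaller than size')
  const words = text.split(/\s+/).filter(Boolean) // drop empty strings from leading/trailing whitespace
  const chunks: string[] = []
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push(words.slice(i, i + size).join(' '))
    if (i + size >= words.length) break
  }
  return chunks
}

async function embedTexts(texts: string[]): Promise<number[][]> {
  const res = await client.embeddings.create({ model: 'text-embedding-3-small', input: texts })
  return res.data.map(d => d.embedding)
}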
function cosine(a: number[], b: number[]): number {
const dot = a.reduce((s, v, i) => s + v * b[i], 0)
return dot / (Math.sqrt(a.reduce((s, v) => s + v * v, 0))