Ollama Patterns

Quick Guide: Use the ollama npm package to run LLMs locally. Use ollama.chat() for conversations and ollama.generate() for single prompts. Enable streaming with stream: true and iterate with for await. Use format with a JSON schema (via zodToJsonSchema) for structured outputs. Use tools array for function calling. Use ollama.embed() for embeddings. Models run on your machine -- no API keys required for local use, but be aware of model loading time and memory usage.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST use ollama.chat() for conversations and ollama.generate() for single-prompt completions -- they have different parameter shapes)

(You MUST handle model loading delays -- the first request to a model that is not yet loaded into memory takes significantly longer because the model has to be initialized first)

(You MUST use zodToJsonSchema() from zod-to-json-schema for structured outputs -- do NOT manually construct JSON schemas)

(You MUST accumulate streamed thinking, content, and tool_calls fields to maintain conversation history in multi-turn interactions)

(You MUST never assume a model is already pulled -- check with ollama.list() or handle errors from missing models gracefully)

</critical_requirements>
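
The "never assume a model is already pulled" requirement can be sketched as a small check against ollama.list(). This is a minimal sketch, not the library's own API: the helper names (isModelPulled, withDefaultTag) and the ModelLister interface are hypothetical, and the sketch assumes the list response exposes a models array whose entries carry a name such as "llama3.1:latest".

```typescript
// Minimal structural type for the one client method this sketch needs.
// Assumption: the default `ollama` import satisfies this shape.
interface ModelLister {
  list(): Promise<{ models: Array<{ name: string }> }>;
}

const DEFAULT_TAG = "latest";

// Hypothetical helper: treats "llama3.1" as "llama3.1:latest".
function withDefaultTag(modelName: string): string {
  return modelName.includes(":") ? modelName : `${modelName}:${DEFAULT_TAG}`;
}

// Hypothetical helper: resolves to true when the model shows up in list().
async function isModelPulled(
  client: ModelLister,
  modelName: string,
): Promise<boolean> {
  const wanted = withDefaultTag(modelName);
  const { models } = await client.list();
  return models.some((model) => withDefaultTag(model.name) === wanted);
}
```

With the real client, a caller could run isModelPulled(ollama, "llama3.1") before the first chat call and fall back to ollama.pull({ model: "llama3.1" }) or a clear error message when it resolves to false.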

Auto-detection: Ollama, ollama, ollama.chat, ollama.generate, ollama.embed, ollama.pull, ollama.list, ollama.show, ollama.delete, ollama.ps, ollama.abort, ollama.create, keep_alive, zodToJsonSchema, OLLAMA_HOST, llama3, mistral, qwen, gemma, phi, deepseek, local LLM

When to use:

Running LLMs locally for development, testing, or privacy-sensitive workloads
Building chat applications with local models (Llama, Mistral, Qwen, Gemma, etc.)
Extracting structured data from text or images using local models with JSON schemas
Implementing tool calling / function calling with locally-hosted models
Generating embeddings for RAG or semantic search without cloud API costs
Managing local model lifecycle (pull, list, show, delete, copy)
Prototyping AI features before committing to a cloud provider

Key patterns covered:

Client setup (default and custom instances)
Chat completions (ollama.chat) and text generation (ollama.generate)
Streaming with for await and accumulated state
Structured output with format + zodToJsonSchema
Tool calling with tools array and multi-turn tool loops
Vision / multimodal inputs with images parameter
Embeddings with ollama.embed()
Model management (pull, list, show, delete, copy, ps)
OpenAI-compatible endpoint for drop-in migration

When NOT to use:

Production workloads requiring guaranteed uptime and SLAs -- use a cloud LLM provider
Multi-provider applications where you need to switch between OpenAI, Anthropic, Google -- use a unified provider SDK
Applications requiring the latest proprietary models (GPT-5, Claude) -- those are cloud-only

Examples Index

Core: Setup, Chat & Generate -- Client init, chat, generate, streaming, error handling
Tool Calling -- Tool definitions, single/parallel calls, multi-turn agent loops
Structured Output -- JSON schema via Zod, vision extraction
Embeddings & Vision -- Embeddings, image analysis, multimodal
Model Management -- Pull, list, show, delete, copy, ps
Quick API Reference -- Method signatures, options, response types, model names

<philosophy>

Philosophy

The Ollama JavaScript library is a thin client over Ollama's local REST API (default http://127.0.0.1:11434). It provides direct access to locally-running open-source LLMs with zero cloud dependencies.

Core principles:

Local-first -- Models run on your hardware. No API keys required for local use, complete data privacy, no per-token costs. Trade-off: you need sufficient GPU/CPU memory.
Simple API -- ollama.chat() and ollama.generate() are the two primary methods. The default import is a pre-configured singleton client; create custom instances with new Ollama() for non-default hosts.
Streaming by default in REST, opt-in in SDK -- The REST API streams by default. The SDK returns full responses by default; set stream: true to get an AsyncGenerator.
Model-agnostic -- The same API works with any Ollama-supported model (Llama, Mistral, Qwen, Gemma, Phi, DeepSeek, etc.). Model capabilities (vision, tool calling, structured output) depend on the model.
OpenAI-compatible -- Ollama exposes /v1/chat/completions and /v1/embeddings endpoints, allowing the OpenAI SDK to connect with baseURL: 'http://localhost:11434/v1'.

</philosophy>
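
The OpenAI-compatible principle can be illustrated without any SDK by building the HTTP request by hand. This is a rough sketch under assumptions: buildChatCompletionRequest is a hypothetical helper, the body follows the OpenAI chat completions request shape (model, messages, stream), and the base URL is the one quoted above.

```typescript
const OPENAI_COMPAT_BASE_URL = "http://localhost:11434/v1";

interface CompatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface CompatRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds the request but does not send it.
function buildChatCompletionRequest(
  model: string,
  messages: CompatMessage[],
  baseUrl: string = OPENAI_COMPAT_BASE_URL,
): CompatRequest {
  return {
    // Strip trailing slashes so "…/v1/" and "…/v1" give the same URL.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages, stream: false }),
    },
  };
}

// Sending it needs a running Ollama server, so it is left as a comment:
// const { url, init } = buildChatCompletionRequest("llama3.1", [
//   { role: "user", content: "Hello" },
// ]);
// const data = await (await fetch(url, init)).json();
// console.log(data.choices[0].message.content);
```

Keeping the builder separate from the fetch call makes the request shape easy to inspect in tests without a server.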

Core Patterns

Pattern 1: Client Setup

The default import is a pre-configured singleton pointing to http://127.0.0.1:11434.

// lib/ollama.ts -- default client (most common)
import ollama from "ollama";

// Use directly -- connects to localhost:11434
const response = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Hello" }],
});

// lib/ollama.ts -- custom client for non-default host
import { Ollama } from "ollama";

const ollama = new Ollama({
  host: "http://192.168.1.100:11434",
});

export { ollama };

Why good: Minimal setup, default client requires zero configuration, custom client for remote servers

// BAD: Hardcoding host inline everywhere
import { Ollama } from "ollama";
const response = await new Ollama({ host: "http://192.168.1.100:11434" }).chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Hello" }],
});

Why bad: Creates a new client instance per request, no reuse, host scattered across codebase

See: examples/core.md for cloud API setup, custom headers, browser usage
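
To keep the host in one place, the custom client can take its host from configuration instead of a literal. The sketch below assumes the application reads the OLLAMA_HOST environment variable itself; resolveOllamaHost is a hypothetical helper, and adding an http:// prefix to bare host:port values is an assumption of this sketch, not documented library behavior.

```typescript
const DEFAULT_OLLAMA_HOST = "http://127.0.0.1:11434";

// Hypothetical helper: picks the host once so it is not scattered around.
function resolveOllamaHost(env: Record<string, string | undefined>): string {
  const raw = env.OLLAMA_HOST?.trim();
  if (!raw) return DEFAULT_OLLAMA_HOST;
  // Assumption: values like "192.168.1.100:11434" should get an http scheme.
  const withScheme = /^https?:\/\//.test(raw) ? raw : `http://${raw}`;
  // Drop trailing slashes so the result is stable.
  return withScheme.replace(/\/+$/, "");
}

// In lib/ollama.ts this would feed the custom client shown above:
// const ollama = new Ollama({ host: resolveOllamaHost(process.env) });
```

Passing the environment in as an argument keeps the helper easy to test and avoids reading process.env in more than one module.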

Pattern 2: Chat Completions

Multi-turn conversations with message history. You manage the messages array.

import ollama from "ollama";

const response = await ollama.chat({
  model: "llama3.1",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain TypeScript generics." },
  ],
});

console.log(response.message.content);

Why good: Clear message roles, system message for behavior control, direct content access

// BAD: Not checking response, no system message
const res = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "do something" }],
});

Why bad: No system instruction means unpredictable behavior, the prompt is vague, and the response is never read

See: examples/core.md for multi-turn conversations, model options
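
Because the caller owns the messages array, a multi-turn exchange comes down to appending the user message, calling chat, and appending the reply. A minimal sketch, assuming the non-streaming response shape shown above (response.message.content); sendTurn and the TurnChatClient interface are hypothetical names introduced for illustration.

```typescript
interface TurnMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Minimal structural type for the non-streaming chat call.
// Assumption: the default `ollama` import satisfies this shape.
interface TurnChatClient {
  chat(request: {
    model: string;
    messages: TurnMessage[];
  }): Promise<{ message: { role: string; content: string } }>;
}

// Hypothetical helper: runs one turn and returns the extended history.
// The input history is not mutated.
async function sendTurn(
  client: TurnChatClient,
  model: string,
  history: TurnMessage[],
  userText: string,
): Promise<TurnMessage[]> {
  const messages: TurnMessage[] = [
    ...history,
    { role: "user", content: userText },
  ];
  const response = await client.chat({ model, messages });
  // Store the reply so the next turn sees the whole conversation.
  return [
    ...messages,
    { role: "assistant", content: response.message.content },
  ];
}
```

Each call returns a new array, so the caller can keep the previous history around if a turn needs to be retried.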

Pattern 3: Text Generation

Single-prompt completions without message history. Simpler than chat for one-shot tasks.

import ollama from "ollama";

const response = await ollama.generate({
  model: "llama3.1",
  prompt: "Write a haiku about TypeScript.",
  system: "You are a creative writer.",
});

console.log(response.response);

Why good: Simpler API for one-shot tasks, system parameter instead of message array

See: examples/core.md for generate with images, suffix, raw mode
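
The difference in parameter shapes between Pattern 2 and Pattern 3 can be made concrete by mapping a generate-style input onto a chat-style messages array. This is only an illustration of the two shapes, with hypothetical names (OneShotPrompt, toChatMessages); it does not claim that both calls produce the same output.

```typescript
// Shape used by generate in Pattern 3: a prompt plus an optional system text.
interface OneShotPrompt {
  prompt: string;
  system?: string;
}

// Shape used by chat in Pattern 2: an ordered list of role-tagged messages.
interface ShapeMessage {
  role: "system" | "user";
  content: string;
}

// Hypothetical helper: expresses a one-shot prompt as chat messages.
function toChatMessages({ prompt, system }: OneShotPrompt): ShapeMessage[] {
  const messages: ShapeMessage[] = [];
  if (system) messages.push({ role: "system", content: system });
  messages.push({ role: "user", content: prompt });
  return messages;
}
```

The results are read differently as well: response.response for generate, response.message.content for chat.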

Pattern 4: Streaming

Set stream: true to get an AsyncGenerator. Iterate with for await.

import ollama from "ollama";

const stream = await ollama.chat({
  model: "llama3.1",
  messages: [{ role: "user", content: "Explain async/await." }],
  stream: true,
});

for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}
console.log(); // newline

Why good: Progressive output for better UX, memory-efficient for
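
For multi-turn use, the streamed chunks have to be folded back into one assistant message, as required in the critical requirements (thinking, content, tool_calls). A minimal sketch under assumptions: the chunk field names are taken from that requirement, while accumulateStream and the interfaces below are hypothetical and cover only the fields the sketch reads.

```typescript
interface StreamToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}

// Only the chunk fields this sketch reads; real chunks carry more.
interface StreamChunk {
  message: {
    content?: string;
    thinking?: string;
    tool_calls?: StreamToolCall[];
  };
}

interface AccumulatedMessage {
  role: "assistant";
  content: string;
  thinking: string;
  tool_calls: StreamToolCall[];
}

// Hypothetical helper: consumes the stream and returns one merged message.
async function accumulateStream(
  stream: AsyncIterable<StreamChunk>,
  onContent?: (text: string) => void,
): Promise<AccumulatedMessage> {
  const result: AccumulatedMessage = {
    role: "assistant",
    content: "",
    thinking: "",
    tool_calls: [],
  };
  for await (const chunk of stream) {
    const { content, thinking, tool_calls } = chunk.message;
    if (thinking) result.thinking += thinking;
    if (content) {
      result.content += content;
      onContent?.(content); // e.g. write to stdout as text arrives
    }
    if (tool_calls) result.tool_calls.push(...tool_calls);
  }
  return result;
}
```

A caller could pass the stream from ollama.chat({ ..., stream: true }) together with a callback such as (text) => process.stdout.write(text), then push the returned message onto the messages array before the next turn.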
