Hugging Face Inference Patterns

Quick Guide: Use @huggingface/inference (v4+) to access 200k+ ML models on the Hugging Face Hub. Use InferenceClient with chatCompletion() for OpenAI-compatible chat, textGeneration() for raw text completion, chatCompletionStream() for streaming, featureExtraction() for embeddings, textToImage() for image generation, and automaticSpeechRecognition() for audio transcription. Set provider to route through inference providers (Cerebras, Together, Groq, etc.) or use endpointUrl for dedicated Inference Endpoints.

<critical_requirements>

CRITICAL: Before Using This Skill

All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)

(You MUST always pass an access token to InferenceClient -- never deploy without authentication)

(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)

(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)

(You MUST specify a model parameter for every inference call -- there is no default model)

(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)

</critical_requirements>

Auto-detection: Hugging Face, huggingface, @huggingface/inference, InferenceClient, HfInference, hf.chatCompletion, hf.textGeneration, hf.featureExtraction, hf.textToImage, hf.automaticSpeechRecognition, hf.translation, hf.summarization, hf.textToSpeech, chatCompletionStream, textGenerationStream, HF_TOKEN, inference provider, Inference Endpoints

When to use:

Accessing any of the 200k+ models hosted on the Hugging Face Hub
Running chat completion with open-source LLMs (Qwen, Mistral, Llama, etc.)
Generating embeddings with sentence-transformer models for semantic search
Generating images from text prompts (FLUX, Stable Diffusion)
Transcribing audio with automatic speech recognition models
Running translation, summarization, text classification, or NER tasks
Deploying models on dedicated Inference Endpoints for production use
Using third-party inference providers (Cerebras, Together, Groq, Replicate, etc.) through a unified API

Key patterns covered:

InferenceClient initialization and configuration
Chat Completion API (OpenAI-compatible messages format, streaming)
Text generation (raw completion, streaming)
Embeddings via feature extraction
Image generation (text-to-image)
Audio transcription (automatic speech recognition)
Translation, summarization, and text classification
Inference Endpoints (dedicated deployments)
Inference Providers (routing through third-party services)
Error handling with typed error classes

When NOT to use:

If you only use OpenAI models -- use the OpenAI SDK directly
If you need a provider-agnostic unified SDK with structured outputs and tool calling -- use a higher-level AI SDK
If you need to fine-tune or train models -- use Python transformers or another training library (the @huggingface/hub package manages Hub repositories and files; it does not train models)

Examples Index

Core: Setup, Chat & Text Generation -- Client init, chat completion, text generation, streaming, error handling
Tasks: Embeddings, Vision, Audio & NLP -- Feature extraction, image generation, speech recognition, translation, summarization, classification
Quick API Reference -- Method signatures, error types, provider list, model recommendations

<philosophy>

Philosophy

The @huggingface/inference SDK provides a unified TypeScript client for accessing hundreds of thousands of ML models through multiple backends: serverless Inference Providers, dedicated Inference Endpoints, and local servers.

Core principles:

Model-agnostic access -- One client, any model on the Hub. Swap models by changing the model parameter; the surrounding call code stays the same.
Provider flexibility -- Route inference through 20+ providers (Cerebras, Together, Groq, Replicate, etc.) with a single provider parameter, or deploy your own Inference Endpoints.
Task-oriented API -- Methods map to ML tasks (chatCompletion, textToImage, automaticSpeechRecognition), not raw HTTP endpoints.
OpenAI-compatible chat -- chatCompletion() uses the OpenAI message format (role + content), making migration between providers easy.
Streaming as async generators -- chatCompletionStream() and textGenerationStream() return AsyncGenerator, consumed with for await...of.

</philosophy>
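As a rough illustration of the model-agnostic and OpenAI-compatible principles, the sketch below builds a chat request as plain data, so swapping models changes a single argument. The ChatMessage and ChatRequest types and the buildChatRequest helper are hypothetical names for this example, not SDK exports; the field names mirror the request shape used in Pattern 2, and "your-org/your-chat-model" is a placeholder.

```typescript
// Hypothetical local types mirroring the OpenAI-style message format (role + content).
// They are for illustration only and are not imported from @huggingface/inference.
type ChatRole = "system" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens: number;
}

const DEFAULT_MAX_TOKENS = 512;
const SYSTEM_PROMPT = "You are a helpful coding assistant.";
const USER_PROMPT = "Explain TypeScript generics.";

// Builds the request as plain data; the model identifier is the only thing that varies.
function buildChatRequest(
  model: string,
  userPrompt: string,
  systemPrompt?: string,
): ChatRequest {
  const messages: ChatMessage[] = [];
  if (systemPrompt) {
    messages.push({ role: "system", content: systemPrompt });
  }
  messages.push({ role: "user", content: userPrompt });
  return { model, messages, max_tokens: DEFAULT_MAX_TOKENS };
}

const qwenRequest = buildChatRequest("Qwen/Qwen3-32B", USER_PROMPT, SYSTEM_PROMPT);
// Same call shape, different model identifier (placeholder name).
const otherRequest = buildChatRequest("your-org/your-chat-model", USER_PROMPT, SYSTEM_PROMPT);

console.log(qwenRequest.messages.length, otherRequest.model);
```

Keeping the request as data is one possible design; it makes the model choice easy to move into configuration without touching the call site.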

Core Patterns

Pattern 1: Client Setup

Initialize with your Hugging Face access token. The token is required for authenticated access.

// lib/hf-client.ts -- basic setup
import { InferenceClient } from "@huggingface/inference";

const client = new InferenceClient(process.env.HF_TOKEN);

export { client };

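// lib/hf-client.ts -- with custom endpoint (alternative to the basic setup above)
import { InferenceClient } from "@huggingface/inference";

// Placeholder URL: replace with your own Inference Endpoint
const ENDPOINT_URL =
  "https://your-endpoint.us-east-1.aws.endpoints.huggingface.cloud/v1/";

const client = new InferenceClient(process.env.HF_TOKEN, {
  endpointUrl: ENDPOINT_URL,
});

export { client };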

Why good: Token from env var, named constant for endpoint URL, named export

// BAD: Hardcoded token, no named export
const hf = new InferenceClient("hf_abc123xyz");
export default hf;

Why bad: Hardcoded token is a security risk, default export violates conventions

See: examples/core.md for provider routing, local endpoints, and endpoint helper
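Because process.env.HF_TOKEN is undefined when the variable is missing, it can help to fail fast at startup rather than discover the problem on the first request. The following is a minimal sketch of that idea; requireAccessToken and TOKEN_ENV_VAR are hypothetical names introduced for this example, not part of the SDK.

```typescript
const TOKEN_ENV_VAR = "HF_TOKEN";

// Hypothetical helper: returns the trimmed token or throws, so a missing
// variable surfaces at startup instead of on the first inference call.
function requireAccessToken(
  env: Record<string, string | undefined>,
): string {
  const token = env[TOKEN_ENV_VAR]?.trim();
  if (!token) {
    throw new Error(`${TOKEN_ENV_VAR} is not set`);
  }
  return token;
}

// Demonstration with a fake environment object (never commit real tokens).
const fakeEnv = { HF_TOKEN: "  placeholder-token  " };
console.log(requireAccessToken(fakeEnv).length);

// Possible usage with the client from this pattern:
//   const client = new InferenceClient(requireAccessToken(process.env));
```

Taking the environment as a parameter, instead of reading process.env inside the helper, keeps the function easy to test.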

Pattern 2: Chat Completion (OpenAI-Compatible)

Use chatCompletion() for conversational LLM tasks. Follows the OpenAI message format.

// Assumes `client` is the InferenceClient instance exported in Pattern 1
const MAX_TOKENS = 512;
const TEMPERATURE = 0.1;

const response = await client.chatCompletion({
  model: "Qwen/Qwen3-32B",
  provider: "cerebras",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain TypeScript generics." },
  ],
  max_tokens: MAX_TOKENS,
  temperature: TEMPERATURE,
});

// The first choice holds the assistant's reply
console.log(response.choices[0].message.content);

Why good: Named constants for parameters, explicit model and provider, system message for behavior

// BAD: No model specified, magic numbers, no system message
const response = await client.chatCompletion({
  messages: [{ role: "user", content: "do something" }],
  max_tokens: 512,
  temperature: 0.1,
});

Why bad: Missing required model, magic numbers, vague prompt, no system instruction

See: examples/core.md for multi-turn conversations and provider selection
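The critical requirements call for checking error types rather than using bare catch blocks. The sketch below shows one way to wrap a chat call so the caller receives a typed result instead of an exception. Several things here are assumptions: InferenceClientError is declared locally as a stand-in so the sketch runs on its own (real code would import it from @huggingface/inference), and ChatClient, ChatResult, and askModel are hypothetical names covering only the fields this example touches; the real client may need a thin adapter to fit the ChatClient shape.

```typescript
// Local stand-in so this sketch runs without the package installed.
// Assumption: real code imports InferenceClientError from "@huggingface/inference".
class InferenceClientError extends Error {}

// Hypothetical minimal shapes covering only what this sketch reads.
interface ChatResponse {
  choices: { message: { content?: string | null } }[];
}

interface ChatClient {
  chatCompletion(args: {
    model: string;
    messages: { role: string; content: string }[];
    max_tokens?: number;
  }): Promise<ChatResponse>;
}

type ChatResult =
  | { ok: true; text: string }
  | { ok: false; reason: "inference" | "unexpected"; message: string };

const MAX_TOKENS = 512;

async function askModel(
  client: ChatClient,
  model: string,
  prompt: string,
): Promise<ChatResult> {
  try {
    const response = await client.chatCompletion({
      model,
      messages: [{ role: "user", content: prompt }],
      max_tokens: MAX_TOKENS,
    });
    // choices can be empty and content can be null, so fall back to "".
    return { ok: true, text: response.choices[0]?.message.content ?? "" };
  } catch (error) {
    // Check the error type before deciding how to report it.
    if (error instanceof InferenceClientError) {
      return { ok: false, reason: "inference", message: error.message };
    }
    const message = error instanceof Error ? error.message : String(error);
    return { ok: false, reason: "unexpected", message };
  }
}
```

A caller can then branch on result.ok and result.reason, for example retrying or switching provider on "inference" failures while treating "unexpected" ones as bugs.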

Pattern 3: Streaming Chat Completion

Use chatCompletionStream() for streaming responses. Returns an AsyncGenerator.

// Assumes `client` is the InferenceClient instance exported in Pattern 1
const MAX_TOKENS = 512;
let fullResponse = ""; // accumulates the streamed text for use after the loop

for await (const chunk of client.chatCompletionStream({
  model: "Qwen/Qwen3-32B",
  provider: "cerebras",
  messages: [{ role: "user", content: "Explain async/await in TypeScript." }],
  max_tokens: MAX_TOKENS,
})) {
  // Some chunks carry no choices or no content, so guard both
  if (chunk.choices && chunk.choices.length > 0) {
    const content = chunk.choices[0].delta.content;
    if (content) {
      process.stdout.write(content);
      fullResponse += content;
    }
  }
}
console.log(); // newline after the streamed output

Why good: Async generator consumed with for await, progressive output, null checks on chunk data

// BAD: Not checking chunk.choices, ignoring null content
for await (const chunk of client.chatCompletionStream({
  model: "...",
  messages: [],
})) {
  process.stdout.write(chunk.choices[0].delta.content); // May throw on null
}

Why bad: No null check -- choices may be empty, and delta.content may be null or undefined in some chunks, in which case process.stdout.write() throws

See: examples/core.md for text generation streaming
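The streaming loop in this pattern can be factored into a reusable helper. The sketch below accepts any AsyncIterable of chunks, which keeps it testable without network access. StreamChunk, collectStream, and fakeStream are hypothetical names for this example; StreamChunk models only the fields the loop reads, and fakeStream stands in for client.chatCompletionStream(...).

```typescript
// Hypothetical minimal chunk shape: only the fields read by the loop.
interface StreamChunk {
  choices?: { delta: { content?: string | null } }[];
}

// Accumulates streamed deltas into one string, skipping chunks with no content.
async function collectStream(
  stream: AsyncIterable<StreamChunk>,
  onToken?: (token: string) => void,
): Promise<string> {
  let fullResponse = "";
  for await (const chunk of stream) {
    // Optional chaining covers missing choices, empty choices, and null content.
    const content = chunk.choices?.[0]?.delta.content;
    if (content) {
      onToken?.(content);
      fullResponse += content;
    }
  }
  return fullResponse;
}

// Fake stream standing in for client.chatCompletionStream(...) so the sketch runs offline.
async function* fakeStream(): AsyncGenerator<StreamChunk> {
  yield { choices: [{ delta: { content: "Hello" } }] };
  yield { choices: [] };
  yield { choices: [{ delta: { content: null } }] };
  yield { choices: [{ delta: { content: ", world" } }] };
}

collectStream(fakeStream(), (token) => console.log(token)).then((text) => {
  console.log(text); // "Hello, world"
});
```

In an application, the onToken callback could write to process.stdout or push to a UI, while the returned string is available for logging or caching once the stream ends.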
