Hugging Face Inference Patterns
Quick Guide: Use @huggingface/inference (v4+) to access 200k+ ML models on the Hugging Face Hub. Use InferenceClient with chatCompletion() for OpenAI-compatible chat, textGeneration() for raw text completion, chatCompletionStream() for streaming, featureExtraction() for embeddings, textToImage() for image generation, and automaticSpeechRecognition() for audio transcription. Set the provider option to route through inference providers (Cerebras, Together, Groq, etc.) or use endpointUrl for dedicated Inference Endpoints.
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST always pass an access token to InferenceClient -- never deploy without authentication)
(You MUST use chatCompletion() / chatCompletionStream() for conversational LLM tasks -- these follow the OpenAI-compatible message format)
(You MUST handle errors using InferenceClientError and its subclasses -- never use bare catch blocks without error type checking)
(You MUST specify a model parameter for every inference call -- there is no default model)
(You MUST never hardcode access tokens -- always use environment variables via process.env.HF_TOKEN)
</critical_requirements>
Auto-detection: Hugging Face, huggingface, @huggingface/inference, InferenceClient, HfInference, hf.chatCompletion, hf.textGeneration, hf.featureExtraction, hf.textToImage, hf.automaticSpeechRecognition, hf.translation, hf.summarization, hf.textToSpeech, chatCompletionStream, textGenerationStream, HF_TOKEN, inference provider, Inference Endpoints
When to use:
- Accessing any of the 200k+ models hosted on the Hugging Face Hub
- Running chat completion with open-source LLMs (Qwen, Mistral, Llama, etc.)
- Generating embeddings with sentence-transformer models for semantic search
- Generating images from text prompts (FLUX, Stable Diffusion)
- Transcribing audio with automatic speech recognition models
- Running translation, summarization, text classification, or NER tasks
- Deploying models on dedicated Inference Endpoints for production use
- Using third-party inference providers (Cerebras, Together, Groq, Replicate, etc.) through a unified API
Key patterns covered:
- InferenceClient initialization and configuration
- Chat Completion API (OpenAI-compatible messages format, streaming)
- Text generation (raw completion, streaming)
- Embeddings via feature extraction
- Image generation (text-to-image)
- Audio transcription (automatic speech recognition)
- Translation, summarization, and text classification
- Inference Endpoints (dedicated deployments)
- Inference Providers (routing through third-party services)
- Error handling with typed error classes
When NOT to use:
- If you only use OpenAI models -- use the OpenAI SDK directly
- If you need a provider-agnostic unified SDK with structured outputs and tool calling -- use a higher-level AI SDK
- If you need to fine-tune or train models -- use the @huggingface/hub package or Python transformers
Examples Index
- Core: Setup, Chat & Text Generation -- Client init, chat completion, text generation, streaming, error handling
- Tasks: Embeddings, Vision, Audio & NLP -- Feature extraction, image generation, speech recognition, translation, summarization, classification
- Quick API Reference -- Method signatures, error types, provider list, model recommendations
<philosophy>
Philosophy
The @huggingface/inference SDK provides a unified TypeScript client for accessing hundreds of thousands of ML models through multiple backends: serverless Inference Providers, dedicated Inference Endpoints, and local servers.
Core principles:
- Model-agnostic access -- One client, any model on the Hub. Swap models by changing the model parameter without code changes.
- Provider flexibility -- Route inference through 20+ providers (Cerebras, Together, Groq, Replicate, etc.) with a single provider parameter, or deploy your own Inference Endpoints.
- Task-oriented API -- Methods map to ML tasks (chatCompletion, textToImage, automaticSpeechRecognition), not raw HTTP endpoints.
- OpenAI-compatible chat -- chatCompletion() uses the OpenAI message format (role + content), making migration between providers easy.
- Streaming as async generators -- chatCompletionStream() and textGenerationStream() return AsyncGenerator, consumed with for await...of.
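The first two principles can be sketched by keeping every model and provider choice in one table, so that swapping either is a one-line change. Everything named here (ChatRoute, CHAT_ROUTES, buildChatArgs, the route names) is hypothetical and not part of @huggingface/inference, and the second model ID is only an example whose availability should be checked on the Hub.

```typescript
// Hypothetical helper types; these names are NOT part of @huggingface/inference.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRoute {
  model: string;
  provider?: string; // assumption: left unset when no specific provider is wanted
}

type RouteName = "fast" | "fallback";

const DEFAULT_MAX_TOKENS = 512;

// Swapping a model or provider means editing one entry in this table.
const CHAT_ROUTES: Record<RouteName, ChatRoute> = {
  fast: { model: "Qwen/Qwen3-32B", provider: "cerebras" },
  fallback: { model: "meta-llama/Llama-3.1-8B-Instruct" }, // example model ID
};

// Builds an argument object in the shape the chat patterns in this guide use.
function buildChatArgs(routeName: RouteName, messages: ChatMessage[]) {
  const route = CHAT_ROUTES[routeName];
  return { ...route, messages, max_tokens: DEFAULT_MAX_TOKENS };
}

export { buildChatArgs, CHAT_ROUTES };
export type { ChatMessage, ChatRoute, RouteName };
```

The returned object is meant to be passed to client.chatCompletion(). The SDK may type provider more narrowly than string, in which case a real project would import the SDK's own type instead of the stand-in used here.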
<patterns>
Core Patterns
Pattern 1: Client Setup
Initialize the client with your Hugging Face access token, read from the HF_TOKEN environment variable. The token authenticates the requests the client sends.
// lib/hf-client.ts -- basic setup
import { InferenceClient } from "@huggingface/inference";
const client = new InferenceClient(process.env.HF_TOKEN);
export { client };
// lib/hf-client.ts -- with custom endpoint (alternative to the basic setup)
import { InferenceClient } from "@huggingface/inference";
const ENDPOINT_URL =
  "https://your-endpoint.us-east-1.aws.endpoints.huggingface.cloud/v1/";
const client = new InferenceClient(process.env.HF_TOKEN, {
  endpointUrl: ENDPOINT_URL,
});
export { client };
Why good: Token from env var, named constant for endpoint URL, named export
// BAD: Hardcoded token, no named export
const hf = new InferenceClient("hf_abc123xyz");
export default hf;
Why bad: Hardcoded token is a security risk, default export violates conventions
See: examples/core.md for provider routing, local endpoints, and endpoint helper
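Because process.env.HF_TOKEN is undefined when the variable is missing, one option is to fail fast before the client is constructed. The sketch below is a minimal illustration of that idea; requireAccessToken and TOKEN_ENV_VAR are hypothetical names, not part of the SDK, and the environment is passed in as a plain object so the helper can be exercised without touching the real process environment.

```typescript
// Hypothetical helper; NOT part of @huggingface/inference.
const TOKEN_ENV_VAR = "HF_TOKEN";

// Returns the trimmed token, or throws if it is missing or blank.
function requireAccessToken(env: Record<string, string | undefined>): string {
  const token = env[TOKEN_ENV_VAR]?.trim();
  if (!token) {
    throw new Error(
      `${TOKEN_ENV_VAR} is not set; refusing to create an unauthenticated client`,
    );
  }
  return token;
}

export { requireAccessToken, TOKEN_ENV_VAR };
```

With this in place, the basic setup would become new InferenceClient(requireAccessToken(process.env)), so a missing token surfaces at startup rather than on the first inference call.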
Pattern 2: Chat Completion (OpenAI-Compatible)
Use chatCompletion() for conversational LLM tasks. Follows the OpenAI message format.
const MAX_TOKENS = 512;
const TEMPERATURE = 0.1;
const response = await client.chatCompletion({
  model: "Qwen/Qwen3-32B",
  provider: "cerebras",
  messages: [
    { role: "system", content: "You are a helpful coding assistant." },
    { role: "user", content: "Explain TypeScript generics." },
  ],
  max_tokens: MAX_TOKENS,
  temperature: TEMPERATURE,
});
console.log(response.choices[0].message.content);
Why good: Named constants for parameters, explicit model and provider, system message for behavior
// BAD: No model specified, magic numbers, no system message
const response = await client.chatCompletion({
  messages: [{ role: "user", content: "do something" }],
  max_tokens: 512,
  temperature: 0.1,
});
Why bad: Missing required model, magic numbers, vague prompt, no system instruction
See: examples/core.md for multi-turn conversations and provider selection
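Reading response.choices[0].message.content directly assumes at least one choice with non-empty content. A more defensive read might look like the sketch below. ChatCompletionLike is a structural stand-in that declares only the fields used here (the SDK's real response type has more), and firstMessageContent and EMPTY_CONTENT are hypothetical names.

```typescript
// Structural stand-in for the response fields read below; NOT the SDK's own type.
interface ChatCompletionLike {
  choices: Array<{ message: { content?: string | null } }>;
}

const EMPTY_CONTENT = "";

// Returns the first choice's text, or an empty string when there is no
// choice or the content is null/undefined.
function firstMessageContent(response: ChatCompletionLike): string {
  const [firstChoice] = response.choices;
  return firstChoice?.message.content ?? EMPTY_CONTENT;
}

export { firstMessageContent };
export type { ChatCompletionLike };
```

Falling back to an empty string is one design choice; callers that need to tell "no answer" apart from "empty answer" may prefer to return undefined or throw instead.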
Pattern 3: Streaming Chat Completion
Use chatCompletionStream() for streaming responses. Returns an AsyncGenerator.
const MAX_TOKENS = 512;
let fullResponse = "";
for await (const chunk of client.chatCompletionStream({
  model: "Qwen/Qwen3-32B",
  provider: "cerebras",
  messages: [{ role: "user", content: "Explain async/await in TypeScript." }],
  max_tokens: MAX_TOKENS,
})) {
  if (chunk.choices && chunk.choices.length > 0) {
    const content = chunk.choices[0].delta.content;
    if (content) {
      process.stdout.write(content);
      fullResponse += content;
    }
  }
}
console.log(); // newline
Why good: Async generator consumed with for await, progressive output, null checks on chunk data
// BAD: Not checking chunk.choices, ignoring null content
for await (const chunk of client.chatCompletionStream({
  model: "...",
  messages: [],
})) {
  process.stdout.write(chunk.choices[0].delta.content); // May throw on null
}
Why bad: No null check -- choices may be empty, content may be null between chunks
See: examples/core.md for text generation streaming
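The accumulation loop in this pattern can be factored into a reusable helper that accepts any async iterable of chunk-shaped objects, which also makes it testable without a network call. This is a sketch under assumptions: StreamChunkLike declares only the fields the loop reads and is not the SDK's own chunk type, and collectStreamText is a hypothetical name.

```typescript
// Structural stand-in for the chunk fields read below; NOT the SDK's own type.
interface StreamChunkLike {
  choices?: Array<{ delta: { content?: string | null } }>;
}

// Consumes a stream of chat chunks, forwarding each non-empty delta to
// onDelta (if given) and returning the concatenated text.
async function collectStreamText(
  stream: AsyncIterable<StreamChunkLike>,
  onDelta?: (text: string) => void,
): Promise<string> {
  let fullText = "";
  for await (const chunk of stream) {
    // Skip chunks with no choices or with null/undefined/empty content.
    const content = chunk.choices?.[0]?.delta.content;
    if (!content) continue;
    onDelta?.(content);
    fullText += content;
  }
  return fullText;
}

export { collectStreamText };
export type { StreamChunkLike };
```

A call would pass the generator returned by client.chatCompletionStream() as the first argument and, for progressive output, a callback such as (text) => process.stdout.write(text) as the second.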