Together AI SDK Patterns
Quick Guide: Use the together-ai npm package to access 200+ open-source models (Llama, Qwen, Mistral, DeepSeek) via Together AI's fast inference API. The SDK mirrors the OpenAI API shape -- client.chat.completions.create() for chat, client.images.generate() for images, client.embeddings.create() for embeddings. Use response_format: { type: "json_schema" } with Zod-generated schemas for structured output. Function calling uses the same tools parameter shape as OpenAI. You can also use the OpenAI SDK directly by pointing baseURL to https://api.together.xyz/v1.
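The base URL swap mentioned above can be sketched as a small options builder. This is a minimal illustration, not the SDK's own helper: togetherOptionsForOpenAI and OpenAICompatibleOptions are hypothetical names, and the usage comment assumes the openai package is installed.

```typescript
// Base URL quoted in the quick guide.
const TOGETHER_BASE_URL = "https://api.together.xyz/v1";

// Only the two constructor options this sketch needs.
interface OpenAICompatibleOptions {
  apiKey: string;
  baseURL: string;
}

// Hypothetical helper: builds options that point an OpenAI SDK client at Together AI.
function togetherOptionsForOpenAI(apiKey: string): OpenAICompatibleOptions {
  return { apiKey, baseURL: TOGETHER_BASE_URL };
}

// Usage (assumes the `openai` package and a Node.js environment):
//   import OpenAI from "openai";
//   const client = new OpenAI(togetherOptionsForOpenAI(process.env.TOGETHER_API_KEY ?? ""));
```

Keeping the URL in one named constant means the OpenAI-compatible path and any health checks share the same endpoint definition.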
<critical_requirements>
CRITICAL: Before Using This Skill
All code must follow project conventions in CLAUDE.md (kebab-case, named exports, import ordering, import type, named constants)
(You MUST use the together-ai package (import Together from "together-ai") -- NOT the OpenAI SDK -- unless explicitly building an OpenAI-compatible integration)
(You MUST include the JSON schema in BOTH the response_format parameter AND the system prompt when using structured output -- the model needs both)
(You MUST handle errors using Together.APIError and its subclasses -- never use bare catch blocks without error type checking)
(You MUST never hardcode API keys -- always use environment variables via process.env.TOGETHER_API_KEY)
</critical_requirements>
Auto-detection: Together AI, together-ai, together.ai, TOGETHER_API_KEY, client.chat.completions (together), client.images.generate, client.embeddings.create (together), Llama-3, Qwen3, Mistral, DeepSeek, FLUX, together.images, together.chat, together.embeddings, together.fineTuning, api.together.xyz
When to use:
- Running open-source LLMs (Llama, Qwen, Mistral, DeepSeek) via serverless inference
- Generating images with FLUX or Stable Diffusion models
- Creating embeddings for RAG pipelines with open-source embedding models
- Using function calling / tool use with open-source models
- Extracting structured JSON output from LLM responses
- Fine-tuning open-source models on custom data
- Migrating from OpenAI to open-source models with minimal code changes
Key patterns covered:
- Client initialization and configuration (retries, timeouts, logging)
- Chat completions with open-source models (Llama, Qwen, Mistral, DeepSeek)
- Streaming with stream: true and for await...of
- Structured output with response_format: { type: "json_schema" } and Zod
- Function calling / tool use with tools parameter
- Image generation with FLUX and Stable Diffusion models
- Embeddings API with open-source embedding models
- Fine-tuning API (file upload, job creation, monitoring)
- OpenAI SDK compatibility (base URL swap)
- Error handling, retries, timeouts
When NOT to use:
- You need OpenAI-specific features (Responses API, Batch API, Realtime API) -- use the OpenAI SDK directly
- You want framework-specific chat UI hooks -- use a framework-integrated AI SDK
- You only use OpenAI models and never plan to use open-source models
Examples Index
- Core: Setup & Configuration -- Client init, production config, error handling, OpenAI compatibility
- Chat Completions -- Basic chat, multi-turn, model selection, vision
- Streaming -- Async iteration, stream cancellation
- Tool/Function Calling -- Tool definitions, multi-step tool loops
- Structured Output -- JSON mode, Zod schemas, regex mode
- Images & Embeddings -- FLUX image generation, embedding models, semantic search
- Quick API Reference -- Model IDs, method signatures, error types
<philosophy>
Philosophy
Together AI provides fast serverless inference for open-source models. The TypeScript SDK (together-ai) is auto-generated with Stainless and mirrors the OpenAI API shape, making migration straightforward.
Core principles:
- OpenAI-compatible API shape -- Same client.chat.completions.create() pattern, same messages array, same tools parameter. Switching from OpenAI is often just changing the import and model name.
- Open-source model access -- Run Llama, Qwen, Mistral, DeepSeek, and 200+ other models without managing infrastructure. Models are identified by their Hugging Face-style IDs (e.g., meta-llama/Llama-3.3-70B-Instruct-Turbo).
- Multi-modal support -- Chat completions, image generation (FLUX, Stable Diffusion), embeddings, audio, and video -- all through one SDK.
- Structured output via JSON Schema -- Pass a JSON schema in response_format and include it in the system prompt. Use Zod's z.toJSONSchema() to generate schemas from TypeScript types.
- Fine-tuning open-source models -- Upload JSONL data, create LoRA or full fine-tuning jobs, and deploy custom models -- all via the API.
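The structured-output principle (schema in response_format and in the system prompt) can be sketched as a pure request builder. This is an illustration under stated assumptions: buildStructuredParams and StructuredChatParams are hypothetical names, placing the schema under a schema key next to type is an assumption to verify against the SDK types, and the schema is hand-written here where a real project might generate it with z.toJSONSchema().

```typescript
type JsonSchema = Record<string, unknown>;

// Structural stand-in for the request fields this sketch sets.
interface StructuredChatParams {
  model: string;
  messages: { role: "system" | "user"; content: string }[];
  // Assumption: the schema is passed under a `schema` key alongside `type`.
  response_format: { type: "json_schema"; schema: JsonSchema };
}

const DEFAULT_MODEL = "meta-llama/Llama-3.3-70B-Instruct-Turbo";

// Hypothetical helper: puts the same schema in the system prompt and in response_format.
function buildStructuredParams(
  schema: JsonSchema,
  userPrompt: string,
  model: string = DEFAULT_MODEL,
): StructuredChatParams {
  const systemPrompt = `Respond only with JSON matching this schema:\n${JSON.stringify(schema)}`;
  return {
    model,
    messages: [
      { role: "system", content: systemPrompt },
      { role: "user", content: userPrompt },
    ],
    response_format: { type: "json_schema", schema },
  };
}

// Hand-written schema for illustration only.
const EVENT_SCHEMA: JsonSchema = {
  type: "object",
  properties: {
    title: { type: "string" },
    attendees: { type: "array", items: { type: "string" } },
  },
  required: ["title", "attendees"],
};

const structuredParams = buildStructuredParams(EVENT_SCHEMA, "Standup with Ana and Raj at 9am.");
```

The resulting object would be passed to client.chat.completions.create(); building it in one place makes it harder to update the schema in response_format and forget the prompt.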
When to use Together AI:
- You want to use open-source models with fast serverless inference
- You need cost-effective inference (often cheaper than proprietary APIs)
- You want to fine-tune open-source models on your data
- You need image generation with FLUX models
- You want OpenAI API compatibility for easy migration
When NOT to use:
- You need OpenAI-specific features (Responses API, Batch API, Realtime) -- use the OpenAI SDK
- You need Anthropic or Google-specific features -- use their respective SDKs
- You want a provider-agnostic SDK -- use a unified provider framework
<patterns>
Core Patterns
Pattern 1: Client Setup
Initialize the Together client. It reads TOGETHER_API_KEY from the environment.
// lib/together.ts -- basic setup
import Together from "together-ai";
const client = new Together();
export { client };
// lib/together.ts -- production configuration
import Together from "together-ai";
// Named constants keep timeout and retry settings easy to find and tune
const TIMEOUT_MS = 30_000;
const MAX_RETRIES = 3;
const client = new Together({
apiKey: process.env.TOGETHER_API_KEY,
timeout: TIMEOUT_MS,
maxRetries: MAX_RETRIES,
});
export { client };
Why good: Minimal setup, env var auto-detected, named constants for production settings
// BAD: Hardcoded API key
const client = new Together({
apiKey: "sk-abc123...",
});
Why bad: Hardcoded keys get leaked in version control, security breach risk
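Since the key comes from the environment, an explicit startup check can produce a clearer error than a failure deep inside a request. A minimal sketch, assuming a hypothetical readApiKey helper that takes the environment as a plain object so it can be exercised without touching process.env:

```typescript
const API_KEY_ENV_VAR = "TOGETHER_API_KEY";

// Hypothetical helper: returns the trimmed key or throws with a descriptive message.
function readApiKey(env: Record<string, string | undefined>): string {
  const value = env[API_KEY_ENV_VAR]?.trim();
  if (!value) {
    throw new Error(`${API_KEY_ENV_VAR} is not set; export it before starting the app`);
  }
  return value;
}

// Usage (in Node.js):
//   const client = new Together({ apiKey: readApiKey(process.env) });
```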
See: examples/core.md for error handling, OpenAI compatibility, per-request overrides
Pattern 2: Chat Completions
Stateless text generation with open-source models.
const completion = await client.chat.completions.create({
model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
messages: [
{ role: "system", content: "You are a helpful coding assistant." },
{ role: "user", content: "Explain TypeScript generics." },
],
});
console.log(completion.choices[0].message.content);
Why good: Clear message roles, system message for behavior control, direct content access
// BAD: No system message, no model specified
const res = await client.chat.completions.create({
messages: [{ role: "user", content: "do something" }],
});
Why bad: Missing model field will error, no system instruction means unpredictable behavior
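Direct access through choices[0] assumes the response always carries at least one choice with text content. Where that is not guaranteed, a defensive accessor is one option. The sketch below uses CompletionLike, a hypothetical structural stand-in for the response fields involved, not the SDK's own type.

```typescript
// Minimal structural stand-in for the response fields read here.
interface CompletionLike {
  choices: { message?: { content?: string | null } }[];
}

const EMPTY_CONTENT = "";

// Returns the first choice's text, or an empty string when it is absent.
function firstMessageContent(completion: CompletionLike): string {
  return completion.choices[0]?.message?.content ?? EMPTY_CONTENT;
}

// Usage: console.log(firstMessageContent(completion));
```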
See: examples/chat.md for multi-turn, vision models, model selection guide
Pattern 3: Streaming
Use streaming for user-facing responses.
const stream = await client.chat.completions.create({
model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
messages: [{ role: "user", content: "Explain async/await." }],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) process.stdout.write(content);
}
Why good: Progressive output for better UX, standard async iterator pattern
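When the full text is also needed after streaming (for logging or storing the reply), the same loop can accumulate it. A minimal sketch: collectStreamText and ChunkLike are hypothetical names, and fakeStream stands in for the SDK stream so the example runs without network access.

```typescript
// Structural stand-in for the chunk fields read in the loop above.
interface ChunkLike {
  choices: { delta?: { content?: string | null } }[];
}

// Forwards each token to an optional callback and returns the concatenated text.
async function collectStreamText(
  stream: AsyncIterable<ChunkLike>,
  onToken?: (token: string) => void,
): Promise<string> {
  let fullText = "";
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (!content) continue; // skip chunks without text (empty deltas, no choices)
    onToken?.(content);
    fullText += content;
  }
  return fullText;
}

// Fake stream for local experimentation only.
async function* fakeStream(): AsyncGenerator<ChunkLike> {
  yield { choices: [{ delta: { content: "Hel" } }] };
  yield { choices: [{ delta: {} }] };
  yield { choices: [] };
  yield { choices: [{ delta: { content: "lo" } }] };
}
```

With a real stream, the callback could be (token) => process.stdout.write(token), keeping the progressive output while still returning the complete reply.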
// BAD: Not consuming the stream
const stream = await client.chat.completions.create({