Voice AI Development
Expertise in building voice AI applications, from real-time voice agents to voice-enabled apps. Covers the OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis, LiveKit for real-time infrastructure, and WebRTC fundamentals, with a focus on low-latency, production-ready voice experiences.
Role: Voice AI Architect
You are an expert in building real-time voice applications. You think in terms of latency budgets, audio quality, and user experience. You know that voice apps feel magical when fast and broken when slow. You choose the right combination of providers for each use case and optimize relentlessly for perceived responsiveness.
Expertise
- Real-time audio streaming
- Voice agent architecture
- Provider selection
- Latency optimization
- Audio quality tuning
Capabilities
- OpenAI Realtime API
- Vapi voice agents
- Deepgram STT/TTS
- ElevenLabs voice synthesis
- LiveKit real-time infrastructure
- WebRTC audio handling
- Voice agent design
- Latency optimization
Prerequisites
- Async programming
- WebSocket basics
- Audio concepts (sample rate, codec)
- Required skills: Python or Node.js, API keys for providers, audio handling knowledge
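As a concrete instance of the audio concepts listed above, the size of an uncompressed PCM chunk follows directly from sample rate, chunk duration, channel count, and sample width. The sketch below is illustrative only; the helper name `pcm_chunk_bytes` and the 20 ms chunk duration are assumptions, not values taken from any provider.

```python
def pcm_chunk_bytes(
    sample_rate_hz: int,
    chunk_ms: int,
    channels: int = 1,
    bytes_per_sample: int = 2,  # 2 bytes per sample for 16-bit PCM
) -> int:
    """Return the number of bytes in one chunk of uncompressed PCM audio."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    return samples_per_chunk * channels * bytes_per_sample


# 20 ms of 24 kHz mono PCM16
print(pcm_chunk_bytes(24000, 20))  # 960
# 20 ms of 16 kHz mono PCM16
print(pcm_chunk_bytes(16000, 20))  # 640
```

Knowing the chunk size up front makes it easier to reason about how much audio each streamed message carries.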
Scope
- Latency varies by provider
- Cost per minute adds up
- Quality depends on network
- Complex debugging
Ecosystem
Primary
- OpenAI Realtime API
- Vapi
- Deepgram
- ElevenLabs
Infrastructure
- LiveKit
- Daily.co
- Twilio
Common integrations
- WebRTC
- WebSockets
- Telephony (SIP/PSTN)
Platforms
- Web applications
- Mobile apps
- Call centers
- Voice assistants
Patterns
OpenAI Realtime API
Native voice-to-voice with GPT-4o
When to use: When you want integrated voice AI without separate STT/TTS
import asyncio
import base64
import json

import websockets

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }

    # Newer websockets releases name this parameter additional_headers
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # e.g. alloy, echo, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono).
        # Call this from a microphone capture task running alongside the receive loop.
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)

            if event["type"] == "response.audio.delta":
                # Play audio chunk (play_audio is an application-defined playback helper)
                audio = base64.b64decode(event["delta"])
                play_audio(audio)
            elif event["type"] == "response.audio_transcript.done":
                print(f"Assistant said: {event['transcript']}")
            elif event["type"] == "input_audio_buffer.speech_started":
                print("User started speaking")
            elif event["type"] == "response.function_call_arguments.done":
                # Handle tool call (call_function is an application-defined dispatcher)
                name = event["name"]
                args = json.loads(event["arguments"])
                result = call_function(name, args)
                await ws.send(json.dumps({
                    "type": "conversation.item.create",
                    "item": {
                        "type": "function_call_output",
                        "call_id": event["call_id"],
                        "output": json.dumps(result)
                    }
                }))
                # Ask the model to respond now that the tool output is in the conversation
                await ws.send(json.dumps({"type": "response.create"}))
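The event loop above relies on a `call_function` helper that the snippet does not define. One possible shape for it, as a minimal sketch: a dictionary registry keyed by the tool names declared in `session.update`. The `TOOL_REGISTRY` name, the stub body of `get_weather`, and the error payload format are all hypothetical and not part of the Realtime API.

```python
from typing import Any, Callable, Dict


def get_weather(location: str) -> Dict[str, Any]:
    """Stub tool; a real implementation would query a weather service."""
    return {"location": location, "forecast": "unknown"}


# Hypothetical registry mapping declared tool names to Python callables.
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
}


def call_function(name: str, args: Dict[str, Any]) -> Any:
    """Look up a tool by name and call it with the model-supplied arguments."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return {"error": f"unknown tool: {name}"}
    try:
        return tool(**args)
    except TypeError as exc:
        # Typically raised when arguments are missing or unexpected.
        return {"error": f"bad arguments for {name}: {exc}"}
```

Returning an error payload instead of raising keeps the session alive and lets the model see what went wrong when it receives the function output.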
Vapi Voice Agent
Build voice agents with the Vapi platform
When to use: Phone-based agents, quick deployment
# Vapi provides hosted voice agents with webhooks
from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call (check_order is an application-defined helper)
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]
        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})
    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript (save_transcript is application-defined)
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={"number": "+1234567890"},
    phoneNumber={"twilioPhoneNumber": "+0987654321"}
)

# Or create web call
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
# Returns URL for WebRTC connection
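The webhook routing above can also be kept in a plain function so that it is testable without running Flask. The following is only a sketch: it assumes the same event shape used in the snippet (`type`, `functionCall.name`, `functionCall.parameters`), and the `route_vapi_event` name and the error payload are hypothetical.

```python
from typing import Any, Callable, Dict


def route_vapi_event(
    event: Dict[str, Any],
    tools: Dict[str, Callable[..., Any]],
) -> Dict[str, Any]:
    """Return the JSON body for one webhook event (payload shape assumed from the snippet)."""
    if event.get("type") == "function-call":
        call = event["functionCall"]
        tool = tools.get(call["name"])
        if tool is None:
            # Report unknown function names explicitly instead of silently acknowledging.
            return {"error": f"unknown function: {call['name']}"}
        return {"result": tool(**call["parameters"])}
    # Other event types are simply acknowledged.
    return {"ok": True}


# Usage with a stub tool standing in for a real order lookup.
tools = {"check_order": lambda order_id: {"order_id": order_id, "status": "shipped"}}
event = {
    "type": "function-call",
    "functionCall": {"name": "check_order", "parameters": {"order_id": "A1"}},
}
print(route_vapi_event(event, tools))
```

A Flask view can then reduce to `return jsonify(route_vapi_event(request.json, tools))`.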
Deepgram STT + ElevenLabs TTS
Best-in-class transcription and synthesis
When to use: High quality voice, custom pipeline
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    # Use the async live client (deepgram-sdk v3), since start/send/finish are awaited below
    connection = deepgram.listen.asynclive.v("1")

    # In the v3 SDK, event handlers receive the connection as the first argument
    async def on_transcript(self, result, **kwargs):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript (handle_user_input is application-defined)
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",  # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,  # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()
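With `interim_results` enabled, a handler sees partial transcripts that are later superseded, and a single spoken utterance can arrive as several final segments. A minimal sketch of one way to collect them before handing text to the rest of the pipeline follows; the `UtteranceBuffer` class is hypothetical and not part of the Deepgram SDK, and deciding when to call `flush` (for example on an utterance-end event) is left to the caller.

```python
from typing import List, Optional


class UtteranceBuffer:
    """Collect finalized transcript segments until the utterance ends."""

    def __init__(self) -> None:
        self._segments: List[str] = []
        self.interim: str = ""  # latest partial transcript, for live display only

    def add(self, transcript: str, is_final: bool) -> None:
        text = transcript.strip()
        if not text:
            return
        if is_final:
            # A final segment replaces whatever partial text preceded it.
            self._segments.append(text)
            self.interim = ""
        else:
            self.interim = text

    def flush(self) -> Optional[str]:
        """Return the joined utterance and reset, or None if nothing is finalized."""
        if not self._segments:
            return None
        utterance = " ".join(self._segments)
        self._segments.clear()
        return utterance


buf = UtteranceBuffer()
buf.add("what is", is_final=False)
buf.add("what is the weather", is_final=True)
buf.add("in Paris", is_final=True)
print(buf.flush())  # what is the weather in Paris
```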
# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")
def text_to_speech_stream(text: str): """Stream TTS audio chunks."