Add Voice Message Transcription

This skill adds automatic voice message transcription using OpenAI's Whisper API. When users send voice notes in WhatsApp, they'll be transcribed and the agent can read and respond to the content.

UX Note: When asking the user questions, prefer using the AskUserQuestion tool instead of just outputting text. This integrates with Claude's built-in question/answer system for a better experience.

Prerequisites

USER ACTION REQUIRED

Use the AskUserQuestion tool to present this:

You'll need an OpenAI API key for Whisper transcription.

Get one at: https://platform.openai.com/api-keys

Cost: ~$0.006 per minute of audio (~$0.003 per typical 30-second voice note)

Once you have your API key, we'll configure it securely.

Wait for user to confirm they have an API key before continuing.
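
If the user asks what the pricing means for their own usage, the arithmetic can be sketched as below. The per-minute rate is the approximate figure quoted above; the helper name estimateWhisperCostUsd is hypothetical and is not part of NanoClaw or the OpenAI SDK.

```typescript
// Approximate Whisper price quoted above, in USD per minute of audio.
const WHISPER_USD_PER_MINUTE = 0.006;

// Hypothetical helper: estimate the cost of transcribing one clip.
function estimateWhisperCostUsd(durationSeconds: number): number {
  if (durationSeconds < 0) {
    throw new RangeError('durationSeconds must be non-negative');
  }
  return (durationSeconds / 60) * WHISPER_USD_PER_MINUTE;
}

// A typical 30-second voice note.
console.log(estimateWhisperCostUsd(30).toFixed(4)); // prints "0.0030"
```

Actual billing may round durations differently, so treat the result as an estimate only.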

Implementation

Step 1: Add OpenAI Dependency

Read package.json and add the openai package to dependencies:

"dependencies": {
  ...existing dependencies...
  "openai": "^4.77.0"
}

Then install it. IMPORTANT: The OpenAI SDK declares Zod v3 as an optional peer dependency, but NanoClaw uses Zod v4, so npm will report a peer dependency conflict. Always install with --legacy-peer-deps:

npm install --legacy-peer-deps

Step 2: Create Transcription Configuration

Create a configuration file for transcription settings, leaving the API key empty for now (the user adds it manually afterwards):

Write to .transcription.config.json:

{
  "provider": "openai",
  "openai": {
    "apiKey": "",
    "model": "whisper-1"
  },
  "enabled": true,
  "fallbackMessage": "[Voice Message - transcription unavailable]"
}

Add this file to .gitignore to prevent committing API keys:

echo ".transcription.config.json" >> .gitignore

Use the AskUserQuestion tool to confirm:

I've created .transcription.config.json in the project root. You'll need to add your OpenAI API key to it manually:

Open .transcription.config.json

Replace the empty "apiKey": "" with your key: "apiKey": "sk-proj-..." 

Save the file

Let me know when you've added it.

Wait for user confirmation.
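
Before moving on, it can help to sanity-check the file the user edited. The following is a minimal sketch, assuming the config has exactly the shape written above; findConfigProblems is a hypothetical helper, not something NanoClaw provides.

```typescript
// Shape of .transcription.config.json as written in this step.
interface TranscriptionConfig {
  provider: string;
  openai?: { apiKey: string; model: string };
  enabled: boolean;
  fallbackMessage: string;
}

// Hypothetical helper: list the problems found in a parsed config.
function findConfigProblems(config: TranscriptionConfig): string[] {
  const problems: string[] = [];
  if (config.provider !== 'openai') {
    problems.push(`unsupported provider: ${config.provider}`);
  }
  if (!config.openai?.apiKey) {
    problems.push('openai.apiKey is empty');
  }
  if (!config.openai?.model) {
    problems.push('openai.model is missing');
  }
  return problems;
}

// The file as first created, before the user adds a key.
const untouched: TranscriptionConfig = JSON.parse(`{
  "provider": "openai",
  "openai": { "apiKey": "", "model": "whisper-1" },
  "enabled": true,
  "fallbackMessage": "[Voice Message - transcription unavailable]"
}`);

// Reports a single problem: the API key is still empty.
console.log(findConfigProblems(untouched));
```

An empty result suggests the file is ready; it does not prove the key itself is valid, which only a real API call can confirm.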

Step 3: Create Transcription Module

Create src/transcription.ts:

import { downloadMediaMessage } from '@whiskeysockets/baileys';
import type { WAMessage, WASocket } from '@whiskeysockets/baileys';
import fs from 'fs';
import path from 'path';
import { fileURLToPath } from 'url';

// Get __dirname equivalent in ES modules
const __filename = fileURLToPath(import.meta.url);
const __dirname = path.dirname(__filename);

// Configuration interface
interface TranscriptionConfig {
  provider: string;
  openai?: {
    apiKey: string;
    model: string;
  };
  enabled: boolean;
  fallbackMessage: string;
}

// Load configuration
function loadConfig(): TranscriptionConfig {
  const configPath = path.join(__dirname, '../.transcription.config.json');
  try {
    const configData = fs.readFileSync(configPath, 'utf-8');
    return JSON.parse(configData);
  } catch (err) {
    console.error('Failed to load transcription config:', err);
    return {
      provider: 'openai',
      enabled: false,
      fallbackMessage: '[Voice Message - transcription unavailable]'
    };
  }
}

// Transcribe audio using OpenAI Whisper API
async function transcribeWithOpenAI(audioBuffer: Buffer, config: TranscriptionConfig): Promise<string | null> {
  if (!config.openai?.apiKey) {
    console.warn('OpenAI API key not configured');
    return null;
  }

  try {
    // Dynamic import of openai
    const openaiModule = await import('openai');
    const OpenAI = openaiModule.default;
    const toFile = openaiModule.toFile;

    const openai = new OpenAI({
      apiKey: config.openai.apiKey
    });

    // Use OpenAI's toFile helper to create a proper file upload
    const file = await toFile(audioBuffer, 'voice.ogg', {
      type: 'audio/ogg'
    });

    // Call Whisper API
    const transcription = await openai.audio.transcriptions.create({
      file: file,
      model: config.openai.model || 'whisper-1',
      response_format: 'text'
    });

    // Type assertion needed: OpenAI SDK types response_format='text' as Transcription object,
    // but it actually returns a plain string when response_format is 'text'
    return transcription as unknown as string;
  } catch (err) {
    console.error('OpenAI transcription failed:', err);
    return null;
  }
}

// Main transcription function
export async function transcribeAudioMessage(
  msg: WAMessage,
  sock: WASocket
): Promise<string | null> {
  const config = loadConfig();

  // Check if transcription is enabled
  if (!config.enabled) {
    console.log('Transcription disabled in config');
    return config.fallbackMessage;
  }

  try {
    // Download the audio message
    const buffer = await downloadMediaMessage(
      msg,
      'buffer',
      {},
      {
        logger: console as any,
        reuploadRequest: sock.updateMediaMessage
      }
    ) as Buffer;

    if (!buffer || buffer.length === 0) {
      console.error('Failed to download audio message');
      return config.fallbackMessage;
    }

    console.log(`Downloaded audio message: ${buffer.length} bytes`);

    // Transcribe based on provider
    let transcript: string | null = null;

    switch (config.provider) {
      case 'openai':
        transcript = await transcribeWithOpenAI(buffer, config);
        break;
      default:
        console.error(`Unknown transcription provider: ${config.provider}`);
        return config.fallbackMessage;
    }

    if (!transcript) {
      return config.fallbackMessage;
    }

    return transcript.trim();
  } catch (err) {
    console.error('Transcription error:', err);
    return config.fallbackMessage;
  }
}

// Helper to check if a message is a voice note
export function isVoiceMessage(msg: WAMessage): boolean {
  return msg.message?.audioMessage?.ptt === true;
}
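
To illustrate what the isVoiceMessage check above accepts and rejects, here is a small self-contained sketch. MinimalMessage is a stand-in type invented for this example (the real WAMessage has many more fields), and the sample messages are made up.

```typescript
// Stand-in for the few WAMessage fields the check reads (an assumption for illustration).
interface MinimalMessage {
  message?: {
    conversation?: string | null;
    audioMessage?: { ptt?: boolean | null } | null;
  } | null;
}

// Same condition as isVoiceMessage above, applied to the stand-in type.
function isVoiceNote(msg: MinimalMessage): boolean {
  return msg.message?.audioMessage?.ptt === true;
}

const voiceNote: MinimalMessage = { message: { audioMessage: { ptt: true } } };
const audioFile: MinimalMessage = { message: { audioMessage: { ptt: false } } };
const textOnly: MinimalMessage = { message: { conversation: 'hello' } };
const empty: MinimalMessage = {};

console.log(isVoiceNote(voiceNote)); // true
console.log(isVoiceNote(audioFile)); // false
console.log(isVoiceNote(textOnly));  // false
console.log(isVoiceNote(empty));     // false
```

The strict comparison with true means audio attachments without the ptt flag, and messages with no body at all, are not treated as voice notes.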

Step 4: Update Database to Handle Transcribed Content

Read src/db.ts and find the storeMessage function. Update its signature and implementation to accept transcribed content:

Change the function signature from:

export function storeMessage(msg: proto.IWebMessageInfo, chatJid: string, isFromMe: boolean, pushName?: string): void

To:

export function storeMessage(msg: proto.IWebMessageInfo, chatJid: string, isFromMe: boolean, pushName?: string, transcribedContent?: string): void

Update the content extraction to use transcribed content if provided:

const content = transcribedContent ||
  msg.message?.conversation ||
  msg.message?.extendedTextMessage?.text ||
  msg.message?.imageMessage?.caption ||
  msg.message?.videoMessage?.caption ||
  (msg.message?.audioMessage?.ptt ? '[Voice Message]' : '') ||
  '';
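
The precedence in this chain can be checked in isolation. The sketch below copies the expression into a standalone function; MinimalMessage and extractContent are names made up for the example, and the real storeMessage in src/db.ts may differ in its surrounding code.

```typescript
// Stand-in for the WAMessage fields the chain reads (an assumption for illustration).
interface MinimalMessage {
  message?: {
    conversation?: string | null;
    extendedTextMessage?: { text?: string | null } | null;
    imageMessage?: { caption?: string | null } | null;
    videoMessage?: { caption?: string | null } | null;
    audioMessage?: { ptt?: boolean | null } | null;
  } | null;
}

// Same fallback chain as above: the transcript wins, then text, captions, placeholder.
function extractContent(msg: MinimalMessage, transcribedContent?: string): string {
  return (
    transcribedContent ||
    msg.message?.conversation ||
    msg.message?.extendedTextMessage?.text ||
    msg.message?.imageMessage?.caption ||
    msg.message?.videoMessage?.caption ||
    (msg.message?.audioMessage?.ptt ? '[Voice Message]' : '') ||
    ''
  );
}

const voiceNote: MinimalMessage = { message: { audioMessage: { ptt: true } } };

console.log(extractContent(voiceNote, 'see you at five')); // prints "see you at five"
console.log(extractContent(voiceNote));                    // prints "[Voice Message]"
console.log(extractContent(voiceNote, ''));                // prints "[Voice Message]"
```

Because the chain uses ||, an empty transcript falls through to the '[Voice Message]' placeholder instead of being stored as an empty string.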

Step 5: Integrate Transcription into Message Handler

Note: Every voice message in a registered group is transcribed, regardless of the trigger word. This is because:

Voice notes can't easily include a trigger word
Users expect voice notes to work the same as text messages
The transcribed content is stored in the database for context, even if it doesn't trigger the agent
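
The policy described in this note can be sketched as a small decision function. This is only an illustration: the names IncomingMessage, Decision, and decide are hypothetical, and the prefix match on the trigger word is an assumption about how triggering works, not NanoClaw's actual logic.

```typescript
// Hypothetical, simplified view of an incoming message after transcription.
interface IncomingMessage {
  chatJid: string;
  content: string; // text body, or the transcript for a voice note
}

interface Decision {
  store: boolean;       // save the content for context
  invokeAgent: boolean; // hand the message to the agent
}

// Store everything from registered groups; invoke the agent only on a trigger match.
function decide(
  msg: IncomingMessage,
  registeredGroups: Set<string>,
  triggerWord: string
): Decision {
  if (!registeredGroups.has(msg.chatJid)) {
    return { store: false, invokeAgent: false };
  }
  // Assumption: a case-insensitive prefix match stands in for the real trigger check.
  const invokeAgent = msg.content
    .trim()
    .toLowerCase()
    .startsWith(triggerWord.toLowerCase());
  return { store: true, invokeAgent };
}

const groups = new Set(['group-1']);
const transcript: IncomingMessage = { chatJid: 'group-1', content: 'running ten minutes late' };

// Stored for context even though it does not trigger the agent.
console.log(decide(transcript, groups, '@bot'));
```

The point of the sketch is that storing and triggering are separate decisions, so a transcript is kept for context whether or not it triggers the agent.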

Read src/index.ts and find the sock.ev.on('messages.upsert', ...) event handler.

Change the callback from synchronous to async:

sock.ev.on('messages.upsert', async ({ messages }) => {

Inside the loop where messages are stored, add voice message detection and transcription:

// Only store full message content for registered groups
if (registeredGroups[chatJid]) {
  // Check if this is a voice message
  if (msg.message.audioMessage?.ptt) {
    try {
      // Import transcription module
      const { transcribeAudioMessage } = await import('./transcription.js');
      const transcript = await transcribeAudioMessage(msg, sock);

      if (transcript) {
