WhisperX
Speech-to-text with word-level timestamps, speaker diarization, and forced alignment — built on faster-whisper with batched inference for up to 70x realtime transcription speed.
WhisperX extends Whisper with three key capabilities that faster-whisper alone doesn't provide:
- Forced alignment — precise word-level timestamps via phoneme ASR models (wav2vec2)
- Speaker diarization — label who said what (via pyannote.audio)
- Batched inference — process au
[Description truncated. See the full README on GitHub.]
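
To make the "who said what" idea concrete, here is a minimal sketch of how word-level timestamps (from forced alignment) could be combined with speaker turns (from diarization). It is an illustration only and not WhisperX's actual implementation: the `Word` and `Turn` classes, the `assign_speakers` helper, and the sample timings are all hypothetical. Each word is given the speaker whose turn overlaps it the most in time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    """A transcribed word with start/end times in seconds (hypothetical type)."""
    text: str
    start: float
    end: float
    speaker: Optional[str] = None


@dataclass
class Turn:
    """A diarization segment: one speaker talking over a time span (hypothetical type)."""
    speaker: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Word]:
    """Label each word with the speaker whose turn overlaps it the most.

    Words that overlap no turn keep speaker=None.
    """
    for word in words:
        best_speaker = None
        best_overlap = 0.0
        for turn in turns:
            shared = overlap(word.start, word.end, turn.start, turn.end)
            if shared > best_overlap:
                best_overlap = shared
                best_speaker = turn.speaker
        word.speaker = best_speaker
    return words


# Made-up sample data: three aligned words and two speaker turns.
words = [
    Word("hello", 0.10, 0.45),
    Word("there", 0.50, 0.90),
    Word("hi", 1.20, 1.40),
]
turns = [
    Turn("SPEAKER_00", 0.0, 1.0),
    Turn("SPEAKER_01", 1.0, 2.0),
]

labeled = assign_speakers(words, turns)
for w in labeled:
    print(f"{w.start:.2f}-{w.end:.2f} {w.speaker}: {w.text}")
```

The sketch shows why accurate word boundaries matter: with segment-level timestamps alone, a segment that spans a speaker change cannot be split cleanly between the two speakers.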