WhisperX

Hugging Face / Open Source · Speech-to-Text · Multilingual · Generally Available · BSD-4-Clause · vm-hf-002

About

An enhanced Whisper pipeline that adds word-level timestamps via forced alignment, voice activity detection for accurate segmentation, and speaker diarization. Batched inference makes it significantly faster than vanilla Whisper.

Capabilities (5)

Word-level timestamps

Speaker diarization

Voice activity detection

Batched inference (up to 70x real-time)

Forced alignment
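
To make the link between voice activity detection and batched inference more concrete, here is a minimal TypeScript sketch. It is not WhisperX's actual code. It assumes the VAD step yields sorted, non-overlapping speech regions in seconds, merges them into chunks of at most 30 seconds (chosen here to match Whisper's 30-second input window), and groups the chunks into batches. The names SpeechRegion, mergeRegions, and toBatches are hypothetical.

```typescript
// Hypothetical types and helpers for illustration; not part of WhisperX.
interface SpeechRegion {
  start: number; // seconds
  end: number; // seconds
}

// Greedily merge consecutive speech regions into chunks of at most
// maxChunkSec seconds. Assumes regions are sorted and non-overlapping.
// A single region longer than maxChunkSec is kept whole, not split.
function mergeRegions(
  regions: SpeechRegion[],
  maxChunkSec = 30,
): SpeechRegion[] {
  const chunks: SpeechRegion[] = [];
  let current: SpeechRegion | null = null;
  for (const region of regions) {
    if (current !== null && region.end - current.start <= maxChunkSec) {
      current.end = region.end; // extend the open chunk
    } else {
      if (current !== null) chunks.push(current);
      current = { start: region.start, end: region.end };
    }
  }
  if (current !== null) chunks.push(current);
  return chunks;
}

// Split a list of chunks into batches of at most batchSize items,
// so that each batch can go through the model in one forward pass.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}
```

Merging before batching keeps the number of model calls low while still skipping long stretches of silence between regions.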

Key Highlights

Up to 70x real-time transcription speed (reported with Whisper large-v2) through a batched inference pipeline

Accurate word-level timestamps via phoneme-based forced alignment

Built-in speaker diarization labels who spoke each segment
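
As an illustration of how word-level timestamps and diarization output can be combined, the sketch below labels each word with the speaker whose turns overlap it the most in time. This is an assumed, simplified approach, not a copy of WhisperX's implementation. The Word and SpeakerTurn shapes and the assignSpeakers function are hypothetical.

```typescript
// Hypothetical shapes for illustration; not WhisperX's actual output format.
interface Word {
  word: string;
  start: number; // seconds
  end: number; // seconds
  speaker?: string;
}

interface SpeakerTurn {
  speaker: string;
  start: number; // seconds
  end: number; // seconds
}

// Length in seconds of the overlap between a word and a speaker turn.
function overlapSec(w: Word, t: SpeakerTurn): number {
  return Math.max(0, Math.min(w.end, t.end) - Math.max(w.start, t.start));
}

// Label each word with the speaker whose turns overlap it the most.
// Words that overlap no turn are returned without a speaker label.
function assignSpeakers(words: Word[], turns: SpeakerTurn[]): Word[] {
  return words.map((w) => {
    const totals = new Map<string, number>();
    let best: string | undefined;
    let bestTotal = 0;
    for (const t of turns) {
      const o = overlapSec(w, t);
      if (o <= 0) continue;
      // Accumulate overlap per speaker; totals only grow, so tracking the
      // running maximum here yields the speaker with the largest final total.
      const total = (totals.get(t.speaker) ?? 0) + o;
      totals.set(t.speaker, total);
      if (total > bestTotal) {
        best = t.speaker;
        bestTotal = total;
      }
    }
    return best === undefined ? { ...w } : { ...w, speaker: best };
  });
}
```

A word that straddles a speaker change goes to whichever speaker covers more of its duration, which is why accurate word boundaries from forced alignment matter for diarization quality.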

Use Cases

Meeting Transcription

Transcribe meetings in real time with speaker identification and punctuation.

Call Center Analytics

Analyze customer calls at scale with sentiment detection and keyword spotting.

Content Indexing

Convert audio and video libraries into searchable text archives.

Live Captioning

Provide real-time captions for broadcasts, presentations, and live events.
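
For the Content Indexing use case above, word-level timestamps allow a search hit to point at the exact moment a term was spoken. The following is a minimal sketch under assumptions: the TimedWord shape and the buildIndex function are hypothetical, and the normalization strips only common ASCII punctuation, which would need extending for multilingual text.

```typescript
// Hypothetical shape for illustration; not a WhisperX or SDK type.
interface TimedWord {
  word: string;
  start: number; // seconds
  end: number; // seconds
}

// Build an inverted index from a normalized word to the start times
// (in seconds) of every occurrence of that word in the transcript.
function buildIndex(words: TimedWord[]): Map<string, number[]> {
  const index = new Map<string, number[]>();
  const edgePunctuation = /^[\s.,!?;:"'()]+|[\s.,!?;:"'()]+$/g;
  for (const w of words) {
    // Lowercase and trim surrounding punctuation so "Budget," matches "budget".
    const key = w.word.toLowerCase().replace(edgePunctuation, "");
    if (key === "") continue; // skip tokens that were only punctuation
    const hits = index.get(key);
    if (hits === undefined) {
      index.set(key, [w.start]);
    } else {
      hits.push(w.start);
    }
  }
  return index;
}
```

Looking up a term with index.get("budget") then returns the offsets at which a player could start playback.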

Code Example

// WhisperX: speech-to-text with speaker diarization
import { transcribe } from "@arkitekton/voice";

// audioFile is assumed to be supplied by the caller (the recording to transcribe).
const result = await transcribe({
  model: "vm-hf-002",
  vendor: "huggingface",
  audio: audioFile,
  language: "en",
  options: {
    punctuate: true,
    diarize: true, // label which speaker said each segment
    smart_format: true,
  },
});

console.log("Transcript:", result.text);
console.log("Confidence:", result.confidence);

Related Models

NeMo ASR (NVIDIA)

Riva (NVIDIA)

Parakeet (NVIDIA)

gpt-4o-realtime (OpenAI)

gpt-4o-mini-realtime (OpenAI)

gpt-4o-mini-transcribe (OpenAI)

Quick Stats

Languages: 97 supported

License: BSD-4-Clause

Pricing: Open-source / self-hosted

Status: Generally Available

Vendor

Hugging Face / Open Source

Community-driven open-source speech models and toolkits

