Self-supervised speech representation model from Meta that learns powerful features from unlabeled audio. Fine-tunable for ASR with as little as 10 minutes of labeled data, enabling rapid adaptation to new languages and domains.
Fine-tune competitive ASR from just 10 minutes of labeled data
Self-supervised pre-training leverages vast unlabeled audio corpora
Cross-lingual transfer enables rapid new language bootstrapping
Transcribe meetings in real-time with speaker identification and punctuation.
Analyze customer calls at scale with sentiment detection and keyword spotting.
Convert audio and video libraries into searchable text archives.
Provide real-time captions for broadcasts, presentations, and live events.
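The "searchable text archives" and "keyword spotting" use cases above can be sketched with a minimal inverted index over transcript segments. The `Segment` shape and the helper functions below are illustrative assumptions for this sketch, not part of the @arkitekton/voice API.

```typescript
// Illustrative sketch: index transcript segments for keyword search.
// The Segment shape is an assumption, not an @arkitekton/voice type.
interface Segment {
  id: string;   // e.g. a hypothetical "call-042#3"
  text: string; // transcribed text for this segment
}

// Map each normalized word to the set of segment ids containing it.
function buildIndex(segments: Segment[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const seg of segments) {
    for (const word of seg.text.toLowerCase().match(/[a-z0-9']+/g) ?? []) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word)!.add(seg.id);
    }
  }
  return index;
}

// Return ids of segments containing every query word (simple AND search).
function search(index: Map<string, Set<string>>, query: string): string[] {
  const words = query.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  let hits: Set<string> | null = null;
  for (const word of words) {
    const ids = index.get(word) ?? new Set<string>();
    hits =
      hits === null
        ? new Set(ids)
        : new Set(Array.from(hits).filter((id) => ids.has(id)));
  }
  return Array.from(hits ?? new Set<string>()).sort();
}
```

In practice each transcription result would be split into diarized or time-stamped segments before indexing, so search hits can link back to the original audio position.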
// Wav2Vec 2.0 — Speech-to-Text
import { transcribe } from "@arkitekton/voice";

const result = await transcribe({
  model: "vm-hf-008",
  vendor: "huggingface",
  audio: audioFile,
  language: "en",       // ISO 639-1 language code
  options: {
    punctuate: true,    // restore punctuation and casing
    diarize: true,      // label each segment with a speaker
    smart_format: true, // format numbers, dates, and similar entities
  },
});

console.log("Transcript:", result.text);
console.log("Confidence:", result.confidence);