Whisper

OpenAI | Speech-to-Text | Multilingual | Edge / On-Device | Generally Available | MIT | vm-oai-004

About

Open-source multilingual speech recognition model trained on 680,000 hours of multilingual, multitask supervised audio collected from the web. It transcribes robustly across accents, background noise, and technical jargon, and detects the spoken language automatically.

Capabilities (5)

Multilingual transcription

Language detection

Translation to English

Noise robustness

Multiple model sizes


Key Highlights

Trained on 680K hours of multilingual web audio data

Robust to background noise, accents, and domain jargon

MIT license with sizes from 39M to 1.5B parameters
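
As a rough illustration of the size range above, the sketch below lists the five size tiers published in the Whisper repository (approximate parameter counts) and picks the largest one that fits a parameter budget. `pickModelSize` and its budget argument are hypothetical names used only for this example; they are not part of any SDK.

```typescript
// Approximate parameter counts (in millions) for the published Whisper size tiers,
// listed in ascending order.
const WHISPER_SIZES: { name: string; paramsM: number }[] = [
  { name: "tiny", paramsM: 39 },
  { name: "base", paramsM: 74 },
  { name: "small", paramsM: 244 },
  { name: "medium", paramsM: 769 },
  { name: "large", paramsM: 1550 },
];

// Hypothetical helper: return the largest tier whose parameter count fits the
// budget, or undefined if even the smallest tier is too large.
function pickModelSize(maxParamsM: number): string | undefined {
  let best: string | undefined;
  for (const size of WHISPER_SIZES) {
    if (size.paramsM <= maxParamsM) {
      best = size.name; // later entries are larger, so the last match wins
    }
  }
  return best;
}

console.log(pickModelSize(300)); // small
```

Parameter count is only a proxy here; an edge deployment would more likely budget on memory and latency measured on the target device.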

Use Cases

Meeting Transcription

Transcribe meetings in real time with speaker identification and punctuation.

Call Center Analytics

Analyze customer calls at scale with sentiment detection and keyword spotting.

Content Indexing

Convert audio and video libraries into searchable text archives.

Live Captioning

Provide real-time captions for broadcasts, presentations, and live events.
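
For the Content Indexing use case, one simple way to make transcripts searchable is an inverted index that maps each word to the recordings containing it. The following is a minimal sketch under stated assumptions: `TranscriptDoc`, `buildIndex`, and `search` are hypothetical names, and the tokenizer handles ASCII letters and digits only, so a multilingual archive would need Unicode-aware tokenization.

```typescript
// Hypothetical transcript record: a media ID plus its transcript text.
interface TranscriptDoc {
  id: string;
  text: string;
}

// Lowercase the text and split on anything that is not an ASCII letter or digit.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Build an inverted index: term -> set of IDs of documents containing that term.
function buildIndex(docs: TranscriptDoc[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const term of tokenize(doc.text)) {
      let ids = index.get(term);
      if (!ids) {
        ids = new Set<string>();
        index.set(term, ids);
      }
      ids.add(doc.id);
    }
  }
  return index;
}

// Return the sorted IDs of documents that contain every term in the query.
function search(index: Map<string, Set<string>>, query: string): string[] {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  const sets = terms.map((t) => index.get(t) ?? new Set<string>());
  const [first, ...rest] = sets;
  return Array.from(first)
    .filter((id) => rest.every((s) => s.has(id)))
    .sort();
}

const index = buildIndex([
  { id: "ep1", text: "Quarterly revenue grew, revenue targets met." },
  { id: "ep2", text: "Hiring plan and revenue forecast" },
]);
console.log(search(index, "revenue forecast")); // [ 'ep2' ]
```

An in-memory map like this suits small archives; larger libraries would typically hand the transcripts to a dedicated search engine instead.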

Code Example

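// Whisper: speech-to-text through the @arkitekton/voice SDK
import { readFile } from "node:fs/promises";
import { transcribe } from "@arkitekton/voice";

// Assumption: the SDK accepts raw audio bytes; "meeting.wav" is a placeholder path.
const audioFile = await readFile("meeting.wav");

const result = await transcribe({
  model: "vm-oai-004", // catalog ID for Whisper
  vendor: "openai",
  audio: audioFile,
  language: "en", // language of the recording
  options: {
    punctuate: true, // add punctuation to the transcript
    diarize: true, // label speakers
    smart_format: true,
  },
});

console.log("Transcript:", result.text);
console.log("Confidence:", result.confidence);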
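
For the Live Captioning use case, transcript output is commonly serialized as WebVTT cues. The sketch below assumes the transcription result can be broken into timed segments; the `Segment` shape and the `toWebVtt` helper are hypothetical and are not documented fields or functions of the SDK shown above.

```typescript
// Hypothetical segment shape (an assumption, not a documented SDK type).
interface Segment {
  start: number; // cue start, in seconds
  end: number; // cue end, in seconds
  text: string;
}

// Left-pad a non-negative integer with zeros to the given width.
function pad(n: number, width: number): string {
  let s = String(n);
  while (s.length < width) s = "0" + s;
  return s;
}

// Format seconds as a WebVTT timestamp: HH:MM:SS.mmm
function vttTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3600000);
  const m = Math.floor((ms % 3600000) / 60000);
  const s = Math.floor((ms % 60000) / 1000);
  return pad(h, 2) + ":" + pad(m, 2) + ":" + pad(s, 2) + "." + pad(ms % 1000, 3);
}

// Serialize segments as a WebVTT document: header, blank line, then one cue each.
function toWebVtt(segments: Segment[]): string {
  const cues = segments.map(
    (seg) => vttTime(seg.start) + " --> " + vttTime(seg.end) + "\n" + seg.text
  );
  return "WEBVTT\n\n" + cues.join("\n\n") + "\n";
}

console.log(toWebVtt([{ start: 0, end: 2.5, text: "Hello, everyone." }]));
```

The resulting string can be served as a `.vtt` file and attached to an HTML `<track>` element for playback with captions.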

Related Models

NeMo ASR (NVIDIA)

Riva (NVIDIA)

Parakeet (NVIDIA)

gpt-4o-realtime (OpenAI)

gpt-4o-mini-realtime (OpenAI)

gpt-4o-mini-transcribe (OpenAI)

Quick Stats

Languages: 97 supported

License: MIT

Pricing: Open-source / self-hosted

Status: Generally Available

Vendor

OpenAI

Foundation models for real-time voice and transcription

