Voxtral Small 24B

Mistral · Speech-to-Text · Multilingual · Generally Available · Apache 2.0 · vm-mst-001

About

A 24B-parameter multilingual automatic speech recognition (ASR) model that outperforms Whisper large-v3 on common benchmarks. Its 32K-token context window lets it process long-form audio in a single forward pass, and it supports structured output.

Capabilities (5)

Beats Whisper large-v3

32K context window

Long-form audio processing

Structured output

Multilingual transcription

Key Highlights

Outperforms Whisper large-v3 across standard ASR benchmarks

32K context window processes up to roughly 30 minutes of audio in a single transcription pass

Apache 2.0 license with open weights for unrestricted use

Use Cases

Meeting Transcription

Transcribe meetings in real time with speaker identification and punctuation.

Call Center Analytics

Analyze customer calls at scale with sentiment detection and keyword spotting.

Content Indexing

Convert audio and video libraries into searchable text archives.

Live Captioning

Provide real-time captions for broadcasts, presentations, and live events.
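
As an illustration of the Content Indexing use case above, here is a minimal sketch of turning transcripts into a searchable archive with an in-memory inverted index. Everything in it (`TranscriptIndex`, `tokenize`, `addTranscript`, `search`) is a hypothetical helper written for this example, not part of any Mistral or `@arkitekton/voice` API, and a production archive would more likely sit on a dedicated search engine.

```typescript
// Hypothetical in-memory inverted index: word -> ids of transcripts containing it.
type TranscriptIndex = Map<string, Set<string>>;

// Lowercase the text and split it into runs of Unicode letters and digits.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[\p{L}\p{N}]+/gu) ?? [];
}

// Register every word of a transcript under the given document id.
function addTranscript(index: TranscriptIndex, docId: string, text: string): void {
  for (const word of tokenize(text)) {
    let ids = index.get(word);
    if (!ids) {
      ids = new Set<string>();
      index.set(word, ids);
    }
    ids.add(docId);
  }
}

// Return the sorted ids of transcripts that contain every word of the query.
function search(index: TranscriptIndex, query: string): string[] {
  const words = tokenize(query);
  if (words.length === 0) return [];
  let hits: string[] | null = null;
  for (const word of words) {
    const ids = index.get(word) ?? new Set<string>();
    hits = hits === null ? Array.from(ids) : hits.filter((id) => ids.has(id));
  }
  return (hits ?? []).sort();
}
```

Each transcript returned by the model would be passed to `addTranscript` under its recording id; `search` then answers multi-word queries by intersecting the per-word id sets.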

Code Example

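// Voxtral Small 24B: Speech-to-Text
import { readFile } from "node:fs/promises";
import { transcribe } from "@arkitekton/voice";

// Load the audio to transcribe. The path is a placeholder, and passing a
// Buffer is an assumption; check which input types transcribe() accepts.
const audioFile = await readFile("meeting.wav");

const result = await transcribe({
  model: "vm-mst-001",
  vendor: "mistral",
  audio: audioFile,
  language: "en", // language spoken in the recording
  options: {
    punctuate: true, // add punctuation to the transcript
    diarize: true, // label speakers
    smart_format: true,
  },
});

console.log("Transcript:", result.text);
console.log("Confidence:", result.confidence);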
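
Recordings longer than a single pass can hold need to be split before they are sent to `transcribe`. The sketch below plans overlapping time windows for such a recording. `planWindows` and `AudioWindow` are hypothetical names introduced for this example, and the maximum window length is left as a parameter because how much audio fits in one pass depends on the deployment.

```typescript
// Hypothetical helper, not part of @arkitekton/voice.
interface AudioWindow {
  startSec: number;
  endSec: number;
}

// Split a recording of totalSec seconds into windows of at most maxWindowSec,
// where consecutive windows share overlapSec seconds so that words cut at a
// boundary appear whole in at least one window.
function planWindows(totalSec: number, maxWindowSec: number, overlapSec = 0): AudioWindow[] {
  if (totalSec < 0 || maxWindowSec <= 0) {
    throw new RangeError("totalSec must be >= 0 and maxWindowSec must be > 0");
  }
  if (overlapSec < 0 || overlapSec >= maxWindowSec) {
    throw new RangeError("overlapSec must be >= 0 and smaller than maxWindowSec");
  }
  const windows: AudioWindow[] = [];
  const step = maxWindowSec - overlapSec;
  for (let start = 0; start < totalSec; start += step) {
    const end = Math.min(start + maxWindowSec, totalSec);
    windows.push({ startSec: start, endSec: end });
    if (end === totalSec) break; // the last window reached the end of the audio
  }
  return windows;
}
```

For example, `planWindows(100, 40, 10)` yields three windows covering 0 to 40, 30 to 70, and 60 to 100 seconds. Each window would then be cut from the source audio and transcribed separately, with the overlapping text deduplicated when the pieces are merged.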

Related Models

NeMo ASR (NVIDIA)

Riva (NVIDIA)

Parakeet (NVIDIA)

gpt-4o-realtime (OpenAI)

gpt-4o-mini-realtime (OpenAI)

gpt-4o-mini-transcribe (OpenAI)

Quick Stats

Languages: 40 supported

License: Apache 2.0

Pricing: Open-weight / self-hosted

Status: Generally Available

Vendor

Mistral

Open-weight multilingual speech models from Europe

