Voxtral Mini 3B

Mistral | Speech-to-Text | Edge / On-Device | Multilingual | Generally Available | Apache 2.0 | vm-mst-002

About

A compact 3B-parameter automatic speech recognition (ASR) model optimized for edge and local deployment. It delivers strong multilingual accuracy at a fraction of the compute cost of larger models, which makes it viable for on-device and embedded use cases.

Capabilities (5)

3B parameter efficiency

Edge deployment

Low compute requirements

Multilingual ASR

Quantization support

Key Highlights

3B parameters enable real-time transcription on consumer hardware

Strong multilingual accuracy despite compact model size

Quantization-friendly architecture for INT4/INT8 edge deployment
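
To make the quantization point concrete, the following is a rough back-of-the-envelope sketch of weight storage for a 3B-parameter model at FP16, INT8, and INT4. It counts weights only (parameters times bits per weight) and ignores activations, caches, and runtime overhead, so real memory use will be higher. The function name `weightGigabytes` is illustrative and not part of any SDK.

```typescript
// Weights-only memory estimate for a 3B-parameter model.
// Assumption: every parameter is stored at the same bit width.
const PARAMS = 3e9;

function weightGigabytes(params: number, bitsPerWeight: number): number {
  // bits -> bytes -> gigabytes (1 GB = 10^9 bytes)
  return (params * bitsPerWeight) / 8 / 1e9;
}

const precisions: Array<[string, number]> = [
  ["FP16", 16],
  ["INT8", 8],
  ["INT4", 4],
];

for (const [name, bits] of precisions) {
  console.log(`${name}: ~${weightGigabytes(PARAMS, bits).toFixed(1)} GB`);
}
// prints:
// FP16: ~6.0 GB
// INT8: ~3.0 GB
// INT4: ~1.5 GB
```

Under these assumptions, INT4 cuts weight storage to a quarter of FP16, which is the main reason low-bit quantization matters for edge targets.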

Use Cases

Meeting Transcription

Transcribe meetings in real time with speaker identification and punctuation.

Call Center Analytics

Analyze customer calls at scale with sentiment detection and keyword spotting.

Content Indexing

Convert audio and video libraries into searchable text archives.

Live Captioning

Provide real-time captions for broadcasts, presentations, and live events.
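
As an illustration of the content-indexing use case, here is a minimal sketch of an in-memory inverted index built from transcript text. It is not tied to any particular SDK: the names `tokenize`, `addTranscript`, and `search` are hypothetical, and the tokenizer is a deliberately simple whitespace split that strips common punctuation, which will not segment languages written without spaces.

```typescript
// Maps each term to the set of document IDs whose transcript contains it.
type TranscriptIndex = Map<string, Set<string>>;

// Lowercase, split on whitespace, strip leading/trailing punctuation.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/\s+/)
    .map((t) => t.replace(/^[.,!?;:"'()]+|[.,!?;:"'()]+$/g, ""))
    .filter((t) => t.length > 0);
}

// Add one transcript to the index under the given document ID.
function addTranscript(index: TranscriptIndex, docId: string, text: string): void {
  for (const term of tokenize(text)) {
    let ids = index.get(term);
    if (!ids) {
      ids = new Set<string>();
      index.set(term, ids);
    }
    ids.add(docId);
  }
}

// Return the sorted IDs of documents containing the exact term.
function search(index: TranscriptIndex, term: string): string[] {
  const ids = index.get(term.toLowerCase());
  return ids ? Array.from(ids).sort() : [];
}
```

A production archive would more likely hand transcripts to a dedicated search engine; the sketch only shows the shape of the data flow from transcript text to searchable terms.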

Code Example

// Voxtral Mini 3B — Speech-to-Text
import { transcribe } from "@arkitekton/voice";

const result = await transcribe({
  model: "vm-mst-002",
  vendor: "mistral",
  audio: audioFile, // audio input supplied by the caller (not defined in this snippet)
  language: "en",
  options: {
    punctuate: true,
    diarize: true,
    smart_format: true,
  },
});

console.log("Transcript:", result.text);
console.log("Confidence:", result.confidence);
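
The example above logs a confidence value alongside the transcript. One way to use it, sketched here under assumptions, is to reject low-confidence results before they reach downstream systems. The `TranscriptResult` shape is inferred only from the two fields the example reads, the assumption that confidence lies between 0 and 1 is not confirmed by the source, and the 0.8 default threshold is arbitrary.

```typescript
// Hypothetical result shape, inferred from the fields logged in the example.
interface TranscriptResult {
  text: string;
  confidence: number; // assumed to be in the range [0, 1]
}

// Returns the trimmed transcript, or null when confidence is below the threshold.
function acceptTranscript(
  result: TranscriptResult,
  minConfidence: number = 0.8, // arbitrary default; tune per application
): string | null {
  return result.confidence >= minConfidence ? result.text.trim() : null;
}
```

Returning `null` rather than throwing keeps the decision with the caller, who might retry with a different language hint or route the audio to manual review.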

Related Models

NeMo ASR (NVIDIA)

Riva (NVIDIA)

Parakeet (NVIDIA)

gpt-4o-realtime (OpenAI)

gpt-4o-mini-realtime (OpenAI)

gpt-4o-mini-transcribe (OpenAI)

Quick Stats

Latency: <200 ms on-device

Languages: 30 supported

License: Apache 2.0

Pricing: Open-weight / self-hosted

Status: Generally Available

Vendor

Mistral

Open-weight multilingual speech models from Europe

