USM (Universal Speech Model)

Google | Speech-to-Text | Multilingual | Generally Available | Proprietary | vm-ggl-006

About

Google's 2B-parameter universal speech model trained on 12 million hours of audio spanning 300+ languages. Achieves state-of-the-art results on low-resource language recognition and serves as the backbone for YouTube auto-captions.

Capabilities (5)

300+ language support

12M hours training data

Low-resource language ASR

YouTube caption backbone

Transfer learning

Key Highlights

Trained on 12 million hours of audio across 300+ languages

Powers YouTube automatic captioning at global scale

State-of-the-art on low-resource and endangered language benchmarks

Use Cases

Meeting Transcription

Transcribe meetings in real time with speaker identification and punctuation.

Call Center Analytics

Analyze customer calls at scale with sentiment detection and keyword spotting.

Content Indexing

Convert audio and video libraries into searchable text archives.

Live Captioning

Provide real-time captions for broadcasts, presentations, and live events.
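
For the Content Indexing use case above, one common way to make transcripts searchable is an inverted index that maps each word to the recordings containing it. The sketch below is illustrative only: `TranscriptDoc`, `buildIndex`, and `search` are hypothetical names, not part of any vendor SDK, and the tokenizer handles only ASCII letters and digits for simplicity, so a multilingual archive would need a Unicode-aware tokenizer.

```typescript
// Hypothetical record: `id` names the media file, `text` is its transcript.
interface TranscriptDoc {
  id: string;
  text: string;
}

// Maps each lowercase token to the ids of the documents that contain it.
type InvertedIndex = Map<string, Set<string>>;

// Simplifying assumption: ASCII-only tokenization.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0);
}

function buildIndex(docs: TranscriptDoc[]): InvertedIndex {
  const index: InvertedIndex = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const token of tokenize(doc.text)) {
      let ids = index.get(token);
      if (ids === undefined) {
        ids = new Set<string>();
        index.set(token, ids);
      }
      ids.add(doc.id);
    }
  }
  return index;
}

// Returns the sorted ids of documents that contain every query term.
function search(index: InvertedIndex, query: string): string[] {
  const terms = tokenize(query);
  if (terms.length === 0) return [];
  let hits: string[] | null = null;
  for (const term of terms) {
    const ids = index.get(term) ?? new Set<string>();
    hits = hits === null ? Array.from(ids) : hits.filter((id) => ids.has(id));
  }
  return (hits ?? []).sort();
}
```

Calling `buildIndex` once over the transcript library and then `search(index, "budget hiring")` returns only the recordings whose transcripts mention both words. A production archive would more likely use a dedicated search engine; the sketch only shows the underlying idea.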

Code Example

// USM (Universal Speech Model) — Speech-to-Text
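import { readFile } from "node:fs/promises";
import { transcribe } from "@arkitekton/voice";

// Assumption: `audioFile` holds the raw bytes of a local recording. The path is a
// placeholder, and the input types `transcribe` accepts are not documented here.
const audioFile = await readFile("./meeting.wav");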

const result = await transcribe({
  model: "vm-ggl-006",
  vendor: "google",
  audio: audioFile,
  language: "en",
  options: {
    punctuate: true,
    diarize: true,
    smart_format: true,
  },
});

console.log("Transcript:", result.text);
console.log("Confidence:", result.confidence);
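
The example above logs `result.confidence` without acting on it. A minimal sketch of how a caller might use that value is shown below. It assumes `confidence` is a number between 0 and 1, which this page does not state, and `TranscriptionResult`, `needsReview`, and the 0.85 default threshold are hypothetical choices made for illustration.

```typescript
// Assumed result shape, based only on the two fields logged in the example above.
interface TranscriptionResult {
  text: string;
  confidence: number; // assumption: 0 (no confidence) to 1 (full confidence)
}

// Flags a transcript for human review when confidence is missing, invalid, or low.
function needsReview(result: TranscriptionResult, threshold: number = 0.85): boolean {
  return !Number.isFinite(result.confidence) || result.confidence < threshold;
}
```

A pipeline could route flagged transcripts to manual correction and publish the rest automatically. The right threshold depends on the audio quality and on how costly a transcription error is.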

Related Models

NeMo ASR

NVIDIA

Riva

NVIDIA

Parakeet

NVIDIA

gpt-4o-realtime

OpenAI

gpt-4o-mini-realtime

OpenAI

gpt-4o-mini-transcribe

OpenAI

Quick Stats

Languages: 300+

License: Proprietary

Pricing: Not publicly available

Status: Generally Available

Vendor

Google

Cloud-scale speech services with multilingual reach

