NeMo ASR

NVIDIA | Speech-to-Text | Custom Training | Edge / On-Device | Generally Available | Apache 2.0 | vm-nv-002

About

Production-grade automatic speech recognition framework within the NVIDIA NeMo toolkit. Provides CTC, RNN-T, and hybrid attention-based architectures with streaming support, speaker diarization, and punctuation restoration.
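To make the CTC option above more concrete, here is a minimal sketch of greedy CTC decoding: pick the highest-scoring token for each audio frame, merge consecutive repeats, then drop the blank symbol. This is an illustration of the general technique, not NeMo's implementation. The function names, the character-level vocabulary, and the position of the blank token are assumptions made for the example; production models typically emit subword tokens and often use beam search instead.

```typescript
// Greedy CTC decoding sketch (illustrative; not taken from NeMo).
// Assumptions: `logits` is a [frames][vocabSize] array of scores,
// `vocab[i]` is the text for token i, and `blankId` is the CTC blank.

// Index of the largest score in one frame.
function argmax(row: number[]): number {
  let best = 0;
  for (let i = 1; i < row.length; i++) {
    if (row[i] > row[best]) best = i;
  }
  return best;
}

function ctcGreedyDecode(
  logits: number[][],
  vocab: string[],
  blankId: number,
): string {
  const out: string[] = [];
  let prev = -1; // no previous token before the first frame
  for (const frame of logits) {
    const id = argmax(frame);
    // Emit only when the token changes and is not the blank.
    if (id !== prev && id !== blankId) out.push(vocab[id]);
    prev = id;
  }
  return out.join("");
}

// Five frames whose per-frame best tokens are: h, h, blank, i, i.
const vocab = ["_", "h", "i"];
const frames = [
  [0.1, 0.8, 0.1],
  [0.2, 0.7, 0.1],
  [0.9, 0.05, 0.05],
  [0.1, 0.1, 0.8],
  [0.1, 0.2, 0.7],
];
console.log(ctcGreedyDecode(frames, vocab, 0)); // prints "hi"
```

The blank symbol is what lets CTC output a doubled letter: the frame sequence a, blank, a decodes to "aa", while a, a decodes to "a". RNN-T models decode differently, since they condition each output on previously emitted tokens.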

Capabilities (5)

Streaming recognition

Speaker diarization

Punctuation restoration

Custom vocabulary

GPU-accelerated training


Key Highlights

End-to-end GPU-optimized training and inference pipeline

Modular architecture supports CTC, RNN-T, and hybrid models

Production-proven at scale across enterprise deployments

Use Cases

Meeting Transcription

Transcribe meetings in real time, with speaker labels and restored punctuation.

Call Center Analytics

Analyze customer calls at scale with sentiment detection and keyword spotting.

Content Indexing

Convert audio and video libraries into searchable text archives.

Live Captioning

Provide real-time captions for broadcasts, presentations, and live events.
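As a rough illustration of the Content Indexing use case, the sketch below builds a small inverted index over finished transcripts so they can be searched by keyword. It is independent of NeMo and of the catalog SDK: the `TranscriptDoc` type, the function names, and the ASCII-only tokenizer are all assumptions made for the example, and a real archive would more likely rely on a dedicated search engine.

```typescript
// Inverted-index sketch for searchable transcripts (illustrative only).
// Hypothetical record type: one transcript per media file.
interface TranscriptDoc {
  id: string;
  text: string;
}

// Lowercase, then split on anything that is not a-z, 0-9, or an apostrophe.
// Assumption: English text; other languages need a different tokenizer.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9']+/)
    .filter((t) => t.length > 0);
}

// Map each token to the set of document ids that contain it.
function buildIndex(docs: TranscriptDoc[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    for (const token of tokenize(doc.text)) {
      let ids = index.get(token);
      if (!ids) {
        ids = new Set<string>();
        index.set(token, ids);
      }
      ids.add(doc.id);
    }
  }
  return index;
}

// Return the sorted ids of documents containing every query term.
function search(index: Map<string, Set<string>>, query: string): string[] {
  const terms = tokenize(query);
  let matches: string[] | undefined;
  for (const term of terms) {
    const ids = index.get(term) ?? new Set<string>();
    matches =
      matches === undefined
        ? Array.from(ids)
        : matches.filter((id) => ids.has(id));
  }
  return (matches ?? []).sort();
}

const archive = buildIndex([
  { id: "ep-01", text: "Quarterly revenue grew faster than expected." },
  { id: "ep-02", text: "The revenue forecast for next quarter is flat." },
  { id: "ep-03", text: "Hiring plan and onboarding updates." },
]);
console.log(search(archive, "revenue forecast")); // prints [ 'ep-02' ]
```

Multi-word queries use AND semantics here, so adding terms narrows the result set. Storing timestamps alongside each token would let a hit link back to the matching moment in the audio, provided the transcription output includes them.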

Code Example

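// NeMo ASR: Speech-to-Text
import { transcribe } from "@arkitekton/voice";

// `audioFile` is assumed to be supplied by the caller; it is not defined
// in this snippet, and its expected type is not shown here.
const result = await transcribe({
  model: "vm-nv-002", // catalog ID for NeMo ASR
  vendor: "nvidia",
  audio: audioFile,
  language: "en",
  options: {
    punctuate: true, // restore punctuation
    diarize: true, // label speakers
    smart_format: true,
  },
});

console.log("Transcript:", result.text);
console.log("Confidence:", result.confidence);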
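The example above logs `result.confidence` but does not act on it. One possible follow-up, sketched below, routes low-confidence transcripts to human review. The `TranscriptResult` shape mirrors only the two fields printed above. The assumption that confidence is a score between 0 and 1, the `triage` helper, and the 0.85 threshold are all hypothetical and are not documented behavior of the SDK.

```typescript
// Confidence-gating sketch (illustrative; not part of the SDK).
// Hypothetical result shape: only the two fields logged in the example.
interface TranscriptResult {
  text: string;
  confidence: number;
}

type Triage = "accept" | "review";

// Assumption: confidence is a score in [0, 1]. The default threshold is
// arbitrary and should be tuned against labeled audio.
function triage(result: TranscriptResult, minConfidence = 0.85): Triage {
  // Treat missing or malformed scores as needing review.
  if (!Number.isFinite(result.confidence)) return "review";
  return result.confidence >= minConfidence ? "accept" : "review";
}

const sample: TranscriptResult = { text: "hello world", confidence: 0.62 };
console.log(triage(sample)); // prints "review"
```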

Related Models

PersonaPlex 7B (NVIDIA)

NeMo TTS (NVIDIA)

Riva (NVIDIA)

Parakeet (NVIDIA)

ACE (Avatar Cloud Engine) (NVIDIA)

gpt-4o-mini-transcribe (OpenAI)

Quick Stats

Latency: <200 ms (streaming)

Languages: 35 supported

License: Apache 2.0

Pricing: Open-source / self-hosted

Status: Generally Available

Vendor

NVIDIA

GPU-accelerated speech AI and conversational frameworks


