Speech recognition is a critical building block for real-time AI voice agents. To deliver fast, accurate, and production-ready transcription, VideoSDK integrates with Nvidia Speech-to-Text (STT), a high-performance, low-latency speech recognition solution designed for real-time applications.

In this blog, we’ll walk through how Nvidia STT works with the VideoSDK Agents SDK and how you can quickly integrate it into your AI voice pipeline.

Why Nvidia STT?

Nvidia STT is built for speed and accuracy. It is well suited for real-time voice agents where low latency, streaming transcription, and stable performance are essential.

With VideoSDK’s plugin-based architecture, you can easily swap or test STT providers, making Nvidia STT a strong choice for production-grade voice experiences.

Installation

To get started, install the Nvidia-enabled VideoSDK Agents plugin:

pip install "videosdk-plugins-nvidia"

This package adds native support for Nvidia STT inside the VideoSDK Agents ecosystem.
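
To confirm the install worked before wiring anything up, a quick check like the one below can help. This is a minimal sketch using only the Python standard library. The helper name `plugin_installed` is hypothetical and not part of the VideoSDK SDK; the module path is the one used in the import statement in this post.

```python
import importlib.util


def plugin_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # ImportError: a parent package (for example `videosdk`) is missing.
        # ValueError: the module is loaded but has no usable spec.
        return False


if __name__ == "__main__":
    name = "videosdk.plugins.nvidia"
    status = "found" if plugin_installed(name) else "missing"
    print(f"{name}: {status}")
```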

Authentication

  1. The Nvidia STT plugin requires an Nvidia API key.
  2. Sign up at VideoSDK to get an authentication token.

Set both values as environment variables in your .env file:

NVIDIA_API_KEY=your-nvidia-api-key
VIDEOSDK_AUTH_TOKEN=your-videosdk-auth-token

When using environment variables, you don’t need to pass the API key directly in your code. The SDK automatically picks it up at runtime.
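
Because a missing key usually only surfaces once the agent starts, it can help to check for it up front. The sketch below uses only the standard library and assumes the two variable names shown above. `missing_env_vars` is a hypothetical helper, not an SDK function. It only inspects the process environment, so it assumes your tooling has already loaded the .env file.

```python
import os
from typing import List, Mapping, Optional, Sequence

REQUIRED_VARS = ("NVIDIA_API_KEY", "VIDEOSDK_AUTH_TOKEN")


def missing_env_vars(
    required: Sequence[str] = REQUIRED_VARS,
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names in `required` that are unset or blank."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    print("Missing:", ", ".join(missing) if missing else "none")
```

Passing a plain dictionary as `environ` makes the check easy to exercise in tests without touching the real environment.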

Importing Nvidia STT

Once installed, import the Nvidia STT plugin into your project:

from videosdk.plugins.nvidia import NvidiaSTT

Example: Using Nvidia STT in a Cascading Pipeline

from videosdk.plugins.nvidia import NvidiaSTT
from videosdk.agents import CascadingPipeline

# Initialize the Nvidia STT model
stt = NvidiaSTT(
    model="parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer",
    language_code="en-US",
    profanity_filter=False,
    automatic_punctuation=True
)

# Add STT to the cascading pipeline
pipeline = CascadingPipeline(stt=stt)

Configuration Options

Nvidia STT exposes several configuration options so you can fine-tune transcription behavior:

  • api_key: Nvidia API key (optional if set via environment variable)
  • model: Nvidia Riva STT model to use
  • server: Riva server address (default: grpc.nvcf.nvidia.com:443)
  • function_id: Nvidia service function ID
  • language_code: Language for transcription (default: en-US)
  • sample_rate: Audio sample rate in Hz (default: 16000)
  • profanity_filter: Enable or disable profanity filtering
  • automatic_punctuation: Enable automatic punctuation
  • use_ssl: Enable SSL connection

These options make it easy to adapt Nvidia STT to different real-world voice scenarios.
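
As an illustration, the options above can be gathered in one place and passed to the constructor as keyword arguments. This is a sketch, not the plugin's documented usage. The helper names `build_stt_options` and `create_stt` and the `STT_LANGUAGE_CODE` and `STT_SAMPLE_RATE` environment variables are hypothetical. The `use_ssl=True` value is a choice made here rather than a documented default. The sketch also assumes `NvidiaSTT` accepts every listed option as a keyword argument, as it does for the four shown in the earlier example.

```python
import os
from typing import Any, Dict, Mapping, Optional


def build_stt_options(environ: Optional[Mapping[str, str]] = None) -> Dict[str, Any]:
    """Collect NvidiaSTT keyword arguments, allowing a few env overrides."""
    env = os.environ if environ is None else environ
    return {
        "model": "parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer",
        "server": "grpc.nvcf.nvidia.com:443",  # default listed above
        "language_code": env.get("STT_LANGUAGE_CODE", "en-US"),  # hypothetical variable
        "sample_rate": int(env.get("STT_SAMPLE_RATE", "16000")),  # hypothetical variable
        "profanity_filter": False,
        "automatic_punctuation": True,
        "use_ssl": True,  # chosen for this sketch; not a documented default
    }


def create_stt(environ: Optional[Mapping[str, str]] = None):
    """Build the STT instance; requires videosdk-plugins-nvidia to be installed."""
    # Imported lazily so build_stt_options() works without the plugin.
    from videosdk.plugins.nvidia import NvidiaSTT

    # api_key is omitted on purpose: it is read from NVIDIA_API_KEY.
    return NvidiaSTT(**build_stt_options(environ))
```

The object returned by `create_stt()` can then be passed to `CascadingPipeline(stt=...)` in the same way as in the earlier example.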

Conclusion

By integrating Nvidia STT with VideoSDK Agents, you get a powerful, flexible speech recognition layer that fits naturally into real-time AI voice workflows. Whether you’re testing individual components or deploying a full voice agent pipeline, Nvidia STT gives you the speed and reliability required for modern conversational experiences.

Resources and Next Steps