Introducing the Nvidia Speech-to-Text Plugin in VideoSDK

Speech recognition is a critical building block for real-time AI voice agents. To deliver fast, accurate, and production-ready transcription, VideoSDK integrates with Nvidia Speech-to-Text (STT), a high-performance, low-latency speech recognition solution designed for real-time applications.

In this blog, we’ll walk through how Nvidia STT works with the VideoSDK Agents SDK and how you can quickly integrate it into your AI voice pipeline.

Why Nvidia STT?

Nvidia STT is built for speed and accuracy. It is well suited for real-time voice agents where low latency, streaming transcription, and stable performance are essential.

With VideoSDK’s plugin-based architecture, you can easily swap or test STT providers, which makes Nvidia STT a strong choice for production-grade voice experiences.

Installation

To get started, install the Nvidia plugin for VideoSDK Agents:

pip install "videosdk-plugins-nvidia"

This package adds native support for Nvidia STT inside the VideoSDK Agents ecosystem.

Authentication

The Nvidia STT plugin requires an Nvidia API key, and the VideoSDK Agents SDK requires a VideoSDK authentication token, which you get by signing up at VideoSDK. Set both as environment variables in your .env file:

NVIDIA_API_KEY=your-nvidia-api-key
VIDEOSDK_AUTH_TOKEN=your-videosdk-auth-token

When using environment variables, you don’t need to pass the API key directly in your code. The SDK automatically picks it up at runtime.
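
As a quick sanity check before starting the agent, you could verify that both variables are actually set. The snippet below is a minimal sketch using only the Python standard library. The helper name missing_env_vars is hypothetical and not part of the VideoSDK SDK, and it assumes the values from your .env file have already been loaded into the process environment (for example by your shell or a dotenv loader).

```python
import os
from typing import Iterable, List, Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED_VARS = ("NVIDIA_API_KEY", "VIDEOSDK_AUTH_TOKEN")


def missing_env_vars(
    names: Iterable[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the names that are unset or blank in the given environment.

    Hypothetical helper for illustration; not part of the VideoSDK SDK.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

Running a check like this at startup tends to surface a missing key as a clear message instead of an authentication error later in the pipeline.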

Importing Nvidia STT

Once installed, import the Nvidia STT plugin into your project:

from videosdk.plugins.nvidia import NvidiaSTT

Example: Using Nvidia STT in a Cascading Pipeline

from videosdk.plugins.nvidia import NvidiaSTT
from videosdk.agents import CascadingPipeline

# Initialize the Nvidia STT model
stt = NvidiaSTT(
    model="parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer",
    language_code="en-US",
    profanity_filter=False,
    automatic_punctuation=True
)

# Add STT to the cascading pipeline
pipeline = CascadingPipeline(stt=stt)

Configuration Options

Nvidia STT exposes several configuration options so you can fine-tune transcription behavior:

api_key: Nvidia API key (optional if set via environment variable)
model: Nvidia Riva STT model to use
server: Riva server address (default: grpc.nvcf.nvidia.com:443)
function_id: Nvidia service function ID
language_code: Language for transcription (default: en-US)
sample_rate: Audio sample rate in Hz (default: 16000)
profanity_filter: Enable or disable profanity filtering
automatic_punctuation: Enable automatic punctuation
use_ssl: Enable SSL connection

These options make it easy to adapt Nvidia STT to different real-world voice scenarios.
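
To make the options concrete, here is a minimal sketch that collects them into a plain dictionary of keyword arguments. The option names and the three defaults come from the list above. The helper build_stt_options is hypothetical, and the assumption that the NvidiaSTT constructor accepts exactly these keyword names is ours, so check the plugin docs before relying on it. Options without a documented default are left out unless you set them.

```python
from typing import Any, Dict, Optional


def build_stt_options(
    model: Optional[str] = None,
    function_id: Optional[str] = None,
    api_key: Optional[str] = None,  # leave as None to rely on NVIDIA_API_KEY
    server: str = "grpc.nvcf.nvidia.com:443",  # documented default
    language_code: str = "en-US",  # documented default
    sample_rate: int = 16000,  # documented default, in Hz
    profanity_filter: Optional[bool] = None,
    automatic_punctuation: Optional[bool] = None,
    use_ssl: Optional[bool] = None,
) -> Dict[str, Any]:
    """Collect Nvidia STT options into a kwargs dict, dropping unset ones.

    Hypothetical helper for illustration; not part of the VideoSDK SDK.
    """
    if sample_rate <= 0:
        raise ValueError("sample_rate must be a positive number of Hz")

    options: Dict[str, Any] = {
        "api_key": api_key,
        "model": model,
        "server": server,
        "function_id": function_id,
        "language_code": language_code,
        "sample_rate": sample_rate,
        "profanity_filter": profanity_filter,
        "automatic_punctuation": automatic_punctuation,
        "use_ssl": use_ssl,
    }
    # Keep explicit False values; drop only options that were never set.
    return {key: value for key, value in options.items() if value is not None}


options = build_stt_options(
    model="parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer",
    profanity_filter=False,
    automatic_punctuation=True,
)
# Assumed usage, not verified against the plugin: stt = NvidiaSTT(**options)
```

Keeping the options in one dictionary makes it easier to load them from a config file or to compare two configurations side by side while testing.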

Conclusion

By integrating Nvidia STT with VideoSDK Agents, you get a powerful, flexible speech recognition layer that fits naturally into real-time AI voice workflows. Whether you’re testing individual components or deploying a full voice agent pipeline, Nvidia STT gives you the speed and reliability required for modern conversational experiences.

Resources and Next Steps

Read more about Nvidia Riva STT.
Check out the full code implementation on GitHub.
Explore more: read the docs on the Nvidia STT plugin.
Learn how to deploy your AI Agents.
👉 Share your thoughts, roadblocks, or success stories in the comments or join our Discord community. We’re excited to learn from your journey and help you build even better AI-powered communication tools!