At VideoSDK, our mission is to build the infrastructure for digital humans that runs on every device. We're expanding into real-time speech-to-speech and vision AI agents that run privately on-device or in a hybrid cloud setting, delivering up to 98% of GPT-4's accuracy at just 17.5% of its cost.
Unified SDK for On-Device SSLM + Cloud SSLM Collaboration
Today, we are launching NAMO-SSLM (Small Speech Language Model), a hybrid model with a state-of-the-art hybrid inference engine.
We’ve developed an inference engine that couples a streamlined on-device SSLM with a more powerful cloud-based SSLM, working in tandem for real-time speech use cases. The engine ensures that the heavy lifting (long-context processing and advanced reasoning) occurs only when needed, reducing cloud inference costs without sacrificing performance.
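To make the routing idea concrete, here is a minimal Python sketch of local-first inference with escalation to the cloud. All names here (`LocalSSLM`, `CloudSSLM`, `HybridRouter`, the `Turn` fields) are illustrative assumptions, not the actual VideoSDK API:

```python
# Hypothetical sketch: route each turn to the on-device model by default,
# escalating to the cloud only for long-context or reasoning-heavy turns.
from dataclasses import dataclass

@dataclass
class Turn:
    text: str
    needs_long_context: bool  # e.g. RAG over large documents
    needs_reasoning: bool     # multi-step planning / tool orchestration

class LocalSSLM:
    def respond(self, turn: Turn) -> str:
        # Fast on-device inference for typical conversational turns
        return f"[local] {turn.text}"

class CloudSSLM:
    def respond(self, turn: Turn) -> str:
        # Heavier cloud model, invoked only when needed
        return f"[cloud] {turn.text}"

class HybridRouter:
    def __init__(self):
        self.local, self.cloud = LocalSSLM(), CloudSSLM()

    def respond(self, turn: Turn) -> str:
        # Escalate only turns that need long context or advanced reasoning;
        # everything else stays on-device, cutting cloud inference cost.
        if turn.needs_long_context or turn.needs_reasoning:
            return self.cloud.respond(turn)
        return self.local.respond(turn)

router = HybridRouter()
print(router.respond(Turn("hello", False, False)))                        # stays on-device
print(router.respond(Turn("summarize this 200-page contract", True, True)))  # escalated
```

The cost savings fall out of this split: the cheaper on-device path handles the bulk of turns, and the cloud model is billed only for the hard ones.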
We’ve tackled three key challenges in end-to-end speech models: enabling multi-turn RAG for rich contextual responses, silent function calling for real-time tool use without delays, and client-side tool support like canvas and whiteboard via our Agent SDK.
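One of those challenges, silent function calling, can be sketched in a few lines: the tool call runs off the speech path so no filler speech ("let me check that for you") is injected while it executes. The function and helper names below are hypothetical, not the published Agent SDK surface:

```python
# Hypothetical sketch of silent function calling: the tool executes in the
# background and only the final answer reaches the voice stream.
import concurrent.futures

def lookup_order_status(order_id: str) -> str:
    # Stand-in for a real backend tool
    return f"Order {order_id} ships tomorrow."

TOOLS = {"lookup_order_status": lookup_order_status}

def handle_tool_call(name: str, args: dict, speak) -> None:
    # Run the tool off the speech thread; the agent stays quiet (or keeps
    # talking) instead of announcing the tool use aloud.
    with concurrent.futures.ThreadPoolExecutor() as pool:
        future = pool.submit(TOOLS[name], **args)
        result = future.result(timeout=2.0)  # bounded so speech stays real-time
    speak(result)  # only the result is voiced

spoken = []
handle_tool_call("lookup_order_status", {"order_id": "A42"}, spoken.append)
print(spoken[0])  # the only thing the user hears
```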
Achieving 20× Cost Reduction with 98% Cloud Performance
Our engine orchestrates a local-remote workflow, delivering real-time speech capabilities on-device with low latency while leveraging cloud infrastructure for complex tasks. The result is a 20× cost reduction at 98% of cloud performance: a cost-effective, privacy-friendly, high-performance solution.
Enterprise-Grade Privacy & Compliance
NAMO-SSLM keeps sensitive data on-device by default, ensuring documents and conversations never leave the device unless explicitly required. This hybrid design minimizes cloud dependency, reduces risk exposure, and supports GDPR, HIPAA, and SOC 2 compliance, making it more secure and privacy-first than traditional cloud-only models.
This approach not only meets the pressing need for robust compliance in highly regulated sectors like BFSI and Healthcare, but also benefits any organization demanding rapid, private, and cost-effective AI-driven customer experiences.
Open Sourcing NAMO-SSLM
We are excited to open-source NAMO-SSLM, a small yet powerful real-time multimodal model. The AI landscape is shifting from massive, resource-intensive models to lightweight, optimized small models, and for good reason. Small models like NAMO-SSLM offer a compelling mix of efficiency, speed, and cost-effectiveness, making them the smarter choice for real-world applications.
Key capabilities include:
- Runs on CPU: Real-time inference on consumer CPU devices.
- Multimodal (Voice + Vision): Native support for real-time speech, vision, and OCR capabilities.
- Low-Latency, Real-Time Processing: Real-time streaming with end-to-end latency as low as 80 ms.
- Multilingual Support: Multilingual and hybrid-language capabilities such as Hinglish.
- Multi-Turn RAG: Retrieves rich context across turns while keeping the conversation real-time.
- Voiced + Silent Function/Tool Calling: Function calls can be silent or voiced, with text as well as voice output.
- Client-Side Tool Support: Tools for building client-side UI/UX such as canvas and whiteboard.
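As a rough illustration of the last capability, a client-side tool can be registered so the model drives local UI (a whiteboard, a canvas) instead of only producing speech. The `Agent` class and method names below are a hypothetical sketch, not the published SDK API:

```python
# Hypothetical sketch: client-side tools render locally, so the tool call
# never round-trips user content to the cloud.
class Agent:
    def __init__(self):
        self.client_tools = {}

    def register_client_tool(self, name, handler):
        # Handler runs in the client app (browser, mobile), not on a server
        self.client_tools[name] = handler

    def on_tool_call(self, name, payload):
        return self.client_tools[name](payload)

def draw_on_whiteboard(payload):
    # Stand-in for real canvas/whiteboard rendering code
    return f"drew {payload['shape']} at {payload['x']},{payload['y']}"

agent = Agent()
agent.register_client_tool("whiteboard.draw", draw_on_whiteboard)
print(agent.on_tool_call("whiteboard.draw", {"shape": "circle", "x": 10, "y": 20}))
```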
NAMO-SSLM Github: https://github.com/videosdk-live/namo-sslm
The future of AI is efficient, real-time, and accessible on every device. With NAMO-SSLM, we're pioneering a hybrid on-device + cloud approach that delivers 5.7× cost efficiency while maintaining 98% of cloud-level performance.
Our hybrid architecture enables low-latency, private AI for a wide range of industries:
- Healthcare – Real-time AI assistants for clinical support
- Banking – Secure and seamless voice authentication
- Insurance – AI-powered automated claims processing
- Social – Instant multilingual voice translation
- Education – Personalized AI tutors for interactive learning
- Smart Glasses – Hands-free, AI-driven digital assistants
- Robots – Conversational AI for intelligent automation
- Gaming – Realistic, AI-powered NPC interactions
By open-sourcing NAMO-SSLM, we are empowering developers and businesses to build privacy-first, low-latency AI applications across industries—from healthcare to gaming, smart glasses to education.
The AI revolution is shifting from large, resource-heavy models to lightweight, real-time, multimodal intelligence. NAMO-SSLM is the future—blending speech and vision AI with real-time processing, multilingual capabilities, and CPU-optimized performance.
Join us in building the next generation of digital humans. 🚀