Google just launched Gemini 3.1 Flash Live Preview, its most capable real-time voice and audio model yet. If you're building AI voice agents, conversational apps, or anything that needs low-latency audio intelligence, this model is a big deal. And with VideoSDK's Python SDK, plugging it into your app takes just a few minutes.
In this blog, we'll walk through what the new model can do, and then build a working voice agent step by step using VideoSDK.
What's New in Gemini 3.1 Flash Live Preview
Google describes this as its "highest-quality audio and voice model yet," and there are a few things that actually back that up.
It's built for real-time, audio-first experiences. Unlike models that convert speech to text and then process it, Gemini 3.1 Flash Live works audio-to-audio: it hears you and responds as audio, keeping the conversation feeling natural and fast.
Here's what stands out:
- Lower latency than before. Compared to 2.5 Flash Native Audio, this model is noticeably faster. Fewer awkward pauses, snappier responses. That matters a lot when you're building voice agents where delays break the experience.
- It actually understands how you say things. The model picks up on acoustic nuances, pitch, pace, tone. So it can tell when you're asking a casual question vs. when you sound urgent or confused.
- Better background noise handling. It filters out noise more effectively, which means it works in real environments, not just quiet studios.
- Multilingual out of the box. Over 90 languages supported for real-time conversations.
- Longer conversation memory. It can follow the thread of a conversation for twice as long as the previous generation. So your agent won't "forget" what was said earlier in a long session.
- Tool use during live conversations. This one is huge for agent builders. The model can now trigger external tools (APIs, functions, searches) while a live conversation is happening, not just at the end of a turn.
- Multimodal awareness. It handles audio and video inputs together, so you can build agents that respond to what they see and hear at the same time.
The model ID is `gemini-3.1-flash-live-preview`.
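The walkthrough below uses VideoSDK's wrapper, but the model is also reachable directly through Google's `google-genai` Python SDK via the Live API. The helper below is a hypothetical sketch, not from this post: the function name is ours, and the config shape is assumed from the Live API's documented options (`response_modalities`, `speech_config` with a prebuilt voice).

```python
# Hypothetical helper (not part of any SDK): builds a config dict in the
# shape the google-genai Live API accepts, assumed from its documented options.
PREBUILT_VOICES = {"Puck", "Charon", "Kore", "Fenrir", "Aoede", "Leda", "Orus", "Zephyr"}

def build_live_config(voice: str = "Leda") -> dict:
    """Return a Live API config requesting audio responses in the given voice."""
    if voice not in PREBUILT_VOICES:
        raise ValueError(f"Unknown voice {voice!r}; pick one of {sorted(PREBUILT_VOICES)}")
    return {
        "response_modalities": ["AUDIO"],
        "speech_config": {
            "voice_config": {"prebuilt_voice_config": {"voice_name": voice}}
        },
    }

# Connection sketch (requires GOOGLE_API_KEY; not run here):
# from google import genai
# client = genai.Client()
# async with client.aio.live.connect(
#     model="gemini-3.1-flash-live-preview",
#     config=build_live_config("Kore"),
# ) as session:
#     ...  # stream audio in, receive audio out
```

If you stick with VideoSDK's SDK, you won't need this; it's here to show what the model expects at the protocol level.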
Building a Voice Agent with VideoSDK
VideoSDK gives you everything you need to wire Gemini 3.1 Flash Live into a real voice application. Here's how to get set up from scratch.
Step 1: Create and Activate a Python Virtual Environment
First, create a clean Python environment so your project dependencies stay isolated.
```bash
python3 -m venv venv
```

Activate it:

macOS/Linux:

```bash
source venv/bin/activate
```

Windows:

```bash
venv\Scripts\activate
```

You should see (venv) in your terminal, which means you're good to go.
Step 2: Set Up Your Environment Variables
Create a .env file in your project root and add your API keys:
```bash
VIDEOSDK_AUTH_TOKEN=your_videosdk_token_here
GOOGLE_API_KEY=your_google_api_key_here
```

You can get your VideoSDK auth token from the VideoSDK dashboard and your Google API key from Google AI Studio.
Important: when GOOGLE_API_KEY is set in your .env file, do not pass api_key as a parameter in your code; the SDK picks it up automatically.
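Because the SDK reads these variables from the environment, a quick preflight check can save you from a confusing startup failure. This is an optional sketch; the helper name is ours, not part of VideoSDK:

```python
import os

# The two variables this walkthrough relies on.
REQUIRED_KEYS = ("VIDEOSDK_AUTH_TOKEN", "GOOGLE_API_KEY")

def missing_keys(env=os.environ) -> list:
    """Return the required variable names that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

# Example: fail fast before starting the agent.
# missing = missing_keys()
# if missing:
#     raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```

Run it at the top of your entrypoint so a missing key produces a clear message instead of a failed API call mid-session.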
Step 3: Install the Required Packages
Install VideoSDK's agents SDK along with the Google plugin:
```bash
pip install "videosdk-agents[google]"
```

Step 4: Create Your Agent (main.py)
Create a file called main.py in your project folder and paste in the following code:
```python
from videosdk.agents import Agent, AgentSession, Pipeline, JobContext, RoomOptions, WorkerJob
from videosdk.plugins.google import GeminiRealtime, GeminiLiveConfig
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()],
)


class MyVoiceAgent(Agent):
    def __init__(self):
        super().__init__(
            instructions=(
                "You are VideoSDK's voice agent. You are a helpful voice "
                "assistant that can answer questions and help with tasks."
            ),
        )

    async def on_enter(self) -> None:
        await self.session.say("Hello, how can I help you today?")

    async def on_exit(self) -> None:
        await self.session.say("Goodbye!")


async def start_session(context: JobContext):
    agent = MyVoiceAgent()
    model = GeminiRealtime(
        model="gemini-3.1-flash-live-preview",
        # When GOOGLE_API_KEY is set in .env, don't pass the api_key parameter.
        # api_key="AIXXXXXXXXXXXXXXXXXXXX",
        config=GeminiLiveConfig(
            voice="Leda",  # Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, or Zephyr
            response_modalities=["AUDIO"],
        ),
    )
    pipeline = Pipeline(llm=model)
    session = AgentSession(agent=agent, pipeline=pipeline)
    await session.start(wait_for_participant=True, run_until_shutdown=True)


def make_context() -> JobContext:
    room_options = RoomOptions(
        # room_id="<room_id>",  # Replace with your actual room_id
        name="Gemini Realtime Agent",
        playground=True,
    )
    return JobContext(room_options=room_options)


if __name__ == "__main__":
    job = WorkerJob(entrypoint=start_session, jobctx=make_context)
    job.start()
```

To run the agent:

```bash
python main.py
```

Once you run this command, a playground URL will appear in your terminal. You can use this URL to interact with your AI agent.
What Can You Build With This?
Gemini 3.1 Flash Live + VideoSDK opens up a pretty wide range of real-world use cases:
- Customer support voice bots. Replace or supplement your call center with agents that actually understand tone and can handle multilingual customers in real time.
- AI meeting assistants. Agents that join calls, take notes, answer questions from participants, and trigger follow-up actions mid-conversation.
- Healthcare intake agents. Voice-based triage agents that collect patient information, ask follow-up questions, and route to the right department, all in a natural spoken conversation.
- Language tutors. Real-time conversation partners that catch pronunciation issues, adjust their pace based on the learner, and respond naturally.
- Voice-controlled IoT and home automation. Agents that listen continuously, understand context, and trigger device actions through tool use, all with sub-second response times.
- Live interview prep tools. Candidates practice answering questions aloud and get spoken feedback instantly.
Conclusion
Gemini 3.1 Flash Live Preview is a meaningful step forward for real-time voice AI. The improvements in latency, noise handling, multilingual support, and especially live tool use make it a strong foundation for production voice agents.
VideoSDK wraps all of that into a clean Python SDK that gets you from zero to a running agent in a handful of lines. Whether you're prototyping or building something you intend to ship, the setup here gives you a solid starting point.
Next Steps and Resources
- Check the Gemini 3.1 implementation docs
- Learn how to deploy your agents
- 👉 Share your thoughts, roadblocks, or success stories in the comments, or join our Discord community. We're excited to learn from your journey and help you build even better AI-powered communication tools!