Introducing "NAMO" Real-Time Speech AI Model: On-Device & Hybrid Cloud 📢PRESS RELEASE

Conversational AI: Building Intelligent Voice Agents with VideoSDK

Learn how to build intelligent conversational AI voice agents with VideoSDK. This developer guide includes code examples for creating AI assistants that can join meetings, analyze screens, and respond naturally.

In this guide, we'll explore how to build sophisticated conversational AI systems using VideoSDK's real-time communication infrastructure. We'll dive into practical code examples that demonstrate how to create AI agents that can join meetings, process audio streams, analyze shared screens, and respond naturally—all with surprisingly manageable code.
Conversational AI Architecture with VideoSDK
Architecture diagram: audio streams from participants are collected, transcribed by a speech-to-text service such as Deepgram (or any other STT), and the resulting text is processed and broadcast back to meeting participants.

Understanding Modern Conversational AI Architecture

Modern conversational AI systems have evolved far beyond the simple rule-based chatbots of the past. Today's systems combine multiple sophisticated components into a seamless architecture that enables natural, context-aware interactions.
At its core, conversational AI is built on technologies that enable computers to understand, process, and respond to human language naturally and intuitively. But the most advanced systems go beyond just text or speech processing—they integrate with real-time communication platforms, analyze visual content, and maintain context across complex interactions.
Let's break down the architecture of a modern conversational AI agent, as implemented in our example using VideoSDK:
class AIAgent:
    """
    An AI agent that can join video meetings, process audio from participants,
    and analyze shared screens using OpenAI and Gemini APIs.
    """
    def __init__(self, meeting_id: str, authToken: str, name: str):
        # Event loop shared by the audio track and the AI integrations
        self.loop = asyncio.get_event_loop()

        # Set up custom audio track for the agent to speak
        self.audio_track = CustomAudioStreamTrack(
            loop=self.loop,
            handle_interruption=True
        )

        # Configure meeting settings
        self.meeting_config = MeetingConfig(
            name=name,
            meeting_id=meeting_id,
            token=authToken,
            mic_enabled=True,
            webcam_enabled=False,
            custom_microphone_audio_track=self.audio_track,
        )

        # Initialize models for intelligence
        self.vision_model = genai.GenerativeModel('gemini-1.5-flash')
        self.intelligence = OpenAIIntelligence(
            loop=self.loop,
            api_key=openai_api_key,
            base_url="api.openai.com",
            audio_track=self.audio_track,
        )
This code snippet demonstrates the foundation of our agent: a custom audio track, meeting configuration via VideoSDK, and multiple AI models working together to provide intelligence.

How Advanced Conversational AI Works with VideoSDK

Building an AI agent that can join meetings, listen to participants, analyze shared screens, and respond naturally requires orchestrating several technologies. Let's break down the key components as implemented in our VideoSDK-based solution:

1. Real-time Communication Infrastructure

At the foundation is VideoSDK, which provides the real-time communication infrastructure. This enables our AI agent to:
  • Join virtual meetings alongside human participants
  • Access audio and video streams
  • Process and send audio in real-time
Here's how our agent joins a meeting:
async def join(self):
    """Join the meeting asynchronously."""
    await self.agent.async_join()

# In main.py - how we expose this as an API endpoint
@app.post("/join-player")
async def join_player(req: MeetingReqConfig, bg_tasks: BackgroundTasks):
    bg_tasks.add_task(server_operations, req)
    return {"message": "AI agent joined"}

2. Audio Processing Pipeline

For the agent to understand what humans are saying, we need an audio processing pipeline:
async def add_audio_listener(self, stream: Stream, peer_name: str):
    """Process audio from a participant and send to OpenAI for transcription."""
    while True:
        # Get audio frame from VideoSDK stream
        frame = await stream.track.recv()
        audio_data = frame.to_ndarray()[0]

        # Convert to float and process for optimal quality
        audio_data_float = (audio_data.astype(np.float32) / np.iinfo(np.int16).max)
        audio_mono = librosa.to_mono(audio_data_float.T)
        audio_resampled = librosa.resample(audio_mono, orig_sr=48000, target_sr=16000)

        # Convert to PCM for OpenAI
        pcm_frame = ((audio_resampled * np.iinfo(np.int16).max).astype(np.int16).tobytes())

        # Send to OpenAI for processing
        await self.intelligence.send_audio_data(pcm_frame)
This code captures audio from VideoSDK streams, processes it to ensure quality, and sends it to OpenAI for transcription and understanding.

3. Screen Analysis Capabilities

A unique feature of our agent is the ability to analyze shared screens, giving it visual context:
async def handle_function_call(self, function_call):
    """Handle screen analysis requests from the LLM."""
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert frame to image
        image_data = self.latest_frame.to_ndarray()
        image = Image.fromarray(image_data)

        # Request analysis from Gemini
        response = await self.loop.run_in_executor(
            None,
            lambda: self.vision_model.generate_content([
                "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
                image
            ])
        )
        return response.text
This function enables the AI to "see" what's on a shared screen and provide contextual help based on visual content.
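The handler reads from self.latest_frame, which is kept current by a screen-share listener (add_screenshare_listener, referenced later in this guide). Its body isn't shown in the original snippets, so here is a minimal sketch under the assumption that screen-share streams expose the same track.recv() interface as audio streams; the error handling is illustrative only:

async def add_screenshare_listener(self, stream: Stream, peer_name: str):
    """Keep the most recent screen-share frame available for analysis (sketch)."""
    print("Screen share started by:", peer_name)
    while True:
        try:
            # Only the latest frame is retained; analyze_screen reads it on demand
            self.latest_frame = await stream.track.recv()
        except Exception as ex:
            print(f"[INFO] Screen share stream ended for {peer_name}: {ex}")
            break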

4. Natural Language Intelligence

The core intelligence comes from connecting with large language models like OpenAI's GPT:
# Define tool for screen analysis
screen_tool = {
    "type": "function",
    "name": "analyze_screen",
    "description": "Analyze current screen content when user asks for help with visible elements",
    "parameters": {"type": "object", "properties": {}}
}

# Initialize OpenAI intelligence
self.intelligence = OpenAIIntelligence(
    loop=self.loop,
    api_key=openai_api_key,
    base_url="api.openai.com",
    input_audio_transcription=InputAudioTranscription(model="whisper-1"),
    tools=[screen_tool],
    audio_track=self.audio_track,
    handle_function_call=self.handle_function_call,
)
This code configures the OpenAI integration, making the AI contextually aware and capable of natural conversation.

5. Voice Output Generation

For natural responses, we need a custom audio track to handle speech synthesis:
class CustomAudioStreamTrack(CustomAudioTrack):
    """Custom audio track for streaming synthesized speech in a video meeting."""

    async def add_new_bytes(self, audio_data_stream: Iterator[bytes]):
        """Add new audio data to be processed and played."""
        await self._process_audio_task_queue.put(audio_data_stream)

    async def recv(self) -> AudioFrame:
        """Receive the next audio frame for streaming."""
        # Get frame from buffer or create silent frame if buffer is empty
        if len(self.frame_buffer) > 0:
            frame = self.frame_buffer.pop(0)
        else:
            # Create a silent frame when nothing to say
            frame = AudioFrame(format="s16", layout="mono", samples=self.samples)
            for p in frame.planes:
                p.update(bytes(p.buffer_size))

        # Set frame properties and return (pts and time_base come from the track's clock, elided here)
        frame.pts = pts
        frame.time_base = time_base
        frame.sample_rate = self.sample_rate
        return frame
This class manages the agent's voice, ensuring smooth, natural-sounding speech in the meeting.
All these components work together in milliseconds, creating a seamless conversational experience that feels remarkably human.

Building Real-World Applications with VideoSDK's Conversational AI

The flexibility of VideoSDK's infrastructure combined with modern AI capabilities enables a wide range of practical applications. Let's explore what you can build and how to implement specific features for each use case.

AI Meeting Assistants

One of the most powerful applications is an AI meeting assistant that can join virtual meetings to provide real-time support:
# Set up event handlers for meeting participants
def on_participant_joined(self, participant: Participant):
    """Handler for when a participant joins the meeting."""
    peer_name = participant.display_name
    print("Participant joined:", peer_name)

    # Set instructions for the AI assistant
    intelligence_instructions = """
    You are an AI meeting assistant. Follow these rules:
    1. Use analyze_screen tool when user asks about:
       - Visible UI elements
       - On-screen content
       - Application help
       - Workflow guidance
    2. Keep responses under 2 sentences
    3. Always acknowledge requests first
    """

    # Update OpenAI with instructions
    asyncio.create_task(self.intelligence.update_session_instructions(intelligence_instructions))

    # Set up event handlers for the participant
    participant.add_event_listener(
        ParticipantHandler(
            participant_id=participant.id,
            on_stream_enabled=on_stream_enabled,
            on_stream_disabled=on_stream_disabled
        )
    )
This code shows how our agent adapts its behavior when joining a meeting, configuring itself to act as a helpful assistant that can understand both verbal requests and screen content.

Visual Programming Assistants

By leveraging the screen analysis capabilities, you can create AI assistants that help developers with coding tasks:
async def handle_function_call(self, function_call):
    """Handle function calls from OpenAI, particularly for screen analysis."""
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert frame to image
        image_data = self.latest_frame.to_ndarray()
        image = Image.fromarray(image_data)

        try:
            # Request code-specific analysis from Gemini
            response = await self.loop.run_in_executor(
                None,
                lambda: self.vision_model.generate_content([
                    "Analyze this code to help the developer. Identify bugs, suggest improvements, and explain complex patterns.",
                    image
                ])
            )
            return response.text
        except Exception as e:
            return f"Analysis error: {str(e)}"
This function enables the AI to analyze code on a shared screen and provide helpful suggestions, making it an invaluable pair programming partner.

Remote Support and Troubleshooting

Another powerful application is remote technical support, where the AI can see what's on a user's screen and provide guidance:
# main.py API endpoint for tech support
@app.post("/start-support-session")
async def start_support_session(req: SupportSessionConfig, bg_tasks: BackgroundTasks):
    # Create a new meeting for the support session
    meeting_id = create_new_meeting_id()

    # Configure AI agent with tech support instructions
    support_instructions = """
    You are a technical support specialist. When users share their screen:
    1. Identify the application or system they're using
    2. Look for error messages or visual issues
    3. Provide step-by-step troubleshooting guidance
    4. Be patient and ask clarifying questions
    """

    # Start the AI agent in the background
    bg_tasks.add_task(
        start_support_agent,
        meeting_id=meeting_id,
        token=req.token,
        instructions=support_instructions
    )

    # Return meeting details for the user to join
    return {"meeting_id": meeting_id, "join_url": f"https://meet.example.com/{meeting_id}"}
This endpoint creates a dedicated support session where users can share their screens and get immediate AI-powered assistance.

Interactive Learning Environments

AI agents can also serve as tutors in educational settings:
# In the AI agent class
def configure_as_tutor(self, subject_area):
    """Configure the AI agent to act as a tutor for a specific subject."""
    tutor_instructions = f"""
    You are a patient, helpful tutor specializing in {subject_area}.
    When students share their work:
    1. Analyze their approach and identify misconceptions
    2. Provide encouraging feedback on correct steps
    3. Ask Socratic questions to guide their thinking
    4. Explain concepts clearly with examples
    5. Adapt to their learning style and pace
    """

    asyncio.create_task(self.intelligence.update_session_instructions(tutor_instructions))

    # Add specialized tools for this subject
    if subject_area == "programming":
        self.add_code_execution_tool()
    elif subject_area == "mathematics":
        self.add_equation_solver_tool()
This function configures the AI to act as a specialized tutor, with instructions and tools tailored to the subject matter.
What makes these applications particularly powerful is how they combine multiple modalities—voice, visual analysis, and natural language understanding—to create more helpful and context-aware interactions than traditional chatbots or voice assistants.

Technical Implementation: Building with VideoSDK

Let's dive deeper into the technical implementation of a VideoSDK-powered conversational AI agent, focusing on the key components you'll need to build.

Setting Up Your Project

First, you'll need to structure your project properly:
├── agent/                          # AI agent implementation
│   ├── ai_agent.py                 # Main agent class
│   └── audio_stream_track.py       # Custom audio track
├── intelligence/                   # AI intelligence components
│   └── openai/                     # OpenAI integration
│       └── openai_intelligence.py  # OpenAI client
├── rtc/                            # Real-time communication handlers
│   └── videosdk/                   # VideoSDK event handlers
├── utils/                          # Utility functions and structures
└── main.py                         # API endpoints and server setup
This structure keeps your code organized and modular, making it easier to maintain and extend.

Creating a Custom Audio Track

The CustomAudioStreamTrack class is critical for the agent's voice output:
class CustomAudioStreamTrack(CustomAudioTrack):
    """
    A custom implementation of an audio track for streaming audio in a video meeting.
    This class provides functionality to buffer audio data, process it in chunks,
    and deliver it in a continuous stream for real-time communication.
    """
    def __init__(self, loop, handle_interruption: Optional[bool] = True):
        super().__init__()
        self.loop = loop
        self.frame_buffer = []  # Buffer to hold audio frames ready to be sent
        self.audio_data_buffer = bytearray()  # Raw audio data buffer
        self.sample_rate = 24000  # Audio sample rate in Hz
        self.channels = 1  # Mono audio
        self.sample_width = 2  # 16-bit PCM
        self._process_audio_task_queue = asyncio.Queue()  # Incoming audio streams from the LLM

        # Set up audio processing thread
        self._process_audio_thread = threading.Thread(target=self.run_process_audio)
        self._process_audio_thread.daemon = True
        self._process_audio_thread.start()
This class handles the complex tasks of buffering audio data, processing it in chunks, and delivering it smoothly in a video meeting context.
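The processing thread's loop itself isn't shown in the snippets above. As a rough sketch of the idea, assuming the asyncio queue and chunk size used elsewhere in this class and a hypothetical build_frame helper that wraps raw PCM bytes in an AudioFrame:

def run_process_audio(self):
    """Drain queued audio streams and slice them into fixed-size frames (sketch)."""
    while True:
        # Block until the async side enqueues a new audio_data_stream
        future = asyncio.run_coroutine_threadsafe(
            self._process_audio_task_queue.get(), self.loop
        )
        audio_data_stream = future.result()

        for chunk in audio_data_stream:
            self.audio_data_buffer += chunk
            # Emit complete frames as soon as enough bytes are buffered
            while len(self.audio_data_buffer) >= self.chunk_size:
                frame_bytes = self.audio_data_buffer[:self.chunk_size]
                self.audio_data_buffer = self.audio_data_buffer[self.chunk_size:]
                self.frame_buffer.append(self.build_frame(frame_bytes))  # hypothetical helper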

Implementing OpenAI Integration

The OpenAIIntelligence class provides the core AI capabilities:
class OpenAIIntelligence:
    """
    A class to handle real-time communication with OpenAI's streaming API.
    """
    def __init__(
        self,
        loop: AbstractEventLoop,
        api_key,
        model: str = "gpt-4o-realtime-preview-2024-10-01",
        instructions="""\
            Actively listen to the user's questions and provide concise, relevant responses.
            Acknowledge the user's intent before answering. Keep responses under 2 sentences.\
        """,
        base_url: str = "api.openai.com",
        tools: List[Dict[str, Union[str, Any]]] = [],
        audio_track: CustomAudioStreamTrack = None,
    ):
        # Initialize configuration
        self.model = model
        self.loop = loop
        self.api_key = api_key
        self.instructions = instructions
        self.tools = tools
        self.audio_track = audio_track

        # WebSocket for real-time communication
        self.ws = None
        self.connected_event = asyncio.Event()
This class manages the connection to OpenAI's real-time API, handling both speech recognition and generation.
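The snippet above only stores configuration; the WebSocket work lives in methods such as connect and send_audio_data. A hedged sketch of those two follows, assuming the websockets package and OpenAI's realtime endpoint and message format (verify the exact URL, headers, and event names against the current OpenAI documentation):

import base64
import json
import websockets

async def connect(self):
    """Open the realtime WebSocket and signal readiness (sketch)."""
    url = f"wss://{self.base_url}/v1/realtime?model={self.model}"
    self.ws = await websockets.connect(
        url,
        extra_headers={
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1",
        },
    )
    self.connected_event.set()

async def send_audio_data(self, pcm_frame: bytes):
    """Base64-encode a PCM16 chunk and append it to the input audio buffer (sketch)."""
    await self.connected_event.wait()
    await self.ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_frame).decode("utf-8"),
    }))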

Exposing as a REST API

Finally, we expose our agent through a simple FastAPI interface:
import asyncio

from fastapi import FastAPI, BackgroundTasks
from pydantic import BaseModel

app = FastAPI()

class MeetingReqConfig(BaseModel):
    meeting_id: str
    token: str

async def server_operations(req: MeetingReqConfig):
    # Create and start the AI agent
    global ai_agent
    ai_agent = AIAgent(req.meeting_id, req.token, "AI")

    try:
        await ai_agent.join()
        while True:
            await asyncio.sleep(1)
    except Exception as ex:
        print(f"[ERROR]: either joining or running bg tasks: {ex}")
    finally:
        ai_agent.leave()

@app.post("/join-player")
async def join_player(req: MeetingReqConfig, bg_tasks: BackgroundTasks):
    bg_tasks.add_task(server_operations, req)
    return {"message": "AI agent joined"}
This API makes it easy to integrate your conversational AI agent with other applications and services.
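For a quick test, any HTTP client can call the endpoint once the FastAPI server is running. For example, with the requests library (the host, port, meeting ID, and token below are placeholders):

import requests

resp = requests.post(
    "http://localhost:8000/join-player",
    json={"meeting_id": "your-meeting-id", "token": "your-videosdk-auth-token"},
)
print(resp.json())  # expected: {"message": "AI agent joined"}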

Key VideoSDK Components

The integration with VideoSDK is handled through these key components:
# Configure meeting settings
self.meeting_config = MeetingConfig(
    name=name,
    meeting_id=meeting_id,
    token=authToken,
    mic_enabled=True,
    webcam_enabled=False,
    custom_microphone_audio_track=self.audio_track,
)

# Initialize the meeting agent
self.agent = VideoSDK.init_meeting(**self.meeting_config)

# Add event listeners for meeting events
self.agent.add_event_listener(
    MeetingHandler(
        on_meeting_joined=self.on_meeting_joined,
        on_meeting_left=self.on_meeting_left,
        on_participant_joined=self.on_participant_joined,
        on_participant_left=self.on_participant_left,
    ))
This code sets up the VideoSDK integration, allowing your AI agent to join meetings and interact with participants.

Best Practices for Conversational AI Development

Based on our experience building VideoSDK-powered conversational AI systems, here are some best practices to keep in mind:

1. Optimize Audio Processing

Audio quality is crucial for effective speech recognition. Our implementation includes careful audio processing:
# Convert to float for processing
audio_data_float = (audio_data.astype(np.float32) / np.iinfo(np.int16).max)

# Convert to mono and resample to 16kHz (required by Whisper)
audio_mono = librosa.to_mono(audio_data_float.T)
audio_resampled = librosa.resample(
    audio_mono, orig_sr=48000, target_sr=16000
)
This code ensures high-quality audio input for the best possible transcription results.

2. Use Clear System Instructions

Providing clear instructions to the AI model is essential for getting the behavior you want:
1intelligence_instructions = """
2You are an AI meeting assistant. Follow these rules:
31. Use analyze_screen tool when user asks about:
4   - Visible UI elements
5   - On-screen content
6   - Application help
7   - Workflow guidance
82. Keep responses under 2 sentences
93. Always acknowledge requests first
10"""
11
12await self.intelligence.update_session_instructions(intelligence_instructions)
13
These instructions help shape the AI's behavior, making it more predictable and appropriate for your specific use case.

3. Implement Graceful Error Handling

Real-world applications need robust error handling:
async def handle_function_call(self, function_call):
    """Handle function calls from OpenAI, particularly for screen analysis."""
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert the latest frame to an image before analysis
        image = Image.fromarray(self.latest_frame.to_ndarray())

        try:
            # Request analysis from Gemini
            response = await self.loop.run_in_executor(
                None,
                lambda: self.vision_model.generate_content([
                    "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
                    image
                ])
            )
            return response.text
        except Exception as e:
            return f"Analysis error: {str(e)}"
    return "Unknown command"
This code includes multiple safeguards, including checking for missing frames and catching exceptions during API calls.

4. Consider Latency and Responsiveness

For natural conversation, minimize latency wherever possible:
# Use small buffer sizes for more responsive audio
self.samples = int(AUDIO_PTIME * self.sample_rate)  # 20ms chunks for low latency
self.chunk_size = int(self.samples * self.channels * self.sample_width)

# Process audio in a separate thread to avoid blocking
self._process_audio_thread = threading.Thread(target=self.run_process_audio)
self._process_audio_thread.daemon = True
self._process_audio_thread.start()
These techniques help ensure that the AI responds quickly, making conversations feel more natural.

Overcoming Technical Challenges

While building conversational AI with VideoSDK offers tremendous possibilities, there are several technical challenges to address:

1. Managing Context and State

One of the biggest challenges in conversational AI is maintaining context throughout a conversation:
class OpenAIIntelligence:
    def __init__(self, ...):
        # Session parameters maintain context
        self.session_update_params = SessionUpdateParams(
            model=self.model,
            instructions=self.instructions,
            input_audio_format=AudioFormats.PCM16,
            output_audio_format=AudioFormats.PCM16,
            temperature=self.temperature,
            voice=self.voice,
            tool_choice="auto",
            tools=self.tools,
            turn_detection=self.turn_detection,
            modalities=self.modalities,
            max_response_output_tokens=self.max_response_output_tokens,
            input_audio_transcription=self.input_audio_transcription,
        )
Using a structured session object helps maintain context across multiple turns of conversation.
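Whenever the instructions change mid-conversation (as in update_session_instructions), the updated parameters have to reach the live session. A hedged sketch of how that could look, assuming OpenAI's realtime session.update message type:

import json

async def update_session_instructions(self, instructions: str):
    """Push new instructions to the active realtime session (sketch)."""
    self.session_update_params.instructions = instructions
    await self.connected_event.wait()
    await self.ws.send(json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions},
    }))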

2. Handling Interruptions

Natural conversation includes interruptions, which can be challenging to handle in AI systems:
def interrupt(self):
    """Interrupt the current audio playback."""
    length = len(self.frame_buffer)
    self.frame_buffer.clear()  # Clear all pending audio frames

    # Empty the task queue to stop processing more audio
    while not self._process_audio_task_queue.empty():
        self.skip_next_chunk = True
        self._process_audio_task_queue.get_nowait()
        self._process_audio_task_queue.task_done()
This method allows the AI to stop speaking immediately when a user starts talking, creating more natural turn-taking.
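What actually triggers interrupt is the model's turn detection: when the realtime API reports that the user has started speaking, the agent clears any queued speech. Here is a hedged sketch of that dispatch, assuming the input_audio_buffer.speech_started and response.audio.delta event names from OpenAI's realtime API:

import base64
import json

async def handle_response(self, message: str):
    """Dispatch a single event received from the realtime WebSocket (sketch)."""
    event = json.loads(message)

    if event.get("type") == "input_audio_buffer.speech_started":
        # The user started talking: stop the agent's current speech immediately
        self.audio_track.interrupt()
    elif event.get("type") == "response.audio.delta":
        # Stream synthesized audio back into the meeting as it arrives
        audio_bytes = base64.b64decode(event["delta"])
        await self.audio_track.add_new_bytes(iter([audio_bytes]))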

3. Resource Management

AI agents can be resource-intensive. Good cleanup practices are essential:
async def cleanup(self):
    """Cleanup resources when the agent is destroyed."""
    # Cancel all running tasks
    for task in self.audio_listener_tasks.values():
        if task and not task.done():
            task.cancel()
    for task in self.screenshare_listener_tasks.values():
        if task and not task.done():
            task.cancel()

    # Clear the queues
    self.frame_queue.clear()

    # Leave the meeting
    self.leave()
This ensures all resources are properly released when the agent is no longer needed.

4. Handling Multi-modal Inputs

Processing both audio and visual inputs simultaneously requires careful coordination:
def on_participant_joined(self, participant: Participant):
    peer_name = participant.display_name

    def on_stream_enabled(stream: Stream):
        """Handler for when a participant enables a stream."""
        if stream.kind == "audio":
            # Start processing audio
            self.audio_listener_tasks[stream.id] = self.loop.create_task(
                self.add_audio_listener(stream, peer_name)
            )
        elif stream.kind == "share":
            # Start processing screen share
            self.screenshare_listener_tasks[stream.id] = self.loop.create_task(
                self.add_screenshare_listener(stream, peer_name)
            )
This approach allows the agent to handle different types of input streams appropriately.
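A matching on_stream_disabled handler keeps the bookkeeping symmetric, cancelling whichever listener task was started for that stream. A sketch, reusing the task dictionaries from the cleanup example above:

def on_stream_disabled(stream: Stream):
    """Stop the listener task associated with a stream that was turned off (sketch)."""
    if stream.kind == "audio":
        task = self.audio_listener_tasks.pop(stream.id, None)
    elif stream.kind == "share":
        task = self.screenshare_listener_tasks.pop(stream.id, None)
    else:
        task = None

    if task and not task.done():
        task.cancel()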

The Future of Conversational AI with VideoSDK

As we look ahead, several emerging trends will shape the development of conversational AI on the VideoSDK platform:

Multimodal Intelligence

Future conversational AI systems will increasingly combine multiple forms of intelligence:
# Initialize the meeting agent with multiple AI models
self.vision_model = genai.GenerativeModel('gemini-1.5-flash')  # For visual understanding
self.intelligence = OpenAIIntelligence(...)  # For conversation
self.document_analyzer = DocumentAnalysisModel(...)  # For parsing documents
This multimodal approach will enable richer, more context-aware interactions.

Edge AI for Lower Latency

Moving AI processing closer to the user will reduce latency and improve responsiveness:
# Configuration for edge deployment
edge_config = {
    "use_local_whisper": True,  # Use local speech recognition
    "use_local_tts": True,      # Use local text-to-speech
    "fallback_to_cloud": True   # Fall back to cloud for complex queries
}
This hybrid approach combines the speed of edge computing with the power of cloud AI.
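Nothing in the current example implements edge inference; the configuration above is aspirational. If you did wire it up, the routing logic could be as simple as this sketch, where local_whisper_transcribe and cloud_transcribe are hypothetical helpers:

async def transcribe(pcm_frame: bytes) -> str:
    """Prefer on-device transcription, falling back to the cloud when allowed (sketch)."""
    if edge_config["use_local_whisper"]:
        try:
            return await local_whisper_transcribe(pcm_frame)  # hypothetical local model
        except Exception:
            if not edge_config["fallback_to_cloud"]:
                raise
    return await cloud_transcribe(pcm_frame)  # hypothetical cloud call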

Custom Domain Adaptation

Specialized knowledge domains will become easier to incorporate:
# Adapt the AI to a specific knowledge domain through session instructions
async def specialize_for_domain(self, domain_data, domain_name):
    """Adapt the AI to a specific knowledge domain."""
    domain_instructions = f"""
    You are specialized in {domain_name}. Use the following knowledge:
    {domain_data}
    When answering questions about {domain_name}, refer to this specialized knowledge.
    """
    await self.intelligence.update_session_instructions(domain_instructions)
This will make AI agents more valuable in specialized fields like healthcare, legal, and technical support.

Conclusion: Building the Conversational Future

Conversational AI represents one of the most transformative applications of artificial intelligence, fundamentally changing how humans interact with technology. With VideoSDK's real-time communication platform, developers now have the tools to create sophisticated AI agents that can see, hear, understand, and respond in remarkably human-like ways.
The code examples we've explored demonstrate that building these advanced systems is increasingly accessible. You don't need to be an AI researcher or have enormous computing resources—just a clear understanding of the architecture, some knowledge of modern AI APIs, and VideoSDK's powerful real-time communication infrastructure.
As you embark on your own conversational AI projects, remember that the most compelling applications combine technical sophistication with genuine utility. The best conversational agents don't just understand and respond—they solve real problems, provide valuable insights, and make technology more accessible to everyone.
The future of human-computer interaction is conversational, and with VideoSDK, you have everything you need to be at the forefront of this revolution. Whether you're building meeting assistants, customer support agents, programming aids, or educational tools, the possibilities are limited only by your imagination.
Start building your conversational AI application today, and join the community of developers who are redefining what's possible in human-computer interaction.
