Introducing "NAMO" Real-Time Speech AI Model: On-Device & Hybrid Cloud 📢PRESS RELEASE

Implementing Bidirectional Streaming With Gemini Live API and VideoSDK

A comprehensive guide to implementing bidirectional streaming using Gemini Live API and VideoSDK for building intelligent, interactive real-time applications.

In the rapidly evolving landscape of real-time applications, bidirectional streaming capabilities have become essential for creating immersive, interactive experiences. This comprehensive guide explores how to implement bidirectional streaming using Google's Gemini Live API integrated with VideoSDK, enabling you to build sophisticated AI-enhanced video applications with real-time capabilities.

Introduction

Bidirectional streaming allows for simultaneous communication between client and server, creating a seamless flow of data in both directions. When combined with AI capabilities like those offered by Google's Gemini, it opens up possibilities for creating intelligent video applications that can analyze visual content, respond to verbal queries, and interact naturally with users in real-time.
In this guide, we'll walk through creating an AI assistant that can join video meetings, process audio in real-time, analyze shared screens, and provide intelligent responses - all leveraging the power of bidirectional streaming through WebRTC technology.

Understanding Bidirectional Streaming Architecture

Before diving into implementation, it's important to understand the architecture that makes bidirectional streaming possible:
  1. Client-side components: Handle user interface, media capture, and streaming
  2. Server-side processing: Connects to AI services and manages communication
  3. AI integration: Processes audio/video streams and generates responses
  4. Real-time communication: Maintains low-latency bidirectional connections
This architecture enables applications where users and AI can interact in a natural, conversational manner with minimal latency - essential for productive video meetings or collaborative sessions.
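To make this flow concrete, here is a minimal, illustrative sketch of a full-duplex loop: one task continuously pushes captured audio up to the AI service while another receives responses and plays them back into the meeting. The names ai_session, capture_audio_frame, and play_audio are hypothetical placeholders rather than actual Gemini or VideoSDK APIs.
import asyncio

async def uplink(ai_session, capture_audio_frame):
    # Continuously push captured audio frames to the AI service
    while True:
        frame = await capture_audio_frame()
        await ai_session.send_audio(frame)

async def downlink(ai_session, play_audio):
    # Continuously receive AI responses and play them back into the meeting
    async for response in ai_session.responses():
        await play_audio(response.audio)

async def run_bidirectional(ai_session, capture_audio_frame, play_audio):
    # Run both directions concurrently for full-duplex communication
    await asyncio.gather(
        uplink(ai_session, capture_audio_frame),
        downlink(ai_session, play_audio),
    )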

Setting up the Gemini Live API

Obtaining API Keys and Authentication

To use Gemini's capabilities, you'll first need to obtain an API key from Google AI Studio. Once you have your API key, you can configure it in your environment:
import google.generativeai as genai
import os
import dotenv

# Load environment variables
dotenv.load_dotenv()

# Get API key from environment variables
gemini_api_key = os.getenv("GEMINI_API_KEY")

# Configure the Gemini client
if gemini_api_key:
    genai.configure(api_key=gemini_api_key)
    vision_model = genai.GenerativeModel('gemini-1.5-flash')
else:
    print("GEMINI_API_KEY not set. Screen share analysis will be disabled.")


Configuring the API for Bidirectional Streaming

Gemini's Live API supports various modalities like text, audio, and vision. When setting up a bidirectional stream, you need to configure which modalities you'll be using:
# Define tool for screen analysis
screen_tool = {
    "type": "function",
    "name": "analyze_screen",
    "description": "Analyze current screen content when user asks for help with visible elements",
    "parameters": {"type": "object", "properties": {}}
}

# Configure intelligence with appropriate modalities and tools
intelligence_instructions = """
You are an AI meeting assistant. Follow these rules:
1. Use analyze_screen tool when user asks about:
   - Visible UI elements
   - On-screen content
   - Application help
   - Workflow guidance
2. Keep responses under 2 sentences
3. Always acknowledge requests first
"""

# Apply the instructions to the live AI session
asyncio.create_task(intelligence.update_session_instructions(intelligence_instructions))


Integrating with VideoSDK

Setting up VideoSDK

VideoSDK provides the infrastructure for real-time video meetings and conferencing. Here's how to set up a basic meeting agent:
# Configure meeting settings
self.meeting_config = MeetingConfig(
    name=name,
    meeting_id=meeting_id,
    token=authToken,
    mic_enabled=True,  # Enable microphone for the agent
    webcam_enabled=False,  # No video feed for the agent
    custom_microphone_audio_track=self.audio_track,  # Use custom audio track
)

# Initialize the meeting agent
self.agent = VideoSDK.init_meeting(**self.meeting_config)

# Add event listeners for meeting events
self.agent.add_event_listener(
    MeetingHandler(
        on_meeting_joined=self.on_meeting_joined,
        on_meeting_left=self.on_meeting_left,
        on_participant_joined=self.on_participant_joined,
        on_participant_left=self.on_participant_left,
    ))


Handling Real-time Audio Streams

Processing audio in real-time requires capturing audio frames, converting them to the appropriate format, and sending them to the AI service:
async def add_audio_listener(self, stream: Stream, peer_name: str):
    """
    Process audio from a participant and send it to AI for transcription.
    """
    print("Participant stream enabled", peer_name)
    while True:
        try:
            await asyncio.sleep(0.01)  # Small delay to prevent CPU hogging

            # Get audio frame
            frame = await stream.track.recv()
            audio_data = frame.to_ndarray()[0]

            # Convert to float for processing
            audio_data_float = (
                audio_data.astype(np.float32) / np.iinfo(np.int16).max
            )

            # Convert to mono and resample from 48 kHz to the 16 kHz expected by the speech model
            audio_mono = librosa.to_mono(audio_data_float.T)
            audio_resampled = librosa.resample(
                audio_mono, orig_sr=48000, target_sr=16000
            )

            # Convert back to 16-bit PCM for streaming
            pcm_frame = (
                (audio_resampled * np.iinfo(np.int16).max)
                .astype(np.int16)
                .tobytes()
            )

            # Send to AI for processing
            await self.intelligence.send_audio_data(pcm_frame)

        except Exception as e:
            print("Audio processing error:", e)
            break


Implementing Bidirectional Screen Analysis

One of the most powerful features of this integration is the ability to analyze shared screens in real-time. This allows the AI assistant to provide contextual help about what the user is seeing:
async def handle_function_call(self, function_call):
    """
    Handle function calls from AI, particularly for screen analysis.
    """
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert frame to image
        image_data = self.latest_frame.to_ndarray()
        image = Image.fromarray(image_data)

        try:
            # Request analysis from Gemini
            response = await self.loop.run_in_executor(
                None,  # Use default executor
                lambda: self.vision_model.generate_content([
                    "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
                    image
                ])
            )
            return response.text
        except Exception as e:
            return f"Analysis error: {str(e)}"
    return "Unknown command"


Capturing and Processing Screen Shares

To enable screen analysis, you need to capture and process screen share streams:
async def add_screenshare_listener(self, stream: Stream, peer_name: str):
    """
    Store the latest frame from a screen share stream.
    """
    print("Participant screenshare enabled", peer_name)
    while True:
        try:
            frame = await stream.track.recv()
            self.latest_frame = frame  # Update latest frame
        except Exception as e:
            traceback.print_exc()
            print("Screenshare processing error:", e)
            break


Managing Communication Flow

The power of bidirectional streaming comes from managing the flow of information in both directions. This requires:
  1. Capturing user input (audio, video)
  2. Processing with AI
  3. Generating and delivering responses
The following function demonstrates how the AI can respond once it has processed screen content:
async def process_function_call(self, function_call):
    """
    Process a function call from the AI model and respond.
    """
    # Execute the function call and get the result
    result = await self.handle_function_call(function_call)

    print("Sending response of tool call", result)

    # Create an item with the function call output
    res = ItemCreate(item=FunctionCallOutputItemParam(
        call_id=function_call.call_id,
        output=result
    ))

    # Send the function result back to AI
    await self.send_request(res)

    # Create a response to instruct the assistant to vocalize the output
    response_instruction = ResponseCreate(
        response=ResponseCreateParams(
            modalities=["text", "audio"],  # Generate both text and audio
            instructions=f"Ask user what help is needed and provide answer in 2 lines based on following screen result - {result}",
            voice="alloy",
            output_audio_format="pcm16"
        )
    )

    # Send the instruction to the assistant
    await self.send_request(response_instruction)


Building the Client-Side Application

The client side of our bidirectional streaming application handles the user interface and media transmission. With VideoSDK, we can create a React application that manages meeting participants:
export const MeetingView: React.FC<MeetingViewProps> = ({ setMeetingId }) => {
  const {
    participants,
    localScreenShareOn,
    toggleScreenShare,
    end,
    meetingId,
    localMicOn,
    localWebcamOn,
    toggleWebcam,
    toggleMic,
  } = useMeeting();
  const { token, aiJoined, setAiJoined } = useMeetingStore();

  const inviteAI = async () => {
    try {
      const response = await fetch("http://localhost:8000/join-player", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ meeting_id: meetingId, token }),
      });

      if (!response.ok) throw new Error("Failed to invite AI");
      setAiJoined(true);
    } catch (error) {
      console.error("Error inviting AI:", error);
    }
  };

  // Render the UI with participant views and controls
  return (
    <div className="min-h-screen bg-gradient-to-br from-gray-900 to-black p-8">
      {/* Meeting layout and controls */}
    </div>
  );
};


Handling Errors and Edge Cases

In real-time applications, handling errors gracefully is critical. Here's an example of robust error handling in the audio processing pipeline:
MAX_RETRIES = 5  # Give up after this many consecutive failures

async def add_audio_listener(self, stream: Stream, peer_name: str):
    """
    Process audio from a participant and send it to AI for transcription.
    """
    print("Participant stream enabled", peer_name)
    retry_count = 0
    while True:
        try:
            # Audio processing code...
            ...

        except Exception as e:
            print("Audio processing error:", e)
            # Attempt reconnection or recovery
            await asyncio.sleep(1)  # Delay before retry
            retry_count += 1
            # If the error persists, exit the loop
            if retry_count > MAX_RETRIES:
                break


Key Considerations for Production

When moving your bidirectional streaming application to production, consider:
  1. Scalability: Use a distributed architecture to handle multiple concurrent sessions
  2. Security: Implement proper authentication and encryption for all communications
  3. Fallback mechanisms: Provide degraded functionality when connectivity issues occur (see the retry sketch after this list)
  4. User experience: Design for variations in network conditions and device capabilities
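As one example of a fallback mechanism, the sketch below wraps an async operation in a retry loop with exponential backoff and jitter. The helper name with_backoff and its limits are assumptions made for illustration; they are not part of the Gemini or VideoSDK SDKs.
import asyncio
import random

async def with_backoff(operation, max_retries=5, base_delay=0.5):
    # Retry an async operation with exponential backoff and jitter
    for attempt in range(max_retries):
        try:
            return await operation()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

# Hypothetical usage: await with_backoff(lambda: ai_session.connect())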

Conclusion

Bidirectional streaming with Gemini Live API and VideoSDK creates powerful opportunities for building intelligent, interactive applications. This approach enables real-time audio processing, screen analysis, and natural communication between users and AI assistants.
By following the implementation patterns shown in this guide, you can create applications that analyze visual content, respond to verbal queries, and interact naturally with users - all in real-time. The combination of VideoSDK's real-time communication capabilities with Gemini's advanced AI features provides a robust foundation for next-generation interactive applications.
As you build your own applications, remember that optimizing the streaming experience requires careful attention to latency, error handling, and user feedback mechanisms. With these considerations in mind, you can create seamless, responsive experiences that leverage the full potential of bidirectional AI communication.
