Introducing "NAMO" Real-Time Speech AI Model: On-Device & Hybrid Cloud 📢PRESS RELEASE

Implementing Bidirectional Streaming With Gemini Live API and VideoSDK

A comprehensive guide to implementing bidirectional streaming using Gemini Live API and VideoSDK for building intelligent, interactive real-time applications.

In the rapidly evolving landscape of real-time applications, bidirectional streaming capabilities have become essential for creating immersive, interactive experiences. This comprehensive guide explores how to implement bidirectional streaming using Google's Gemini Live API integrated with VideoSDK, enabling you to build sophisticated AI-enhanced video applications with real-time capabilities.

Introduction

Bidirectional streaming allows for simultaneous communication between client and server, creating a seamless flow of data in both directions. When combined with AI capabilities like those offered by Google's Gemini, it opens up possibilities for creating intelligent video applications that can analyze visual content, respond to verbal queries, and interact naturally with users in real-time.
In this guide, we'll walk through creating an AI assistant that can join video meetings, process audio in real-time, analyze shared screens, and provide intelligent responses - all leveraging the power of bidirectional streaming through WebRTC technology.

Understanding Bidirectional Streaming Architecture

Before diving into implementation, it's important to understand the architecture that makes bidirectional streaming possible:
  1. Client-side components: Handle user interface, media capture, and streaming
  2. Server-side processing: Connects to AI services and manages communication
  3. AI integration: Processes audio/video streams and generates responses
  4. Real-time communication: Maintains low-latency bidirectional connections
This architecture enables applications where users and AI can interact in a natural, conversational manner with minimal latency - essential for productive video meetings or collaborative sessions.
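To make this flow concrete, here is a minimal, illustrative sketch of a full-duplex loop: one task continuously pushes captured audio up to the AI service while another receives responses and plays them back into the meeting. The names ai_session, capture_audio_frame, and play_audio are hypothetical placeholders rather than actual Gemini or VideoSDK APIs.
import asyncio

async def uplink(ai_session, capture_audio_frame):
    # Continuously push captured audio frames to the AI service
    while True:
        frame = await capture_audio_frame()
        await ai_session.send_audio(frame)

async def downlink(ai_session, play_audio):
    # Continuously receive AI responses and play them back into the meeting
    async for response in ai_session.responses():
        await play_audio(response.audio)

async def run_bidirectional(ai_session, capture_audio_frame, play_audio):
    # Run both directions concurrently for full-duplex communication
    await asyncio.gather(
        uplink(ai_session, capture_audio_frame),
        downlink(ai_session, play_audio),
    )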

Setting up the Gemini Live API

Obtaining API Keys and Authentication

To use Gemini's capabilities, you'll first need to obtain an API key from Google AI Studio. Once you have your API key, you can configure it in your environment:
import google.generativeai as genai
import os
import dotenv

# Load environment variables
dotenv.load_dotenv()

# Get API key from environment variables
gemini_api_key = os.getenv("GEMINI_API_KEY")

# Configure the Gemini client
if gemini_api_key:
    genai.configure(api_key=gemini_api_key)
    vision_model = genai.GenerativeModel('gemini-1.5-flash')
else:
    print("GEMINI_API_KEY not set. Screen share analysis will be disabled.")


Configuring the API for Bidirectional Streaming

Gemini's Live API supports various modalities like text, audio, and vision. When setting up a bidirectional stream, you need to configure which modalities you'll be using:
# Define tool for screen analysis
screen_tool = {
    "type": "function",
    "name": "analyze_screen",
    "description": "Analyze current screen content when user asks for help with visible elements",
    "parameters": {"type": "object", "properties": {}}
}

# Configure intelligence with appropriate modalities and tools
intelligence_instructions = """
You are an AI meeting assistant. Follow these rules:
1. Use analyze_screen tool when user asks about:
   - Visible UI elements
   - On-screen content
   - Application help
   - Workflow guidance
2. Keep responses under 2 sentences
3. Always acknowledge requests first
"""

# Apply the instructions to the live AI session
asyncio.create_task(intelligence.update_session_instructions(intelligence_instructions))


Integrating with VideoSDK

Setting up VideoSDK

VideoSDK provides the infrastructure for real-time video meetings and conferencing. Here's how to set up a basic meeting agent:
# Configure meeting settings
self.meeting_config = MeetingConfig(
    name=name,
    meeting_id=meeting_id,
    token=authToken,
    mic_enabled=True,  # Enable microphone for the agent
    webcam_enabled=False,  # No video feed for the agent
    custom_microphone_audio_track=self.audio_track,  # Use custom audio track
)

# Initialize the meeting agent
self.agent = VideoSDK.init_meeting(**self.meeting_config)

# Add event listeners for meeting events
self.agent.add_event_listener(
    MeetingHandler(
        on_meeting_joined=self.on_meeting_joined,
        on_meeting_left=self.on_meeting_left,
        on_participant_joined=self.on_participant_joined,
        on_participant_left=self.on_participant_left,
    ))


Handling Real-time Audio Streams

Processing audio in real-time requires capturing audio frames, converting them to the appropriate format, and sending them to the AI service:
async def add_audio_listener(self, stream: Stream, peer_name: str):
    """
    Process audio from a participant and send it to AI for transcription.
    """
    print("Participant stream enabled", peer_name)
    while True:
        try:
            await asyncio.sleep(0.01)  # Small delay to prevent CPU hogging

            # Get audio frame
            frame = await stream.track.recv()
            audio_data = frame.to_ndarray()[0]

            # Convert to float for processing
            audio_data_float = (
                audio_data.astype(np.float32) / np.iinfo(np.int16).max
            )

            # Convert to mono and resample from 48 kHz to the 16 kHz expected by the speech model
            audio_mono = librosa.to_mono(audio_data_float.T)
            audio_resampled = librosa.resample(
                audio_mono, orig_sr=48000, target_sr=16000
            )

            # Convert back to 16-bit PCM for streaming
            pcm_frame = (
                (audio_resampled * np.iinfo(np.int16).max)
                .astype(np.int16)
                .tobytes()
            )

            # Send to AI for processing
            await self.intelligence.send_audio_data(pcm_frame)

        except Exception as e:
            print("Audio processing error:", e)
            break


Implementing Bidirectional Screen Analysis

One of the most powerful features of this integration is the ability to analyze shared screens in real-time. This allows the AI assistant to provide contextual help about what the user is seeing:
async def handle_function_call(self, function_call):
    """
    Handle function calls from AI, particularly for screen analysis.
    """
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert frame to image
        image_data = self.latest_frame.to_ndarray()
        image = Image.fromarray(image_data)

        try:
            # Request analysis from Gemini
            response = await self.loop.run_in_executor(
                None,  # Use default executor
                lambda: self.vision_model.generate_content([
                    "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
                    image
                ])
            )
            return response.text
        except Exception as e:
            return f"Analysis error: {str(e)}"
    return "Unknown command"


Capturing and Processing Screen Shares

To enable screen analysis, you need to capture and process screen share streams:
async def add_screenshare_listener(self, stream: Stream, peer_name: str):
    """
    Store the latest frame from a screen share stream.
    """
    print("Participant screenshare enabled", peer_name)
    while True:
        try:
            frame = await stream.track.recv()
            self.latest_frame = frame  # Update latest frame
        except Exception as e:
            traceback.print_exc()
            print("Screenshare processing error:", e)
            break


Managing Communication Flow

The power of bidirectional streaming comes from managing the flow of information in both directions. This requires:
  1. Capturing user input (audio, video)
  2. Processing with AI
  3. Generating and delivering responses
The following function demonstrates how the AI can respond once it has processed screen content:
async def process_function_call(self, function_call):
    """
    Process a function call from the AI model and respond.
    """
    # Execute the function call and get the result
    result = await self.handle_function_call(function_call)

    print("Sending response of tool call", result)

    # Create an item with the function call output
    res = ItemCreate(item=FunctionCallOutputItemParam(
        call_id=function_call.call_id,
        output=result
    ))

    # Send the function result back to AI
    await self.send_request(res)

    # Create a response to instruct the assistant to vocalize the output
    response_instruction = ResponseCreate(
        response=ResponseCreateParams(
            modalities=["text", "audio"],  # Generate both text and audio
            instructions=f"Ask user what help is needed and provide answer in 2 lines based on following screen result - {result}",
            voice="alloy",
            output_audio_format="pcm16"
        )
    )

    # Send the instruction to the assistant
    await self.send_request(response_instruction)


Building the Client-Side Application

The client side of our bidirectional streaming application handles the user interface and media transmission. With VideoSDK, we can create a React application that manages meeting participants:
export const MeetingView: React.FC<MeetingViewProps> = ({ setMeetingId }) => {
  const {
    participants,
    localScreenShareOn,
    toggleScreenShare,
    end,
    meetingId,
    localMicOn,
    localWebcamOn,
    toggleWebcam,
    toggleMic,
  } = useMeeting();
  const { token, aiJoined, setAiJoined } = useMeetingStore();

  const inviteAI = async () => {
    try {
      const response = await fetch("http://localhost:8000/join-player", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ meeting_id: meetingId, token }),
      });

      if (!response.ok) throw new Error("Failed to invite AI");
      setAiJoined(true);
    } catch (error) {
      console.error("Error inviting AI:", error);
    }
  };

  // Render the UI with participant views and controls
  return (
    <div className="min-h-screen bg-gradient-to-br from-gray-900 to-black p-8">
      {/* Meeting layout and controls */}
    </div>
  );
};


Handling Errors and Edge Cases

In real-time applications, handling errors gracefully is critical. Here's an example of robust error handling in the audio processing pipeline:
MAX_RETRIES = 5  # Give up after this many consecutive failures

async def add_audio_listener(self, stream: Stream, peer_name: str):
    """
    Process audio from a participant and send it to AI for transcription.
    """
    print("Participant stream enabled", peer_name)
    retry_count = 0
    while True:
        try:
            # Audio processing code...
            ...

        except Exception as e:
            print("Audio processing error:", e)
            # Attempt reconnection or recovery
            await asyncio.sleep(1)  # Delay before retry
            retry_count += 1
            # If the error persists, exit the loop
            if retry_count > MAX_RETRIES:
                break


Key Considerations for Production

When moving your bidirectional streaming application to production, consider:
  1. Scalability: Use a distributed architecture to handle multiple concurrent sessions
  2. Security: Implement proper authentication and encryption for all communications
  3. Fallback mechanisms: Provide degraded functionality when connectivity issues occur (see the retry sketch after this list)
  4. User experience: Design for variations in network conditions and device capabilities
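As one example of a fallback mechanism, the sketch below wraps an async operation in a retry loop with exponential backoff and jitter. The helper name with_backoff and its limits are assumptions made for illustration; they are not part of the Gemini or VideoSDK SDKs.
import asyncio
import random

async def with_backoff(operation, max_retries=5, base_delay=0.5):
    # Retry an async operation with exponential backoff and jitter
    for attempt in range(max_retries):
        try:
            return await operation()
        except Exception as exc:
            if attempt == max_retries - 1:
                raise  # Give up after the final attempt
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

# Hypothetical usage: await with_backoff(lambda: ai_session.connect())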

Conclusion

Bidirectional streaming with Gemini Live API and VideoSDK creates powerful opportunities for building intelligent, interactive applications. This approach enables real-time audio processing, screen analysis, and natural communication between users and AI assistants.
By following the implementation patterns shown in this guide, you can create applications that analyze visual content, respond to verbal queries, and interact naturally with users - all in real-time. The combination of VideoSDK's real-time communication capabilities with Gemini's advanced AI features provides a robust foundation for next-generation interactive applications.
As you build your own applications, remember that optimizing the streaming experience requires careful attention to latency, error handling, and user feedback mechanisms. With these considerations in mind, you can create seamless, responsive experiences that leverage the full potential of bidirectional AI communication.
