In the rapidly evolving landscape of real-time applications, bidirectional streaming capabilities have become essential for creating immersive, interactive experiences. This comprehensive guide explores how to implement bidirectional streaming using Google's Gemini Live API integrated with VideoSDK, enabling you to build sophisticated AI-enhanced video applications with real-time capabilities.
Introduction
Bidirectional streaming allows for simultaneous communication between client and server, creating a seamless flow of data in both directions. When combined with AI capabilities like those offered by Google's Gemini, it opens up possibilities for creating intelligent video applications that can analyze visual content, respond to verbal queries, and interact naturally with users in real-time.
In this guide, we'll walk through creating an AI assistant that can join video meetings, process audio in real-time, analyze shared screens, and provide intelligent responses - all leveraging the power of bidirectional streaming through WebRTC technology.
Understanding Bidirectional Streaming Architecture
Before diving into implementation, it's important to understand the architecture that makes bidirectional streaming possible:
- Client-side components: Handle user interface, media capture, and streaming
- Server-side processing: Connects to AI services and manages communication
- AI integration: Processes audio/video streams and generates responses
- Real-time communication: Maintains low-latency bidirectional connections
This architecture enables applications where users and AI can interact in a natural, conversational manner with minimal latency - essential for productive video meetings or collaborative sessions.
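To make those responsibilities concrete, here is a skeletal sketch of how a server-side agent can wire the pieces together. The class and method names (MeetingAgent, IntelligenceBridge, and so on) are placeholders invented for illustration rather than VideoSDK or Gemini APIs; the sections below fill in the real implementations.
import asyncio

class IntelligenceBridge:
    """Placeholder for the AI integration layer (the Gemini Live session)."""

    async def send_audio_data(self, pcm_frame: bytes) -> None:
        ...  # forward captured audio to the model

    async def receive_responses(self) -> None:
        ...  # stream model output back to the meeting

class MeetingAgent:
    """Placeholder for the server-side participant that joins the meeting."""

    def __init__(self, intelligence: IntelligenceBridge):
        self.intelligence = intelligence

    async def run(self) -> None:
        # Push captured media to the model while concurrently relaying
        # model responses back to the meeting participants.
        await asyncio.gather(
            self.capture_and_forward_media(),
            self.intelligence.receive_responses(),
        )

    async def capture_and_forward_media(self) -> None:
        ...  # audio and screen-share listeners are shown later in this guide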
Setting up the Gemini Live API
Obtaining API Keys and Authentication
To use Gemini's capabilities, you'll first need to obtain an API key from Google AI Studio. Once you have your API key, you can configure it in your environment:
import google.generativeai as genai
import os
import dotenv

# Load environment variables
dotenv.load_dotenv()

# Get API key from environment variables
gemini_api_key = os.getenv("GEMINI_API_KEY")

# Configure the Gemini client
if gemini_api_key:
    genai.configure(api_key=gemini_api_key)
    vision_model = genai.GenerativeModel('gemini-1.5-flash')
else:
    print("GEMINI_API_KEY not set. Screen share analysis will be disabled.")
Configuring the API for Bidirectional Streaming
Gemini's Live API supports various modalities like text, audio, and vision. When setting up a bidirectional stream, you need to configure which modalities you'll be using:
# Define tool for screen analysis
screen_tool = {
    "type": "function",
    "name": "analyze_screen",
    "description": "Analyze current screen content when user asks for help with visible elements",
    "parameters": {"type": "object", "properties": {}}
}

# Configure intelligence with appropriate modalities and tools
intelligence_instructions = """
You are an AI meeting assistant. Follow these rules:
1. Use analyze_screen tool when user asks about:
   - Visible UI elements
   - On-screen content
   - Application help
   - Workflow guidance
2. Keep responses under 2 sentences
3. Always acknowledge requests first
"""

# Update the AI session with the instructions
asyncio.create_task(intelligence.update_session_instructions(intelligence_instructions))
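The snippet above defines the tool and instructions but not the Live session itself. As a minimal sketch of how modalities can be declared for a bidirectional session, assuming the newer google-genai package's async Live client (a different module from the google.generativeai import used earlier; method names and the model id may vary between SDK versions):
import asyncio
from google import genai  # google-genai package, assumed available

client = genai.Client(api_key=gemini_api_key)

# Request spoken replies; ["TEXT"] would switch the session to text output
live_config = {"response_modalities": ["AUDIO"]}

async def run_live_session():
    # Model id is an assumption; use any Live-capable Gemini model you have access to
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001",
        config=live_config,
    ) as session:
        ...  # exchange audio and text with the session (see the later sections)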
Integrating with VideoSDK
Setting up VideoSDK
VideoSDK provides the infrastructure for real-time video meetings and conferencing. Here's how to set up a basic meeting agent:
# Configure meeting settings
self.meeting_config = MeetingConfig(
    name=name,
    meeting_id=meeting_id,
    token=authToken,
    mic_enabled=True,                                # Enable microphone for the agent
    webcam_enabled=False,                            # No video feed for the agent
    custom_microphone_audio_track=self.audio_track,  # Use custom audio track
)

# Initialize the meeting agent
self.agent = VideoSDK.init_meeting(**self.meeting_config)

# Add event listeners for meeting events
self.agent.add_event_listener(
    MeetingHandler(
        on_meeting_joined=self.on_meeting_joined,
        on_meeting_left=self.on_meeting_left,
        on_participant_joined=self.on_participant_joined,
        on_participant_left=self.on_participant_left,
    ))
Handling Real-time Audio Streams
Processing audio in real-time requires capturing audio frames, converting them to the appropriate format, and sending them to the AI service:
async def add_audio_listener(self, stream: Stream, peer_name: str):
    """
    Process audio from a participant and send it to AI for transcription.
    """
    print("Participant stream enabled", peer_name)
    while True:
        try:
            await asyncio.sleep(0.01)  # Small delay to prevent CPU hogging

            # Get audio frame
            frame = await stream.track.recv()
            audio_data = frame.to_ndarray()[0]

            # Convert to float for processing
            audio_data_float = (
                audio_data.astype(np.float32) / np.iinfo(np.int16).max
            )

            # Convert to mono and resample to 16 kHz (expected by the AI service)
            audio_mono = librosa.to_mono(audio_data_float.T)
            audio_resampled = librosa.resample(
                audio_mono, orig_sr=48000, target_sr=16000
            )

            # Convert back to 16-bit PCM bytes for streaming
            pcm_frame = (
                (audio_resampled * np.iinfo(np.int16).max)
                .astype(np.int16)
                .tobytes()
            )

            # Send to AI for processing
            await self.intelligence.send_audio_data(pcm_frame)

        except Exception as e:
            print("Audio processing error:", e)
            break
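This listener has to be attached when a participant actually enables a stream. The exact hooks depend on the VideoSDK agent API, so the sketch below is an assumption: a stream-enabled callback registered from on_participant_joined, with the handler class and the stream kind values treated as illustrative placeholders.
def on_participant_joined(self, participant):
    """Attach media listeners to a newly joined participant (hook names assumed)."""

    def on_stream_enabled(stream):
        # Route each media kind to the matching listener coroutine
        if stream.kind == "audio":
            asyncio.create_task(
                self.add_audio_listener(stream, participant.display_name)
            )
        elif stream.kind == "share":  # screen share; kind value is an assumption
            asyncio.create_task(
                self.add_screenshare_listener(stream, participant.display_name)
            )

    # Registration point is assumed; check the VideoSDK docs for the exact handler class
    participant.add_event_listener(
        ParticipantHandler(on_stream_enabled=on_stream_enabled)
    )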
Implementing Bidirectional Screen Analysis
One of the most powerful features of this integration is the ability to analyze shared screens in real-time. This allows the AI assistant to provide contextual help about what the user is seeing:
async def handle_function_call(self, function_call):
    """
    Handle function calls from AI, particularly for screen analysis.
    """
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert frame to image
        image_data = self.latest_frame.to_ndarray()
        image = Image.fromarray(image_data)

        try:
            # Request analysis from Gemini
            response = await self.loop.run_in_executor(
                None,  # Use default executor
                lambda: self.vision_model.generate_content([
                    "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
                    image
                ])
            )
            return response.text
        except Exception as e:
            return f"Analysis error: {str(e)}"
    return "Unknown command"
Capturing and Processing Screen Shares
To enable screen analysis, you need to capture and process screen share streams:
async def add_screenshare_listener(self, stream: Stream, peer_name: str):
    """
    Store the latest frame from a screen share stream.
    """
    print("Participant screenshare enabled", peer_name)
    while True:
        try:
            frame = await stream.track.recv()
            self.latest_frame = frame  # Update latest frame
        except Exception as e:
            traceback.print_exc()
            print("Screenshare processing error:", e)
            break
Managing Communication Flow
The power of bidirectional streaming comes from managing the flow of information in both directions. This requires:
- Capturing user input (audio, video)
- Processing with AI
- Generating and delivering responses
The following function demonstrates how the AI can respond once it has processed screen content:
async def process_function_call(self, function_call):
    """
    Process a function call from the AI model and respond.
    """
    # Execute the function call and get the result
    result = await self.handle_function_call(function_call)

    print("Sending response of tool call", result)

    # Create an item with the function call output
    res = ItemCreate(item=FunctionCallOutputItemParam(
        call_id=function_call.call_id,
        output=result
    ))

    # Send the function result back to AI
    await self.send_request(res)

    # Create a response to instruct the assistant to vocalize the output
    response_instruction = ResponseCreate(
        response=ResponseCreateParams(
            modalities=["text", "audio"],  # Generate both text and audio
            instructions=f"Ask user what help is needed and provide answer in 2 lines based on following screen result - {result}",
            voice="alloy",
            output_audio_format="pcm16"
        )
    )

    # Send the instruction to the assistant
    await self.send_request(response_instruction)
Building the Client-Side Application
The client side of our bidirectional streaming application handles the user interface and media transmission. With VideoSDK, we can create a React application that manages meeting participants:
export const MeetingView: React.FC<MeetingViewProps> = ({ setMeetingId }) => {
  const {
    participants,
    localScreenShareOn,
    toggleScreenShare,
    end,
    meetingId,
    localMicOn,
    localWebcamOn,
    toggleWebcam,
    toggleMic,
  } = useMeeting();
  const { token, aiJoined, setAiJoined } = useMeetingStore();

  const inviteAI = async () => {
    try {
      const response = await fetch("http://localhost:8000/join-player", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ meeting_id: meetingId, token }),
      });

      if (!response.ok) throw new Error("Failed to invite AI");
      setAiJoined(true);
    } catch (error) {
      console.error("Error inviting AI:", error);
    }
  };

  // Render the UI with participant views and controls
  return (
    <div className="min-h-screen bg-gradient-to-br from-gray-900 to-black p-8">
      {/* Meeting layout and controls */}
    </div>
  );
};
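The inviteAI handler posts to a /join-player endpoint that is not shown above. As a rough sketch of what that backend route could look like with FastAPI, assuming an AIAgent wrapper around the meeting agent from the earlier sections (the class name and its join coroutine are stand-ins, not VideoSDK APIs):
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class JoinRequest(BaseModel):
    meeting_id: str
    token: str

@app.post("/join-player")
async def join_player(req: JoinRequest):
    # AIAgent is a placeholder for the meeting agent assembled earlier in this guide
    agent = AIAgent(meeting_id=req.meeting_id, token=req.token, name="AI Assistant")
    await agent.join()  # assumed coroutine that connects the agent to the meeting
    return {"status": "joined"}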
Handling Errors and Edge Cases
In real-time applications, handling errors gracefully is critical. Here's an example of robust error handling in the audio processing pipeline:
MAX_RETRIES = 5  # Give up after this many consecutive failures

async def add_audio_listener(self, stream: Stream, peer_name: str):
    """
    Process audio from a participant and send it to AI for transcription.
    """
    print("Participant stream enabled", peer_name)
    retry_count = 0
    while True:
        try:
            # Audio processing code from the previous section...
            ...
            retry_count = 0  # Reset after a successful iteration

        except Exception as e:
            print("Audio processing error:", e)
            # Attempt reconnection or recovery
            await asyncio.sleep(1)  # Delay before retry
            retry_count += 1
            # If the error persists, exit the loop
            if retry_count > MAX_RETRIES:
                break
Key Considerations for Production
When moving your bidirectional streaming application to production, consider:
- Scalability: Use a distributed architecture to handle multiple concurrent sessions
- Security: Implement proper authentication and encryption for all communications
- Fallback mechanisms: Provide degraded functionality when connectivity issues occur (a sketch follows this list)
- User experience: Design for variations in network conditions and device capabilities
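As one concrete fallback pattern, the sketch below wraps an analysis call with retries and exponential backoff, degrading to a stub reply if the AI service stays unreachable. The helper name and retry parameters are illustrative, not part of VideoSDK or Gemini.
import asyncio

async def analyze_with_fallback(analyze, *, max_attempts=3, base_delay=0.5):
    """Run an async analysis callable with retries, returning a stub reply on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await analyze()
        except Exception as exc:
            if attempt == max_attempts:
                # Degraded mode: keep the meeting usable even without AI analysis
                print("Analysis unavailable after retries:", exc)
                return "Screen analysis is temporarily unavailable."
            # Exponential backoff before the next attempt
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
Inside the agent, handle_function_call can route its Gemini request through a helper like this so a transient outage produces a graceful message instead of an unhandled exception.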
Conclusion
Bidirectional streaming with Gemini Live API and VideoSDK creates powerful opportunities for building intelligent, interactive applications. This approach enables real-time audio processing, screen analysis, and natural communication between users and AI assistants.
By following the implementation patterns shown in this guide, you can create applications that analyze visual content, respond to verbal queries, and interact naturally with users - all in real-time. The combination of VideoSDK's real-time communication capabilities with Gemini's advanced AI features provides a robust foundation for next-generation interactive applications.
As you build your own applications, remember that optimizing the streaming experience requires careful attention to latency, error handling, and user feedback mechanisms. With these considerations in mind, you can create seamless, responsive experiences that leverage the full potential of bidirectional AI communication.