Building an AI Meeting Assistant with Gemini Vision API and VideoSDK: Analyzing Screen Content in Real-time
Discover how to build a smart AI meeting assistant that can analyze shared screens using Gemini Vision API and VideoSDK. This comprehensive guide walks through the implementation of a system that combines real-time communication, audio processing, and visual analysis to provide contextual help during meetings.
Understanding Gemini Vision API
Google's Gemini Vision API pairs multimodal visual understanding with conversational AI and voice agent technology, allowing for more natural interactions with visual content. Unlike traditional vision APIs that might just identify objects in images, Gemini can do the following (a minimal usage example appears after this list):
- Understand the context of visual information
- Read and interpret text within images and screenshots
- Analyze diagrams, charts, and technical content
- Provide insights about UI elements and application interfaces
- Connect visual elements with natural language understanding
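As a quick illustration, here is a minimal, standalone sketch of sending a screenshot to Gemini with the google-generativeai Python SDK. The file name screenshot.png and the prompt text are placeholders; the model name matches the one used later in this guide.

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_GEMINI_API_KEY")  # replace with your key
model = genai.GenerativeModel("gemini-1.5-flash")

# Load any screenshot and ask Gemini to describe what is on screen
image = Image.open("screenshot.png")
response = model.generate_content([
    "Describe the UI elements, text, and code visible in this screenshot.",
    image,
])
print(response.text)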
You may already use speech-focused services such as Deepgram for transcription, but Gemini offers enhanced visual analysis capabilities.
Project Overview: VideoSDK Gemini Vision Agent
This project combines VideoSDK's real-time communication technology with OpenAI's conversational AI and the Gemini Vision API to create an intelligent meeting assistant. The approach builds on concepts similar to building AI interview assistants, but extends the functionality to include real-time screen analysis. Our assistant can:
- Join video meetings as a participant
- Listen to participants' questions using OpenAI's Whisper for transcription
- Capture and analyze shared screens in real-time using Gemini Vision API
- Provide contextual assistance based on both speech and visual content
- Respond naturally using text-to-speech
The result is intelligent virtual assistance that can understand both what users say and what they see.
Setting Up Your Development Environment
Prerequisites
You'll need the following before you start:
- Python for the server-side components (VideoSDK offers excellent Python support for WebRTC applications)
- Node.js/npm for the client application
- API keys for:
  - VideoSDK (for video conferencing)
  - OpenAI (for speech transcription and conversation)
  - Google Gemini (for vision analysis)
If you're new to real-time communication, a comprehensive WebRTC tutorial is helpful for understanding the fundamental concepts.
Configuration Setup
git clone https://github.com/videosdk-community/videosdk-gemini-vision-agent.git
cd videosdk-gemini-vision-agent

# Server environment setup
python -m venv .venv
source .venv/bin/activate  # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env

# Add your API keys to .env
# OPENAI_API_KEY=your_openai_key_here
# GEMINI_API_KEY=your_gemini_api_key

# Client setup
cd client
cp .env.example .env
# Add VITE_VIDEOSDK_TOKEN=your_videosdk_auth_token_here
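On the server side, the keys in .env are typically loaded at startup. Below is a minimal sketch using python-dotenv; the variable names match the comments above, but the exact loading code in the repository may differ.

import os
from dotenv import load_dotenv

load_dotenv()  # read .env from the project root

openai_api_key = os.getenv("OPENAI_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")

if not openai_api_key or not gemini_api_key:
    raise RuntimeError("Missing OPENAI_API_KEY or GEMINI_API_KEY in .env")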
Understanding how API communication works with these services will help you troubleshoot any integration issues that might arise.
The Core Architecture: How It Works
The assistant is built from five main components (a sketch of how a client request wires them together follows this list):
- VideoSDK Integration: For joining meetings and accessing audio/video streams
- Audio Processing Pipeline: To capture, process, and transcribe speech
- Screen Capture System: To grab frames from shared screens
- Gemini Vision Analysis: To understand screen content
- OpenAI Conversation: To manage the dialogue and generate responses
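As a rough sketch of how these pieces come together, a client request can ask the server to spin up an agent for a given meeting. The route name, request model, agent module, and the join() coroutine below are illustrative assumptions rather than the repository's actual code:

import asyncio
from fastapi import FastAPI
from pydantic import BaseModel

from agent import AIAgent  # assumed module; AIAgent is described in the next section

app = FastAPI()

class JoinRequest(BaseModel):
    meeting_id: str
    token: str

@app.post("/join-agent")
async def join_agent(req: JoinRequest):
    # Create the agent and let it join the meeting without blocking the response
    agent = AIAgent(meeting_id=req.meeting_id, authToken=req.token, name="AI Meeting Assistant")
    asyncio.create_task(agent.join())  # join() is assumed to connect via VideoSDK
    return {"status": "agent joining"}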
The AI Agent Class
At the heart of the system is the AIAgent class, which orchestrates the entire flow:
class AIAgent:
    def __init__(self, meeting_id: str, authToken: str, name: str):
        # Initialize components
        self.loop = asyncio.get_event_loop()
        self.audio_track = CustomAudioStreamTrack(
            loop=self.loop,
            handle_interruption=True
        )

        # Configure the meeting
        self.meeting_config = MeetingConfig(
            name=name,
            meeting_id=meeting_id,
            token=authToken,
            mic_enabled=True,
            webcam_enabled=False,
            custom_microphone_audio_track=self.audio_track,
        )

        # Initialize Gemini vision model
        if gemini_api_key:
            genai.configure(api_key=gemini_api_key)
            self.vision_model = genai.GenerativeModel('gemini-1.5-flash')

        # Initialize OpenAI for transcription and conversation
        self.intelligence = OpenAIIntelligence(
            loop=self.loop,
            api_key=openai_api_key,
            input_audio_transcription=InputAudioTranscription(model="whisper-1"),
            tools=[screen_tool],
            audio_track=self.audio_track,
            handle_function_call=self.handle_function_call,
        )

        # Storage for the latest screen frame
        self.frame_queue = deque(maxlen=1)
        self.latest_frame = None
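The tools=[screen_tool] argument refers to a function-tool definition that tells OpenAI when it may request a screen analysis. Its exact shape in the repository isn't shown here, but a plausible definition following the Realtime API's function-tool format looks like this:

# Assumed tool definition: lets the model request a screen analysis by name
screen_tool = {
    "type": "function",
    "name": "analyze_screen",
    "description": (
        "Analyze the participant's currently shared screen and describe "
        "relevant UI elements, text, code, and context."
    ),
    "parameters": {
        "type": "object",
        "properties": {},
        "required": [],
    },
}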
Real-time Screen Analysis with Gemini Vision API
When OpenAI determines that a question concerns the shared screen, it calls the analyze_screen tool and the agent forwards the most recent captured frame to Gemini:
async def handle_function_call(self, function_call):
    """Process function calls from OpenAI, particularly for screen analysis."""
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert frame to image
        image_data = self.latest_frame.to_ndarray()
        image = Image.fromarray(image_data)

        try:
            # Request analysis from Gemini
            response = await self.loop.run_in_executor(
                None,
                lambda: self.vision_model.generate_content([
                    "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
                    image
                ])
            )
            return response.text
        except Exception as e:
            return f"Analysis error: {str(e)}"
    return "Unknown command"
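Note that generate_content is a blocking call, which is why it is wrapped in run_in_executor so the event loop keeps serving audio. If the version of the google-generativeai SDK you install exposes an async variant, the try block above could be replaced with a sketch like this (treat the availability of generate_content_async as an assumption to verify against your SDK version):

try:
    # Await the SDK's async call directly instead of using run_in_executor
    response = await self.vision_model.generate_content_async([
        "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
        image,
    ])
    return response.text
except Exception as e:
    return f"Analysis error: {str(e)}"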
Capturing Screen Shares
Whenever a participant shares their screen, a listener task keeps the latest frame available for analysis:
async def add_screenshare_listener(self, stream: Stream, peer_name: str):
    """Store the latest frame from a screen share stream."""
    print("Participant screenshare enabled", peer_name)
    while True:
        try:
            frame = await stream.track.recv()
            self.latest_frame = frame  # Update latest frame
        except Exception as e:
            traceback.print_exc()
            print("Screenshare processing error:", e)
            break
Processing Audio with OpenAI
Participant audio arrives at 48 kHz; the agent converts it to 16 kHz mono PCM and streams it to OpenAI for transcription:
async def add_audio_listener(self, stream: Stream, peer_name: str):
    """Process audio from a participant and send it to OpenAI for transcription."""
    while True:
        try:
            # Get audio frame
            frame = await stream.track.recv()
            audio_data = frame.to_ndarray()[0]

            # Process audio (convert format, resample)
            audio_data_float = (
                audio_data.astype(np.float32) / np.iinfo(np.int16).max
            )
            audio_mono = librosa.to_mono(audio_data_float.T)
            audio_resampled = librosa.resample(
                audio_mono, orig_sr=48000, target_sr=16000
            )

            # Convert to PCM and send to OpenAI
            pcm_frame = (
                (audio_resampled * np.iinfo(np.int16).max)
                .astype(np.int16)
                .tobytes()
            )

            await self.intelligence.send_audio_data(pcm_frame)
        except Exception as e:
            print("Audio processing error:", e)
            break
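If you want to sanity-check the resampling step in isolation, this small standalone sketch reproduces the same 48 kHz int16 to 16 kHz PCM conversion on synthetic audio (numpy and librosa are the only dependencies):

import numpy as np
import librosa

# One second of a 440 Hz tone at 48 kHz, int16, mono
sr_in, sr_out = 48000, 16000
t = np.linspace(0, 1, sr_in, endpoint=False)
audio_int16 = (0.5 * np.sin(2 * np.pi * 440 * t) * np.iinfo(np.int16).max).astype(np.int16)

# Same normalization and resampling as the audio listener above
audio_float = audio_int16.astype(np.float32) / np.iinfo(np.int16).max
audio_resampled = librosa.resample(audio_float, orig_sr=sr_in, target_sr=sr_out)
pcm_frame = (audio_resampled * np.iinfo(np.int16).max).astype(np.int16).tobytes()

print(len(pcm_frame))  # 16000 samples * 2 bytes = 32000 bytes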
Handling Participant Events
When a participant joins, the agent registers stream handlers and gives the assistant its operating instructions:
def on_participant_joined(self, participant: Participant):
    """Setup listeners for participant audio and screen shares."""
    peer_name = participant.display_name
    print("Participant joined:", peer_name)

    # Set instructions for the AI assistant
    intelligence_instructions = """
    You are an AI meeting assistant. Follow these rules:
    1. Use analyze_screen tool when user asks about:
       - Visible UI elements
       - On-screen content
       - Application help
       - Workflow guidance
    2. Keep responses under 2 sentences
    3. Always acknowledge requests first
    """

    # Update OpenAI with instructions
    asyncio.create_task(
        self.intelligence.update_session_instructions(intelligence_instructions)
    )

    # Setup stream handlers
    participant.add_event_listener(
        ParticipantHandler(
            participant_id=participant.id,
            on_stream_enabled=on_stream_enabled,
            on_stream_disabled=on_stream_disabled
        )
    )
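The on_stream_enabled and on_stream_disabled callbacks referenced above are defined inside on_participant_joined in the repository; they route each new stream to the right listener. A simplified sketch of what on_stream_enabled might look like is shown below; the stream.kind values are an assumption based on typical VideoSDK stream kinds:

def on_stream_enabled(stream: Stream):
    # Route each new stream to the matching listener coroutine
    if stream.kind == "audio":
        self.loop.create_task(self.add_audio_listener(stream, peer_name))
    elif stream.kind == "share":
        self.loop.create_task(self.add_screenshare_listener(stream, peer_name))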
Integrating OpenAI's Real-time APIs
The OpenAIIntelligence class manages this integration:
class OpenAIIntelligence:
    """Manages real-time communication with OpenAI's streaming API."""

    async def connect(self):
        """Establish WebSocket connection to OpenAI's real-time API."""
        url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

        self.ws = await self._http_session.ws_connect(
            url=url,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "OpenAI-Beta": "realtime=v1",
            },
        )

        # Start message handling and update session
        self.receive_message_task = self.loop.create_task(
            self.receive_message_handler()
        )
        await self.update_session(self.session_update_params)
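The send_audio_data method used by the audio listener pushes PCM chunks over this WebSocket. A minimal sketch, assuming the Realtime API's input_audio_buffer.append event with base64-encoded audio and an aiohttp WebSocket, could look like this:

import base64
import json

async def send_audio_data(self, pcm_frame: bytes):
    """Stream a chunk of 16-bit mono PCM to the Realtime API."""
    event = {
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_frame).decode("utf-8"),
    }
    await self.ws.send_str(json.dumps(event))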
Custom Audio Output with the AudioStreamTrack
To speak its responses back into the meeting, the agent streams synthesized speech through a custom audio track:
class CustomAudioStreamTrack(CustomAudioTrack):
    """Custom audio track for streaming AI-generated speech."""

    async def add_new_bytes(self, audio_data_stream: Iterator[bytes]):
        """Add new audio data to be processed and played."""
        await self._process_audio_task_queue.put(audio_data_stream)

    async def recv(self) -> AudioFrame:
        """Provide the next audio frame for streaming."""
        # Complex audio buffering and timing logic...
        if len(self.frame_buffer) > 0:
            frame = self.frame_buffer.pop(0)
        else:
            # Create silent frame if buffer is empty
            frame = AudioFrame(format="s16", layout="mono", samples=self.samples)
            for p in frame.planes:
                p.update(bytes(p.buffer_size))

        # Set frame timing properties (pts and time_base come from the
        # buffering/timing logic elided above)
        frame.pts = pts
        frame.time_base = time_base
        frame.sample_rate = self.sample_rate
        return frame
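The elided timing logic boils down to stamping each frame with a monotonically increasing presentation timestamp. Here is a minimal sketch of how pts and time_base are typically derived; the 16 kHz rate and 20 ms frame size are assumptions, not the repository's exact values:

from fractions import Fraction

sample_rate = 16000                          # assumed output sample rate
samples_per_frame = int(0.02 * sample_rate)  # 20 ms of audio per frame
time_base = Fraction(1, sample_rate)         # one tick per audio sample

# Each frame's pts advances by the number of samples it contains
pts = 0
for frame_index in range(3):
    print(f"frame {frame_index}: pts={pts}, time={float(pts * time_base):.3f}s")
    pts += samples_per_frame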
How Gemini Vision Enhances Meeting Intelligence
- Contextual Understanding: When a participant asks for help with something on screen, Gemini can analyze the current content and provide relevant assistance.
- Technical Content Analysis: It can interpret code, diagrams, charts, and technical documents being shared.
- UI Navigation Help: It can guide users through unfamiliar interfaces by identifying UI elements and explaining their functions.
- Error Identification: It can spot and explain errors in code or applications being demonstrated.
- Documentation Support: It can help participants understand complex documentation or presentations.
Deploying and Running the Application
With your API keys configured, start the server and the client in separate terminals:
# Start the server (from project root)
uvicorn app:app

# Start the client (from /client folder)
npm run dev
If you'd rather build your own frontend, you can learn how to make a video chat app or consider different frameworks like Angular for video chat applications. Once both processes are running, the assistant joins the meeting and:
- Captures audio and transcribes it using OpenAI's Whisper
- When questions are asked, it processes them using GPT-4o
- If the question relates to screen content, it triggers the Gemini Vision API
- Gemini analyzes the current screen and returns detailed information
- The assistant formulates a helpful response using both the transcript and visual analysis
- The response is spoken back using text-to-speech
Performance Considerations
- Selective Analysis: Only analyze screens when explicitly requested, rather than continuously
- Frame Rate Control: Analyze at a lower frequency than the screen share frame rate
- Resolution Adjustment: Resize frames before analysis to reduce processing requirements (see the sketch after this list)
- Caching: Cache analysis results for static content to avoid repeated processing
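Here is a small sketch of how frame-rate control and resolution adjustment could be applied before handing a frame to Gemini. The two-second interval and the 1280-pixel width cap are illustrative values, not tuned recommendations:

import time
from PIL import Image

_last_analysis = 0.0
MIN_INTERVAL_S = 2.0   # analyze at most once every two seconds
MAX_WIDTH = 1280       # downscale wide screens before upload

def prepare_frame_for_analysis(frame):
    """Throttle and downscale a captured frame; return None if it's too soon."""
    global _last_analysis
    now = time.monotonic()
    if now - _last_analysis < MIN_INTERVAL_S:
        return None
    _last_analysis = now

    image = Image.fromarray(frame.to_ndarray())
    if image.width > MAX_WIDTH:
        scale = MAX_WIDTH / image.width
        image = image.resize((MAX_WIDTH, int(image.height * scale)))
    return image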
Privacy and Security Implications
- Data Processing: Ensure users are aware that screen content is being sent to external APIs
- Sensitive Information: Provide mechanisms to pause analysis when sensitive information is on screen
- API Key Security: Securely manage API keys and credentials
- Compliance: Consider regulatory requirements for data processing in your jurisdiction
Future Directions and Alternatives
Comparing WebRTC vs. gRPC or evaluating newer technologies like WebTransport vs. WebRTC could inform your architecture decisions for specific use cases. You could also explore Unity WebRTC integration or consider how this approach could be adapted for interactive live streaming scenarios.
Conclusion
Combining VideoSDK's real-time infrastructure with OpenAI's conversational models and Gemini's visual understanding gives you a meeting assistant that can reason about both speech and screens. The same pattern extends naturally to HIPAA-compliant video conferencing solutions, specialized applications for sectors like fitness, or products built around real-time messaging with integrated video.