
Building an AI Meeting Assistant with Gemini Vision API and VideoSDK: Analyzing Screen Content in Real-time

Discover how to build a smart AI meeting assistant that can analyze shared screens using Gemini Vision API and VideoSDK. This comprehensive guide walks through the implementation of a system that combines real-time communication, audio processing, and visual analysis to provide contextual help during meetings.

In today's digital-first workplace, virtual meetings have become the cornerstone of collaboration. But what if your meetings could be enhanced with an AI assistant that understands not just what's being said, but also what's being shown on screen? This is now possible by combining Google's powerful Gemini Vision API with VideoSDK's real-time communication capabilities.
In this article, we'll explore how to build an intelligent meeting assistant that can join video calls, analyze shared screens in real-time, and provide contextual assistance based on both audio and visual inputs.

Understanding Gemini Vision API

Gemini Vision API is part of Google's advanced multimodal AI models, designed to process and understand visual information with remarkable accuracy. This technology represents a significant step forward in conversational AI and voice agent technology, allowing for more natural interactions with visual content. Unlike traditional vision APIs that might just identify objects in images, Gemini can:
  • Understand the context of visual information
  • Read and interpret text within images and screenshots
  • Analyze diagrams, charts, and technical content
  • Provide insights about UI elements and application interfaces
  • Connect visual elements with natural language understanding
This makes it particularly valuable for applications like our AI meeting assistant, where understanding screen content is essential for providing relevant help. Similar approaches have been used with other voice technologies like Deepgram, but Gemini offers enhanced visual analysis capabilities.
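
Before wiring Gemini into a full meeting pipeline, it helps to see the vision call in isolation. The snippet below is a minimal sketch using the same google-generativeai client the agent relies on later in this article; the screenshot path and prompt are placeholders you would swap for your own.

import os

import google.generativeai as genai
from PIL import Image

# Configure the client with your key (placeholder environment variable)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

# "screenshot.png" is any saved screen capture you want analyzed
image = Image.open("screenshot.png")
response = model.generate_content([
    "Describe the UI elements, text, and any code visible in this screenshot.",
    image,
])
print(response.text)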

Project Overview: VideoSDK Gemini Vision Agent

Our project combines VideoSDK's real-time communication technology with OpenAI's conversational AI and Gemini Vision API to create an intelligent meeting assistant. This approach builds on concepts similar to building AI interview assistants but extends the functionality to include real-time screen analysis. Our assistant can:
  1. Join video meetings as a participant
  2. Listen to participants' questions using OpenAI's Whisper for transcription
  3. Capture and analyze shared screens in real-time using Gemini Vision API
  4. Provide contextual assistance based on both speech and visual content
  5. Respond naturally using text-to-speech
This integration enables more intelligent virtual assistance that can understand both what users say and what they see.

Setting Up Your Development Environment

Prerequisites

To build this project, you'll need:
  • Python for the server-side components (VideoSDK offers excellent Python support for WebRTC applications)
  • Node.js/npm for the client application
  • API keys for:
    • VideoSDK (for video conferencing)
    • OpenAI (for speech transcription and conversation)
    • Google Gemini (for vision analysis)
If you're new to WebRTC, you might find the comprehensive WebRTC tutorial helpful for understanding the fundamental concepts.

Configuration Setup

Clone the repository and set up environment variables:
git clone https://github.com/videosdk-community/videosdk-gemini-vision-agent.git
cd videosdk-gemini-vision-agent

# Server environment setup
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env

# Add your API keys to .env
# OPENAI_API_KEY=your_openai_key_here
# GEMINI_API_KEY=your_gemini_api_key

# Client setup
cd client
npm install
cp .env.example .env
# Add VITE_VIDEOSDK_TOKEN=your_videosdk_auth_token_here
Understanding how API communication works with these services will help you troubleshoot any integration issues that might arise.
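
The server-side snippets later in this article assume the keys from .env are available as openai_api_key and gemini_api_key. One common way to load them (a sketch, not necessarily the repository's exact code) is with python-dotenv:

import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file created above

openai_api_key = os.getenv("OPENAI_API_KEY")
gemini_api_key = os.getenv("GEMINI_API_KEY")

if not openai_api_key or not gemini_api_key:
    raise RuntimeError("Set OPENAI_API_KEY and GEMINI_API_KEY in .env")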

The Core Architecture: How It Works

The AI agent implements several key components:
  1. VideoSDK Integration: For joining meetings and accessing audio/video streams
  2. Audio Processing Pipeline: To capture, process, and transcribe speech
  3. Screen Capture System: To grab frames from shared screens
  4. Gemini Vision Analysis: To understand screen content
  5. OpenAI Conversation: To manage the dialogue and generate responses
Let's examine each component.

The AI Agent Class

The heart of our system is the AIAgent class, which orchestrates the entire flow:
class AIAgent:
    def __init__(self, meeting_id: str, authToken: str, name: str):
        # Initialize components
        self.loop = asyncio.get_event_loop()
        self.audio_track = CustomAudioStreamTrack(
            loop=self.loop,
            handle_interruption=True
        )

        # Configure the meeting
        self.meeting_config = MeetingConfig(
            name=name,
            meeting_id=meeting_id,
            token=authToken,
            mic_enabled=True,
            webcam_enabled=False,
            custom_microphone_audio_track=self.audio_track,
        )

        # Initialize Gemini vision model
        if gemini_api_key:
            genai.configure(api_key=gemini_api_key)
            self.vision_model = genai.GenerativeModel('gemini-1.5-flash')

        # Initialize OpenAI for transcription and conversation
        self.intelligence = OpenAIIntelligence(
            loop=self.loop,
            api_key=openai_api_key,
            input_audio_transcription=InputAudioTranscription(model="whisper-1"),
            tools=[screen_tool],
            audio_track=self.audio_track,
            handle_function_call=self.handle_function_call,
        )

        # Storage for the latest screen frame
        self.frame_queue = deque(maxlen=1)
        self.latest_frame = None
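
How the agent is constructed and started depends on the repository's entry point, but a rough usage sketch looks like this. The helper below is illustrative only: the realtime connection is opened via OpenAIIntelligence.connect() (shown later), while joining the VideoSDK room goes through the SDK's meeting object built from meeting_config.

import asyncio

async def run_agent(meeting_id: str, token: str):
    agent = AIAgent(meeting_id=meeting_id, authToken=token, name="AI Assistant")
    # Open the realtime connection to OpenAI (see OpenAIIntelligence.connect below)
    await agent.intelligence.connect()
    # Joining the VideoSDK room is done through the SDK's meeting object built
    # from agent.meeting_config; check the repository for the exact join call.

asyncio.run(run_agent("YOUR_MEETING_ID", "YOUR_VIDEOSDK_TOKEN"))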

Real-time Screen Analysis with Gemini Vision API

One of the most powerful aspects of our AI assistant is its ability to analyze shared screens in real-time. Here's how we implement this functionality:
async def handle_function_call(self, function_call):
    """Process function calls from OpenAI, particularly for screen analysis."""
    if function_call.name == "analyze_screen":
        if not self.latest_frame:
            return "No screen content available"

        # Convert frame to image
        image_data = self.latest_frame.to_ndarray()
        image = Image.fromarray(image_data)

        try:
            # Request analysis from Gemini
            response = await self.loop.run_in_executor(
                None,
                lambda: self.vision_model.generate_content([
                    "Analyze this screen to help user. Focus on relevant UI elements, text, code, and context.",
                    image
                ])
            )
            return response.text
        except Exception as e:
            return f"Analysis error: {str(e)}"
    return "Unknown command"
The Gemini Vision API analyzes the screenshot and provides detailed information about what's on the screen, which the assistant can then use to provide contextual help to meeting participants.
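
For this to work, OpenAI needs to know that an analyze_screen function exists. The screen_tool passed into OpenAIIntelligence earlier is a function declaration along these lines; treat the exact schema as an assumption and check the repository for the real definition:

screen_tool = {
    "type": "function",
    "name": "analyze_screen",
    "description": (
        "Analyze the participant's currently shared screen when they ask about "
        "visible UI elements, text, code, errors, or on-screen workflows."
    ),
    "parameters": {"type": "object", "properties": {}, "required": []},
}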

Capturing Screen Shares

To analyze screens, we first need to capture them. Here's how we capture frames from participants' screen shares:
async def add_screenshare_listener(self, stream: Stream, peer_name: str):
    """Store the latest frame from a screen share stream."""
    print("Participant screenshare enabled", peer_name)
    while True:
        try:
            frame = await stream.track.recv()
            self.latest_frame = frame  # Update latest frame
        except Exception as e:
            traceback.print_exc()
            print("Screenshare processing error:", e)
            break

Processing Audio with OpenAI

While Gemini handles the visual analysis, OpenAI processes audio transcription and manages the conversation flow:
async def add_audio_listener(self, stream: Stream, peer_name: str):
    """Process audio from a participant and send it to OpenAI for transcription."""
    while True:
        try:
            # Get audio frame
            frame = await stream.track.recv()
            audio_data = frame.to_ndarray()[0]

            # Process audio (convert format, resample)
            audio_data_float = (
                audio_data.astype(np.float32) / np.iinfo(np.int16).max
            )
            audio_mono = librosa.to_mono(audio_data_float.T)
            audio_resampled = librosa.resample(
                audio_mono, orig_sr=48000, target_sr=16000
            )

            # Convert to PCM and send to OpenAI
            pcm_frame = (
                (audio_resampled * np.iinfo(np.int16).max)
                .astype(np.int16)
                .tobytes()
            )

            await self.intelligence.send_audio_data(pcm_frame)
        except Exception as e:
            print("Audio processing error:", e)
            break

Handling Participant Events

Our AI agent needs to respond to meeting events like participants joining, leaving, or starting screen shares:
def on_participant_joined(self, participant: Participant):
    """Setup listeners for participant audio and screen shares."""
    peer_name = participant.display_name
    print("Participant joined:", peer_name)

    # Set instructions for the AI assistant
    intelligence_instructions = """
    You are an AI meeting assistant. Follow these rules:
    1. Use analyze_screen tool when user asks about:
    - Visible UI elements
    - On-screen content
    - Application help
    - Workflow guidance
    2. Keep responses under 2 sentences
    3. Always acknowledge requests first
    """

    # Update OpenAI with instructions
    asyncio.create_task(
        self.intelligence.update_session_instructions(intelligence_instructions)
    )

    # Setup stream handlers
    participant.add_event_listener(
        ParticipantHandler(
            participant_id=participant.id,
            on_stream_enabled=on_stream_enabled,
            on_stream_disabled=on_stream_disabled
        )
    )
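
The on_stream_enabled and on_stream_disabled callbacks referenced above are defined inside on_participant_joined and simply route each stream to the right listener. A hypothetical sketch follows; the stream kind values and task bookkeeping are assumptions based on VideoSDK's Python examples, so verify them against the repository.

def on_stream_enabled(stream: Stream):
    # Route audio to transcription and screen shares to frame capture
    if stream.kind == "audio":
        self.audio_listener_tasks[stream.id] = self.loop.create_task(
            self.add_audio_listener(stream, peer_name)
        )
    elif stream.kind == "share":
        self.screenshare_listener_tasks[stream.id] = self.loop.create_task(
            self.add_screenshare_listener(stream, peer_name)
        )

def on_stream_disabled(stream: Stream):
    # Stop the matching background task when a stream ends
    task = (
        self.audio_listener_tasks.pop(stream.id, None)
        or self.screenshare_listener_tasks.pop(stream.id, None)
    )
    if task:
        task.cancel()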

Integrating OpenAI's Real-time APIs

Our project uses OpenAI's real-time APIs for transcription and conversation. The OpenAIIntelligence class manages this integration:
class OpenAIIntelligence:
    """Manages real-time communication with OpenAI's streaming API."""

    async def connect(self):
        """Establish WebSocket connection to OpenAI's real-time API."""
        url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

        self.ws = await self._http_session.ws_connect(
            url=url,
            headers={
                "Authorization": f"Bearer {self.api_key}",
                "OpenAI-Beta": "realtime=v1",
            },
        )

        # Start message handling and update session
        self.receive_message_task = self.loop.create_task(
            self.receive_message_handler()
        )
        await self.update_session(self.session_update_params)
This real-time connection allows our assistant to process speech transcription and generate responses with minimal latency.
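
The send_audio_data call used in the audio listener rides on this connection. Below is a minimal sketch of what it could look like, assuming the aiohttp WebSocket above and the realtime API's input_audio_buffer.append event; the repository's implementation may differ.

import base64
import json

async def send_audio_data(self, pcm_frame: bytes):
    """Stream 16 kHz PCM16 audio to OpenAI as base64-encoded append events."""
    if self.ws is None or self.ws.closed:
        return
    await self.ws.send_str(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_frame).decode("ascii"),
    }))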

Custom Audio Output with the AudioStreamTrack

For the assistant to speak, we implement a custom audio track:
class CustomAudioStreamTrack(CustomAudioTrack):
    """Custom audio track for streaming AI-generated speech."""

    async def add_new_bytes(self, audio_data_stream: Iterator[bytes]):
        """Add new audio data to be processed and played."""
        await self._process_audio_task_queue.put(audio_data_stream)

    async def recv(self) -> AudioFrame:
        """Provide the next audio frame for streaming."""
        # Complex audio buffering and timing logic...
        if len(self.frame_buffer) > 0:
            frame = self.frame_buffer.pop(0)
        else:
            # Create silent frame if buffer is empty
            frame = AudioFrame(format="s16", layout="mono", samples=self.samples)
            for p in frame.planes:
                p.update(bytes(p.buffer_size))

        # Set frame timing properties (pts and time_base are maintained by the
        # elided buffering logic)
        frame.pts = pts
        frame.time_base = time_base
        frame.sample_rate = self.sample_rate
        return frame
This class handles the complex buffering and timing requirements needed for smooth audio playback in the meeting.
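
On the receiving side, the realtime API streams the assistant's speech back as base64-encoded audio deltas, which OpenAIIntelligence can hand to the track. A hypothetical handler is sketched below; the actual event plumbing lives in receive_message_handler, so treat this as illustrative.

import base64

async def on_audio_response_delta(self, base64_delta: str):
    audio_bytes = base64.b64decode(base64_delta)
    # add_new_bytes accepts an iterator of byte chunks (see the signature above)
    await self.audio_track.add_new_bytes(iter([audio_bytes]))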

How Gemini Vision Enhances Meeting Intelligence

The Gemini Vision API is the key differentiator in our meeting assistant. Here's how it provides value:
  1. Contextual Understanding: When a participant asks for help with something on screen, Gemini can analyze the current content and provide relevant assistance.
  2. Technical Content Analysis: It can interpret code, diagrams, charts, and technical documents being shared.
  3. UI Navigation Help: It can guide users through unfamiliar interfaces by identifying UI elements and explaining their functions.
  4. Error Identification: It can spot and explain errors in code or applications being demonstrated.
  5. Documentation Support: It can help participants understand complex documentation or presentations.

Deploying and Running the Application

Once everything is set up, running the application is straightforward:
# Start the server (from project root)
uvicorn app:app

# Start the client (from /client folder)
npm run dev
If you're building a more complex application, you might want to explore how to make a video chat app or consider different frameworks like Angular for video chat applications.
When a participant joins a meeting and shares their screen, the AI assistant:
  1. Captures audio and transcribes it using OpenAI's Whisper
  2. When questions are asked, it processes them using GPT-4o
  3. If the question relates to screen content, it triggers the Gemini Vision API
  4. Gemini analyzes the current screen and returns detailed information
  5. The assistant formulates a helpful response using both the transcript and visual analysis
  6. The response is spoken back using text-to-speech

Performance Considerations

Real-time screen analysis can be resource-intensive. Here are some optimization strategies, with a short sketch of frame-rate control and resolution adjustment after the list:
  1. Selective Analysis: Only analyze screens when explicitly requested, rather than continuously
  2. Frame Rate Control: Analyze at a lower frequency than the screen share frame rate
  3. Resolution Adjustment: Resize frames before analysis to reduce processing requirements
  4. Caching: Cache analysis results for static content to avoid repeated processing
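
As an example of the second and third strategies, a small helper (illustrative only, not part of the repository) can rate-limit Gemini calls and downscale frames before they are converted to images:

import time

from PIL import Image

class FrameThrottler:
    """Rate-limit analyses and downscale screen frames before sending to Gemini."""

    def __init__(self, min_interval: float = 5.0, max_width: int = 1280):
        self.min_interval = min_interval  # seconds between analyses (assumed budget)
        self.max_width = max_width
        self._last_analysis = 0.0

    def prepare(self, frame) -> Image.Image | None:
        """Return a resized PIL image, or None if we analyzed too recently."""
        now = time.monotonic()
        if now - self._last_analysis < self.min_interval:
            return None
        self._last_analysis = now

        image = Image.fromarray(frame.to_ndarray())
        if image.width > self.max_width:
            ratio = self.max_width / image.width
            image = image.resize((self.max_width, int(image.height * ratio)))
        return image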

Privacy and Security Implications

When implementing this type of solution, consider:
  1. Data Processing: Ensure users are aware that screen content is being sent to external APIs
  2. Sensitive Information: Provide mechanisms to pause analysis when sensitive information is on screen
  3. API Key Security: Securely manage API keys and credentials
  4. Compliance: Consider regulatory requirements for data processing in your jurisdiction

Future Directions and Alternatives

As you develop your AI meeting assistant further, you might want to explore different approaches to real-time communication. For instance, comparing WebRTC vs. gRPC or considering newer technologies like WebTransport vs. WebRTC could inform your architecture decisions for specific use cases.
For interactive applications in different domains, you might also explore Unity WebRTC integration or consider how this approach could be adapted for interactive live streaming scenarios.

Conclusion

By combining VideoSDK's real-time communication capabilities with OpenAI's conversational AI and Google's Gemini Vision API, we've created a powerful AI meeting assistant that can understand both verbal questions and visual context.
This integration opens up new possibilities for meeting productivity, technical support, educational settings, and collaborative work. For specific use cases, you might consider building on this foundation to create HIPAA-compliant video conferencing solutions or specialized applications for sectors like fitness.
As multimodal AI continues to advance, we can expect even more sophisticated capabilities in the future, potentially enabling more complex applications like real-time messaging with integrated video.
The next time someone asks "What does this button do?" or "Can you explain what's on my screen?", your AI assistant will be ready with a helpful, contextually aware response.
Note: This implementation requires API keys for VideoSDK, OpenAI, and Google's Gemini. Make sure to follow each platform's usage policies and pricing considerations when deploying in production.
