
Building a Real-time AI Translator with OpenAI's Voice API and VideoSDK

Learn how to build a real-time AI translator using OpenAI's Realtime Voice API and VideoSDK. Create applications that instantly translate conversations between different languages.

Imagine a world where language barriers disappear in real-time conversations. This is no longer science fictionβ€”it's achievable today with OpenAI's groundbreaking Realtime Voice API and VideoSDK. In this article, we'll guide you through building your own AI translator that enables seamless communication across different languages.

The Power of OpenAI's Realtime Voice API

OpenAI's Realtime Voice API represents a significant leap forward in voice technology. Unlike traditional speech processing systems that rely on separate speech-to-text and text-to-speech steps, the Realtime API offers a streamlined, low-latency solution designed specifically for conversational applications.
What makes this API revolutionary is its ability to process audio streams in real-time while maintaining context throughout the conversation. This means translations happen almost instantly, creating natural-feeling conversations despite language differences.
The gpt-4o-realtime-preview model used in the reference repository offers impressive capabilities:
  • Near-instantaneous translation between languages
  • Preservation of tone, emphasis, and speaking style
  • Context awareness throughout conversations
  • Support for multiple speakers
  • Ability to follow complex instructions

Integrating VideoSDK for a Complete Solution

While OpenAI handles the language transformation, VideoSDK provides the infrastructure for audio/video communication. This combination creates a powerful platform for real-time multilingual interaction.
VideoSDK offers essential features like:
  • Real-time audio/video streaming
  • Meeting creation and management
  • Participant tracking
  • Custom audio track handling
Together, these technologies enable a seamless experience where users can see and hear each other while communicating in different languages. This approach builds on fundamental WebRTC principles for reliable communication.
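
Before the agent can join a call, you need a VideoSDK meeting (room). Below is a minimal sketch of room creation using VideoSDK's v2 rooms REST endpoint; the `VIDEOSDK_AUTH_TOKEN` environment variable name is an assumption for this example, so adapt it to however you store your token.

```python
import os
import requests

# Assumption: the VideoSDK auth token is stored in this environment variable.
VIDEOSDK_AUTH_TOKEN = os.environ["VIDEOSDK_AUTH_TOKEN"]

def create_meeting() -> str:
    """Create a VideoSDK room and return its roomId."""
    response = requests.post(
        "https://api.videosdk.live/v2/rooms",
        headers={"Authorization": VIDEOSDK_AUTH_TOKEN},
    )
    response.raise_for_status()
    return response.json()["roomId"]

if __name__ == "__main__":
    print("Meeting ID:", create_meeting())
```

The returned meeting ID is what both the frontend and the AI agent use to join the same room.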

System Architecture

Let's look at how our AI translator is architected based on the GitHub repository:
1β”œβ”€β”€ agent/
2β”‚   β”œβ”€β”€ ai_agent.py           # Core AI translation agent
3β”‚   └── audio_stream_track.py # Custom audio processing
4β”œβ”€β”€ intelligence/
5β”‚   └── openai/
6β”‚       └── openai_intelligence.py  # OpenAI API integration
7β”œβ”€β”€ client/                   # Frontend application
8└── README.md                 # Project documentation
9
The system follows these high-level steps:
  1. Capture audio from participants
  2. Stream audio to OpenAI's Realtime API
  3. Process and translate speech
  4. Stream translated audio back to participants
This architecture leverages WebSocket communication for reliable real-time data exchange, which is critical for fluid translations.

The AI Translation Agent

The heart of this system is the AIAgent class in ai_agent.py. Let's examine its key components:

Initialization and Setup

```python
def __init__(self, meeting_id: str, authToken: str, name: str):
    self.loop = asyncio.get_event_loop()
    self.audio_track = CustomAudioStreamTrack(
        loop=self.loop,
        handle_interruption=True
    )
    self.meeting_config = MeetingConfig(
        name=name,
        meeting_id=meeting_id,
        token=authToken,
        mic_enabled=True,
        webcam_enabled=False,
        custom_microphone_audio_track=self.audio_track
    )
    # Initialize OpenAI connection parameters
    self.intelligence = OpenAIIntelligence(
        loop=self.loop,
        api_key=api_key,
        base_url="api.openai.com",
        input_audio_transcription=InputAudioTranscription(model="whisper-1"),
        audio_track=self.audio_track
    )

    self.participants_data = {}
```
This initialization sets up:
  • An event loop for asynchronous operations
  • A custom audio track for handling the AI's speech
  • Meeting configuration using VideoSDK
  • Connection to OpenAI's intelligence services
  • Storage for participant information
The configuration draws on Python WebRTC implementation techniques for optimal performance.
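
To tie this together, a hypothetical entry point might load the credentials, construct the agent, and join the meeting. The import path follows the repository layout shown earlier, but the `join()` call is purely illustrative; use whatever join/run method the repository's AIAgent actually exposes.

```python
import os
from dotenv import load_dotenv

# Hypothetical bootstrap; adjust names to match the repository's real entry point.
from agent.ai_agent import AIAgent

load_dotenv()

meeting_id = os.environ["MEETING_ID"]            # e.g. a room created via the VideoSDK API
videosdk_token = os.environ["VIDEOSDK_AUTH_TOKEN"]

agent = AIAgent(meeting_id=meeting_id, authToken=videosdk_token, name="AI Translator")
agent.join()  # illustrative: the repo may expose join()/run() or an async equivalent
```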

Dynamic Language Translation

One of the most impressive aspects of this implementation is how it dynamically detects participant languages and creates appropriate translation instructions:
```python
def on_participant_joined(self, participant: Participant):
    peer_name = participant.display_name
    native_lang = participant.meta_data["preferredLanguage"]
    self.participants_data[participant.id] = {
        "name": peer_name,
        "lang": native_lang
    }

    if len(self.participants_data) == 2:
        # Extract the info for each participant
        participant_ids = list(self.participants_data.keys())
        p1 = self.participants_data[participant_ids[0]]
        p2 = self.participants_data[participant_ids[1]]

        # Build translator-specific instructions
        translator_instructions = f"""
            You are a real-time translator bridging a conversation between:
            - {p1['name']} (speaks {p1['lang']})
            - {p2['name']} (speaks {p2['lang']})

            Repeat each utterance exactly, word for word, in the other language:
            when {p1['lang']} is spoken, say the same thing in {p2['lang']};
            when {p2['lang']} is spoken, say the same thing in {p1['lang']}.
            Keep track of who speaks which language.
            NOTE -
            Your job is only to translate from one language to the other; do not engage in the conversation.
        """

        # Update OpenAI's instructions
        asyncio.create_task(self.intelligence.update_session_instructions(translator_instructions))
```
This code:
  1. Extracts each participant's preferred language from their metadata
  2. Creates custom instructions for the AI to perform bi-directional translation
  3. Updates the OpenAI session with these instructions
This approach ensures the AI knows exactly which languages to translate between, personalizing the experience based on the participants' needs. The implementation leverages techniques similar to those used in conversational AI voice agents.
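
Under the hood, `update_session_instructions` only needs to send a `session.update` event over the open Realtime WebSocket. The method body below is a sketch (the repository's version may differ), but `session.update` is the documented Realtime API event for changing system behavior mid-session:

```python
import json

async def update_session_instructions(self, instructions: str):
    # If the WebSocket isn't open yet, remember the instructions for later.
    if self.ws is None:
        self.pending_instructions = instructions
        return

    # session.update changes the session's system instructions on the fly.
    await self.ws.send_str(json.dumps({
        "type": "session.update",
        "session": {"instructions": instructions},
    }))
```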

Real-time Audio Processing

For the translation to feel natural, audio must be processed in real-time. The repository includes a sophisticated audio processing pipeline:

Audio Capture and Processing

```python
async def add_audio_listener(self, stream: Stream):
    while True:
        try:
            await asyncio.sleep(0.01)
            if not self.intelligence.ws:
                continue

            frame = await stream.track.recv()
            audio_data = frame.to_ndarray()[0]
            audio_data_float = (
                audio_data.astype(np.float32) / np.iinfo(np.int16).max
            )
            audio_mono = librosa.to_mono(audio_data_float.T)
            audio_resampled = librosa.resample(
                audio_mono, orig_sr=48000, target_sr=16000
            )
            pcm_frame = (
                (audio_resampled * np.iinfo(np.int16).max)
                .astype(np.int16)
                .tobytes()
            )

            # Send to OpenAI
            await self.intelligence.send_audio_data(pcm_frame)

        except Exception as e:
            print("Audio processing error:", e)
            break
```
This function:
  1. Captures audio frames from a participant's stream
  2. Converts the audio to a suitable format (mono, 16kHz sampling rate)
  3. Sends the processed audio to OpenAI for translation
The audio pipeline builds on principles used in AI phone agents with voice integration to ensure clear transmission.
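
On the OpenAI side, `send_audio_data` forwards those PCM bytes to the Realtime API as an `input_audio_buffer.append` event carrying base64-encoded audio. The method body here is a sketch rather than the repository's exact code, but the event name and payload shape follow the Realtime API:

```python
import base64
import json

async def send_audio_data(self, pcm_frame: bytes):
    # The Realtime API expects base64-encoded PCM16 audio in append events.
    await self.ws.send_str(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_frame).decode("utf-8"),
    }))
```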

Custom Audio Track for AI Speech

The AI's translated speech is delivered through a CustomAudioStreamTrack that handles buffer management and smooth audio delivery:
```python
async def add_new_bytes(self, audio_data_stream: Iterator[bytes]):
    await self._process_audio_task_queue.put(audio_data_stream)
```
This allows the AI to "speak" the translations through the VideoSDK stream, creating a natural conversation flow.
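
Internally, a background task drains that queue and turns raw PCM into audio frames that aiortc can serve from `recv()`. The repository's implementation is more elaborate (buffering, interruption handling), but a simplified sketch using PyAV, assuming 24 kHz mono PCM16 output from OpenAI, looks like this:

```python
import fractions

import av
import numpy as np

SAMPLE_RATE = 24000  # assumption: OpenAI Realtime PCM16 output rate

def build_audio_frame(pcm_chunk: bytes, pts: int) -> av.AudioFrame:
    """Wrap raw PCM16 bytes in a PyAV frame that aiortc's recv() can return."""
    samples = np.frombuffer(pcm_chunk, dtype=np.int16)
    frame = av.AudioFrame(format="s16", layout="mono", samples=len(samples))
    frame.planes[0].update(samples.tobytes())
    frame.sample_rate = SAMPLE_RATE
    frame.pts = pts
    frame.time_base = fractions.Fraction(1, SAMPLE_RATE)
    return frame
```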

OpenAI Intelligence Integration

The OpenAIIntelligence class manages the real-time connection to OpenAI's services:
```python
async def connect(self):
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    logger.info("Establishing OpenAI WS connection... ")
    self.ws = await self._http_session.ws_connect(
        url=url,
        headers={
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1",
        },
    )

    if self.pending_instructions is not None:
        await self.update_session_instructions(self.pending_instructions)

    logger.info("OpenAI WS connection established")
    self.receive_message_task = self.loop.create_task(
        self.receive_message_handler()
    )

    await self.update_session(self.session_update_params)
    await self.receive_message_task
```
This establishes a WebSocket connection to OpenAI's Realtime API, which enables:
  • Continuous audio streaming to the AI
  • Real-time receipt of translated audio
  • Dynamic updates to system instructions
The implementation leverages advanced messaging protocols to ensure reliable data transmission.

Handling AI Responses

The system processes OpenAI's responses in real-time:
```python
def handle_response(self, message: str):
    message = json.loads(message)

    match message["type"]:
        case EventType.SESSION_CREATED:
            logger.info(f"Server Message: {message['type']}")

        case EventType.SESSION_UPDATE:
            logger.info(f"Server Message: {message['type']}")

        case EventType.RESPONSE_AUDIO_DELTA:
            logger.info(f"Server Message: {message['type']}")
            self.on_audio_response(base64.b64decode(message["delta"]))

        case EventType.RESPONSE_AUDIO_TRANSCRIPT_DONE:
            logger.info(f"Server Message: {message['type']}")
            print(f"Response Transcription: {message['transcript']}")

        case EventType.ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED:
            logger.info(f"Server Message: {message['type']}")
            print(f"Client Transcription: {message['transcript']}")
```
This code processes different types of messages from OpenAI:
  • Delta audio chunks (pieces of the translated speech)
  • Transcripts of the AI's response
  • Transcripts of the user's input
  • Session status updates
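The EventType values map directly to the Realtime API's event names. A sketch of how such constants might be defined is shown below; the exact class in the repository may differ, but the event strings are the documented ones:

```python
class EventType:
    SESSION_CREATED = "session.created"
    SESSION_UPDATE = "session.updated"
    RESPONSE_AUDIO_DELTA = "response.audio.delta"
    RESPONSE_AUDIO_TRANSCRIPT_DONE = "response.audio_transcript.done"
    ITEM_INPUT_AUDIO_TRANSCRIPTION_COMPLETED = (
        "conversation.item.input_audio_transcription.completed"
    )
```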

Setting Up Your Own AI Translator

To set up this project yourself, follow these steps based on the repository README:
  1. Clone the repository:
```sh
git clone https://github.com/videosdk-community/videosdk-openai-realtime-translator.git
cd videosdk-openai-realtime-translator
```
  2. Set up the client:
```sh
cd client
cp .env.example .env
```
     Then add your VideoSDK token to the client .env file: VITE_APP_VIDEOSDK_TOKEN=your_videosdk_auth_token_here
  3. Set up the server (from the project root):
```sh
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
```
     Add your OpenAI API key to the root .env file: OPENAI_API_KEY=your_openai_key_here
  4. Start the application:
```sh
# Start the server
uvicorn app:app

# In another terminal, start the client
cd client
npm run dev
```
For a more comprehensive understanding of WebRTC implementation, check out this WebRTC tutorial guide.

Challenges and Considerations

While building an AI translator with OpenAI's Realtime API, be aware of these challenges:
  1. API Costs: The Realtime API can be expensive for high-volume usage. Plan your budget accordingly.
  2. Network Latency: Even with optimized code, network conditions can affect translation speed. Consider fallback mechanisms for poor connections, such as implementing TURN servers for improved reliability.
  3. Language Limitations: While OpenAI supports many languages, performance may vary across different language pairs.
  4. Turn Detection: Fine-tuning the speech detection parameters is crucial for a natural conversation flow (see the sketch after this list).
  5. Error Handling: As seen in the code, robust error handling is essential for maintaining a stable experience.
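For the turn-detection point above, the Realtime API exposes server-side voice activity detection (VAD) settings that can be tuned via `session.update`. The values below are illustrative starting points, not the repository's configuration:

```python
# Sketch of server-side VAD settings for the Realtime API.
turn_detection = {
    "type": "server_vad",
    "threshold": 0.5,            # how confident the model must be that speech started
    "prefix_padding_ms": 300,    # audio retained from just before speech was detected
    "silence_duration_ms": 500,  # silence required before the turn is considered over
}

session_update = {
    "type": "session.update",
    "session": {"turn_detection": turn_detection},
}
```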

Future Enhancements

This AI translator could be enhanced with:
  1. Multilingual Support: Extending beyond two participants to handle group conversations in multiple languages, with techniques used in AI interview assistants.
  2. Custom Voices: Allowing users to choose voices that match their preferences.
  3. Translation Memory: Implementing a system to remember and consistently translate specific terms.
  4. Visual Cues: Adding visual indicators to show when someone is speaking or when translation is occurring.
  5. Offline Mode: Implementing a lighter model for situations with limited connectivity.

Integrating with Video Chat Applications

The real-time translation capabilities we've built here can be extended further by integrating with full-featured video chat applications:
  1. Angular Integration: Combine with Angular video chat functionality for enterprise applications.
  2. React Native: Create mobile translation apps using React Native real-time messaging techniques.
  3. Mobile Video Chat: Implement in a comprehensive video chat app for wider adoption.
  4. Twilio Integration: Combine with Twilio's WebRTC capabilities for additional communication channels.
  5. HIPAA Compliance: Ensure medical translation applications meet HIPAA compliance standards.

Conclusion

Building a real-time AI translator with OpenAI's Realtime Voice API and VideoSDK demonstrates the incredible potential of modern AI technologies. By combining these powerful tools, developers can create solutions that bridge language gaps and foster global communication.
The repository we've examined provides a solid foundation for creating your own translation applications. Whether for business meetings, educational settings, or personal connections, these technologies open up new possibilities for human interaction across language barriers.
As OpenAI continues to refine its real-time capabilities and models become more efficient, we can expect even more natural and seamless translation experiences in the future.
Ready to build your own AI translator? Get started with the repository today and be part of breaking down language barriers worldwide. For deeper insights into voice translation agents, explore VideoSDK's intelligent virtual assistants and AI translation agents.
