Introduction
Gemini Live API enables low-latency, real-time voice and video interactions with Gemini. It can process continuous audio, video, or text streams to provide instant, natural-sounding voice responses.
Key Features:
- ✅ Proactive Audio: Control when and in what contexts the model responds
- ✅ Audio Transcription: Provides text transcription of user input and model output
- ✅ Tool Use: Integration with function calls and Google Search
- ✅ Empathetic Dialogue: Adjusts response style and tone based on user emotions
- ✅ Interruption: Users can interrupt the model at any time for responsive interaction
- ✅ Multilingual Support: Supports conversations in 24 languages
- ✅ High Audio Quality: Natural, realistic voice in multiple languages
Interface Description
Endpoint: wss://console.mixroute.io/v1beta/models/{model}/liveStream
Features:
- Supports all Gemini native features
- Direct passthrough, no protocol conversion
- Uses Gemini Live API native format
Authentication
Authenticate with a Bearer token, e.g., Bearer sk-xxxxxxxxxx
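As a minimal sketch of how the endpoint and the token fit together, the helper below fills the {model} placeholder and prepares a handshake header. The function name `build_connection` is hypothetical, and sending the token in an `Authorization` header is an assumption; the text above only says that a Bearer token is used.

```python
from urllib.parse import quote

# Endpoint template taken from the Interface Description section.
ENDPOINT_TEMPLATE = "wss://console.mixroute.io/v1beta/models/{model}/liveStream"


def build_connection(model: str, api_key: str) -> tuple[str, dict[str, str]]:
    """Return the WebSocket URL and the headers for the opening handshake.

    Assumption: the Bearer token travels in an Authorization header.
    """
    url = ENDPOINT_TEMPLATE.format(model=quote(model, safe=""))
    headers = {"Authorization": f"Bearer {api_key}"}
    return url, headers


url, headers = build_connection("gemini-live-2.5-flash-native-audio", "sk-xxxxxxxxxx")
print(url)
print(headers)
```

The returned URL and headers can be passed to whichever WebSocket client is in use; the client itself is left out here to keep the sketch free of third-party dependencies.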
Supported Models
The following models support Gemini Live API:
| Model ID | Availability | Use Case | Key Features |
|---|---|---|---|
| gemini-live-2.5-flash-native-audio | Generally Available | Recommended. Low-latency voice agent. Supports seamless multilingual switching and emotional tone. | Native audio, audio transcription, voice activity detection, empathetic dialogue, proactive audio, tool use |
Voice and Language Configuration
Voice Configuration
Gemini Live API supports 30 different preset voices, each with unique expressive characteristics:
| Voice Name | Style | Voice Name | Style | Voice Name | Style |
|---|---|---|---|---|---|
| Zephyr | Bright | Puck | Cheerful | Charon | Informative |
| Kore | Firm | Fenrir | Excited | Leda | Youthful |
| Orus | Firm | Aoede | Light | Callirrhoe | Easy-going |
| Autonoe | Bright | Enceladus | Breathy | Iapetus | Clear |
| Umbriel | Relaxed | Algieba | Smooth | Despina | Flowing |
| Erinome | Clear | Algenib | Raspy | Rasalgethi | Informative |
| Laomedeia | Cheerful | Achernar | Soft | Alnilam | Strong |
| Schedar | Steady | Gacrux | Mature | Pulcherrima | Positive |
| Achird | Friendly | Zubenelgenubi | Casual | Vindemiatrix | Gentle |
| Sadachbia | Lively | Sadaltager | Scholarly | Sulafat | Warm |
Language Configuration
Supports 24 languages, specified via BCP-47 language codes:
| Language | Code | Language | Code |
|---|---|---|---|
| Arabic (Egypt) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian | ro-RO |
| Ukrainian | uk-UA | Bengali | bn-BD |
| English (India) | en-IN | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
| Chinese (Simplified) | zh-CN | | |
Usage Examples
- JavaScript
- Python
Configuration Examples
- Audio Only Mode
- Audio + Text Transcription Mode (Recommended)
Parameters:
- responseModalities: Response modality. Choose exactly one of the following:
  - ["AUDIO"]: Audio output only
  - ["AUDIO", "TEXT"]: Audio + text transcription (recommended; returns both audio and text)
- voiceName: Voice name, supports 30 preset voices (see voice configuration table above)
- languageCode: Language code, supports 24 languages (see language configuration table above)
- googleSearch: Enable Google Search functionality
- proactiveAudio: Proactive audio, model can choose not to respond to irrelevant audio
- empatheticMode: Empathetic dialogue, adjusts response style based on emotions
- outputAudioTranscription: Enable audio-to-text output (requires "TEXT" in responseModalities to see transcription)
- automaticActivityDetection: Voice activity detection configuration
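The two configuration modes can be sketched as a small builder for the setup message. The field paths generationConfig.responseModalities, speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName, outputAudioTranscription, tools.googleSearch, and realtimeInputConfig.automaticActivityDetection come from this page. Nesting speechConfig inside generationConfig and placing languageCode inside speechConfig are assumptions, as is the helper name `build_setup`. The placement of proactiveAudio and empatheticMode is not stated here, so they are left out, and whether the setup message also needs a model field is not covered.

```python
import json


def build_setup(voice_name="Zephyr", language_code="en-US",
                with_text=True, google_search=False, vad=None):
    """Build a setup message as a plain dict (field placement partly assumed)."""
    generation_config = {
        # Exactly one of ["AUDIO"] or ["AUDIO", "TEXT"].
        "responseModalities": ["AUDIO", "TEXT"] if with_text else ["AUDIO"],
        # Assumption: speechConfig sits inside generationConfig and
        # carries languageCode next to voiceConfig.
        "speechConfig": {
            "voiceConfig": {"prebuiltVoiceConfig": {"voiceName": voice_name}},
            "languageCode": language_code,
        },
    }
    setup = {"generationConfig": generation_config}
    if with_text:
        # Transcription needs both this field and "TEXT" in responseModalities.
        setup["outputAudioTranscription"] = {}
    if google_search:
        setup["tools"] = {"googleSearch": {}}
    if vad is not None:
        # Passed through untouched; the sub-fields are not listed on this page.
        setup["realtimeInputConfig"] = {"automaticActivityDetection": vad}
    return {"setup": setup}


audio_only = build_setup(with_text=False)
audio_and_text = build_setup(voice_name="Kore", language_code="ja-JP",
                             google_search=True)
print(json.dumps(audio_and_text, indent=2))
```

The resulting dict would be serialized with `json.dumps` and sent as the first message on the connection.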
Message Types
Client Messages
| Message Type | Description |
|---|---|
| setup | Session configuration |
| clientContent | Client content (text/audio) |
| realtimeInput | Real-time audio input |
| toolResponse | Tool response |
Server Messages
| Message Type | Description |
|---|---|
| setupComplete | Setup complete confirmation |
| serverContent | Server content (text/audio/transcription) |
| toolCall | Tool call |
| toolCallCancellation | Tool call cancellation |
| usageMetadata | Usage statistics |
Token Statistics
The system tracks separately:
- Total Token count
- Audio Tokens (input/output)
- Text Tokens (input/output)
These counts are reported in usageMetadata messages.
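The per-modality breakdown can be sketched as follows. The payload field names used here (`totalTokenCount`, `promptTokensDetails`, `responseTokensDetails`, and the `modality` / `tokenCount` entries) are assumptions modeled on the upstream Gemini API; this page does not list the fields of usageMetadata, so verify them against real messages before relying on this.

```python
def summarize_usage(usage: dict) -> dict:
    """Group token counts by direction and modality.

    Assumption: field names follow the upstream Gemini usage metadata shape.
    """
    summary = {"total": usage.get("totalTokenCount", 0), "input": {}, "output": {}}
    sources = (("input", "promptTokensDetails"), ("output", "responseTokensDetails"))
    for direction, key in sources:
        for item in usage.get(key, []):
            modality = item.get("modality", "UNKNOWN")
            counts = summary[direction]
            counts[modality] = counts.get(modality, 0) + item.get("tokenCount", 0)
    return summary


# Hypothetical sample payload for illustration only.
sample = {
    "totalTokenCount": 150,
    "promptTokensDetails": [
        {"modality": "AUDIO", "tokenCount": 100},
        {"modality": "TEXT", "tokenCount": 10},
    ],
    "responseTokensDetails": [{"modality": "AUDIO", "tokenCount": 40}],
}
print(summarize_usage(sample))
```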
Pricing
Gemini Live API is billed by token, tracking text and audio tokens separately:
- Text Tokens: For input text content and output text transcriptions
- Audio Tokens: For input audio and output audio content
Token usage is reported in usageMetadata messages, including text and audio input/output token counts.
Technical Specifications
Audio Format
Input Audio:
- Format: 16-bit PCM
- Sample Rate: 16kHz
- Byte Order: Little-endian
- Encoding: Base64
Output Audio:
- Format: 16-bit PCM
- Sample Rate: 24kHz
- Byte Order: Little-endian
- Encoding: Base64
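The audio format above (16-bit PCM, little-endian, Base64) can be illustrated with the standard library alone. This is a minimal sketch of the encoding step only; the helper names are hypothetical, and how the Base64 string is wrapped inside a realtimeInput message is not shown because this page does not specify it.

```python
import base64
import struct

INPUT_SAMPLE_RATE = 16_000   # Hz, audio sent to the model
OUTPUT_SAMPLE_RATE = 24_000  # Hz, audio returned by the model


def encode_pcm16(samples: list[int]) -> str:
    """Pack signed 16-bit samples as little-endian PCM and Base64-encode them."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(pcm).decode("ascii")


def decode_pcm16(data: str) -> list[int]:
    """Decode Base64 little-endian 16-bit PCM back into integer samples."""
    pcm = base64.b64decode(data)
    count = len(pcm) // 2  # two bytes per sample; a trailing odd byte is ignored
    return list(struct.unpack(f"<{count}h", pcm[: count * 2]))


def duration_seconds(data: str, sample_rate: int) -> float:
    """Length of a Base64 PCM chunk in seconds at the given sample rate."""
    return len(decode_pcm16(data)) / sample_rate


chunk = encode_pcm16([0, 1, -1, 32767, -32768])
print(chunk, decode_pcm16(chunk))
```

Use `INPUT_SAMPLE_RATE` when computing the duration of microphone chunks and `OUTPUT_SAMPLE_RATE` for audio received from the model, since the two directions differ.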
FAQ
How to select a voice?
Specify the voice name in speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName in the setup message. 30 preset voices are supported; see the voice configuration section above for the complete list. The default voice is Zephyr.
How to enable audio-to-text transcription?
Two conditions must be met:
- Add the outputAudioTranscription: {} field in the setup message
- Include "TEXT" in generationConfig.responseModalities (e.g., ["AUDIO", "TEXT"])
The transcription is returned in serverContent.outputTranscription.
How to enable Google Search?
Add the tools: { googleSearch: {} } field in the setup message. Once enabled, the model can search for the latest web information when answering questions.
How to enable tool calls?
Add tool definitions in the setup message:
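A possible shape for such a definition is sketched below. The `functionDeclarations` key and the JSON-schema style `parameters` block are assumptions modeled on the upstream Gemini API, placed under tools in the same object form this page uses for googleSearch. The `get_weather` function and the helper `build_tools_setup` are hypothetical.

```python
import json


def build_tools_setup(function_declarations: list[dict]) -> dict:
    """Wrap function declarations in a setup message (assumed shape)."""
    return {"setup": {"tools": {"functionDeclarations": function_declarations}}}


# Hypothetical tool used only for illustration.
get_weather = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

message = json.dumps(build_tools_setup([get_weather]))
print(message)
```

When the model decides to use the tool, the server sends a toolCall message and the client answers with a toolResponse message, as listed in the message type tables.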
How to interrupt model response?
Sending a new realtimeInput or clientContent message will interrupt the current response.
Is video input supported?
Yes, Gemini Live API supports video input. Video data (JPEG format, 1 FPS) can be included in clientContent.
How to get usage statistics?
The system sends usageMetadata messages during or after the response, containing detailed usage statistics.
How to configure voice recognition sensitivity?
Configure realtimeInputConfig.automaticActivityDetection in the setup message.
Reference Documentation
- Gemini Live API Official Documentation
- WebSocket Protocol Documentation
- Send Audio Video Streams
- Configure Language and Voice