Gemini Live - MixRoute API

Introduction

Gemini Live API enables low-latency, real-time voice and video interactions with Gemini. It can process continuous audio, video, or text streams to provide instant, natural-sounding voice responses. Key Features:

✅ Proactive Audio: Control when and in what contexts the model responds
✅ Audio Transcription: Provides text transcription of user input and model output
✅ Tool Use: Integration with function calls and Google Search
✅ Empathetic Dialogue: Adjusts response style and tone based on user emotions
✅ Interruption: Users can interrupt the model at any time for responsive interaction
✅ Multilingual Support: Supports conversations in 24 languages
✅ High Audio Quality: Natural, realistic voice in multiple languages

Interface Description

Endpoint: wss://api.mixroute.ai/v1beta/models/{model}/liveStream Features:

Supports all Gemini native features
Direct passthrough, no protocol conversion
Uses Gemini Live API native format

Example:

const ws = new WebSocket('wss://api.mixroute.ai/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream', {
  headers: {
    'Authorization': 'Bearer sk-xxxx'
  }
});

Authentication

Bearer Token, e.g., Bearer sk-xxxxxxxxxx

Supported Models

The following models support Gemini Live API:

Model ID	Availability	Use Case	Key Features
`gemini-live-2.5-flash-native-audio`	Generally Available	Recommended. Low-latency voice agent. Supports seamless multilingual switching and emotional tone.	Native audio, audio transcription, voice activity detection, empathetic dialogue, proactive audio, tool use

Voice and Language Configuration

Voice Configuration

Gemini Live API supports 30 different preset voices, each with unique expressive characteristics:

Voice Name	Style	Voice Name	Style	Voice Name	Style
Zephyr	Bright	Puck	Cheerful	Charon	Informative
Kore	Firm	Fenrir	Excited	Leda	Youthful
Orus	Firm	Aoede	Light	Callirrhoe	Easy-going
Autonoe	Bright	Enceladus	Breathy	Iapetus	Clear
Umbriel	Relaxed	Algieba	Smooth	Despina	Flowing
Erinome	Clear	Algenib	Raspy	Rasalgethi	Informative
Laomedeia	Cheerful	Achernar	Soft	Alnilam	Strong
Schedar	Steady	Gacrux	Mature	Pulcherrima	Positive
Achird	Friendly	Zubenelgenubi	Casual	Vindemiatrix	Gentle
Sadachbia	Lively	Sadaltager	Scholarly	Sulafat	Warm

Default voice: Zephyr (Bright)

Language Configuration

Supports 24 languages, specified via BCP-47 language codes:

Language	Code	Language	Code
Arabic (Egypt)	ar-EG	German (Germany)	de-DE
English (US)	en-US	Spanish (US)	es-US
French (France)	fr-FR	Hindi (India)	hi-IN
Indonesian	id-ID	Italian (Italy)	it-IT
Japanese (Japan)	ja-JP	Korean (Korea)	ko-KR
Portuguese (Brazil)	pt-BR	Russian (Russia)	ru-RU
Dutch (Netherlands)	nl-NL	Polish (Poland)	pl-PL
Thai (Thailand)	th-TH	Turkish (Turkey)	tr-TR
Vietnamese (Vietnam)	vi-VN	Romanian	ro-RO
Ukrainian	uk-UA	Bengali	bn-BD
English (India)	en-IN	Marathi (India)	mr-IN
Tamil (India)	ta-IN	Telugu (India)	te-IN
Chinese (Simplified)	zh-CN

Default language: Automatically inferred from system instruction language

Usage Examples

JavaScript
Python

const ws = new WebSocket('wss://api.mixroute.ai/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream', {
  headers: {
    'Authorization': 'Bearer sk-xxxx'
  }
});

ws.onopen = () => {
  console.log('WebSocket connected');
  
  // Send setup message
  ws.send(JSON.stringify({
    setup: {
      model: "gemini-live-2.5-flash-native-audio",
      generationConfig: {
        temperature: 0.7,
        responseModalities: ["AUDIO"]
      },
      systemInstruction: {
        parts: [
          { text: "You are a helpful assistant. Speak naturally and conversationally." }
        ]
      },
      speechConfig: {
        voiceConfig: {
          prebuiltVoiceConfig: {
            voiceName: "Puck"
          }
        }
      }
    }
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('Received:', message);
  
  if (message.serverContent) {
    // Handle output transcription (audio to text)
    if (message.serverContent.outputTranscription) {
      const text = message.serverContent.outputTranscription.text;
      if (text) {
        console.log('[Transcription]', text);
      }
    }
    
    if (message.serverContent.modelTurn) {
      // Handle model output
      message.serverContent.modelTurn.parts.forEach(part => {
        if (part.text) {
          console.log('Text:', part.text);
        }
        if (part.inlineData && part.inlineData.mimeType === "audio/pcm") {
          // Handle audio data
          const audioData = part.inlineData.data;
          // audioData is base64 encoded PCM audio
        }
      });
    }
    if (message.serverContent.turnComplete) {
      console.log('Turn complete');
    }
  }
  
  if (message.setupComplete) {
    console.log('Setup complete');
  }
};

// Send real-time audio input
function sendRealtimeAudio(audioBuffer) {
  const base64Audio = btoa(
    String.fromCharCode(...new Uint8Array(audioBuffer))
  );
  
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [
        {
          mimeType: "audio/pcm;rate=16000",
          data: base64Audio
        }
      ]
    }
  }));
}

// Send text message
function sendText(text) {
  ws.send(JSON.stringify({
    clientContent: {
      turns: [
        {
          role: "user",
          parts: [
            { text: text }
          ]
        }
      ],
      turnComplete: true
    }
  }));
}

import websocket
import json
import base64
import threading

def on_message(ws, message):
    data = json.loads(message)
    print(f"Received: {data}")
    
    # Handle output transcription
    if "serverContent" in data:
        server_content = data["serverContent"]
        
        if "outputTranscription" in server_content:
            transcription = server_content["outputTranscription"]
            text = transcription.get("text", "")
            if text:
                print(f"[Transcription] {text}")
        
        if "modelTurn" in server_content:
            model_turn = server_content["modelTurn"]
            if "parts" in model_turn:
                for part in model_turn["parts"]:
                    if "text" in part:
                        print(f"Text: {part['text']}")
                    elif "inlineData" in part:
                        inline_data = part["inlineData"]
                        if inline_data.get("mimeType") == "audio/pcm":
                            audio_b64 = inline_data.get("data", "")
                            if audio_b64:
                                audio_data = base64.b64decode(audio_b64)
                                # Handle audio data

def on_error(ws, error):
    print(f"Error: {error}")

def on_close(ws, close_status_code, close_msg):
    print("Connection closed")

def on_open(ws):
    print("WebSocket connected")
    
    # Send setup message
    setup_message = {
        "setup": {
            "model": "gemini-live-2.5-flash-native-audio",
            "generationConfig": {
                "temperature": 0.7,
                "responseModalities": ["AUDIO"]
            },
            "systemInstruction": {
                "parts": [
                    {"text": "You are a helpful assistant."}
                ]
            },
            "speechConfig": {
                "voiceConfig": {
                    "prebuiltVoiceConfig": {
                        "voiceName": "Puck"
                    }
                }
            }
        }
    }
    ws.send(json.dumps(setup_message))

# Connect WebSocket
ws_url = "wss://api.mixroute.ai/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream"
ws = websocket.WebSocketApp(
    ws_url,
    header={"Authorization": "Bearer sk-xxxx"},
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close
)

ws.run_forever()

Configuration Examples

Audio Only Mode
Audio + Text Transcription Mode (Recommended)

{
  "setup": {
    "model": "gemini-live-2.5-flash-native-audio",
    "generationConfig": {
      "temperature": 0.7,
      "responseModalities": ["AUDIO"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": {
            "voiceName": "Zephyr"
          }
        },
        "languageCode": "en-US"
      }
    },
    "systemInstruction": {
      "parts": [
        {"text": "You are a friendly assistant. Answer questions naturally and conversationally."}
      ]
    }
  }
}

{
  "setup": {
    "model": "gemini-live-2.5-flash-native-audio",
    "generationConfig": {
      "temperature": 0.7,
      "responseModalities": ["AUDIO", "TEXT"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": {
            "voiceName": "Zephyr"
          }
        },
        "languageCode": "en-US"
      }
    },
    "systemInstruction": {
      "parts": [
        {"text": "You are a friendly assistant. Answer questions naturally and conversationally."}
      ]
    },
    "tools": {
      "googleSearch": {}
    },
    "proactivity": {
      "proactiveAudio": false,
      "empatheticMode": true
    },
    "outputAudioTranscription": {},
    "realtimeInputConfig": {
      "automaticActivityDetection": {
        "disabled": false,
        "startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
        "endOfSpeechSensitivity": "END_SENSITIVITY_HIGH",
        "prefixPaddingMs": 0,
        "silenceDurationMs": 0
      }
    }
  }
}

Configuration Notes:

responseModalities: Response modality, can only choose one of the following:
- ["AUDIO"] - Audio output only
- ["AUDIO", "TEXT"] - Audio + text transcription (recommended, get both audio and text)
voiceName: Voice name, supports 30 preset voices (see voice configuration table above)
languageCode: Language code, supports 24 languages (see language configuration table above)
googleSearch: Enable Google Search functionality
proactiveAudio: Proactive audio, model can choose not to respond to irrelevant audio
empatheticMode: Empathetic dialogue, adjusts response style based on emotions
outputAudioTranscription: Enable audio-to-text output (requires "TEXT" in responseModalities to see transcription)
automaticActivityDetection: Voice activity detection configuration

Message Types

Client Messages

Message Type	Description
`setup`	Session configuration
`clientContent`	Client content (text/audio)
`realtimeInput`	Real-time audio input
`toolResponse`	Tool response

Server Messages

Message Type	Description
`setupComplete`	Setup complete confirmation
`serverContent`	Server content (text/audio/transcription)
`toolCall`	Tool call
`toolCallCancellation`	Tool call cancellation
`usageMetadata`	Usage statistics

Token Statistics

System tracks separately:

Total Token count
Audio Tokens (input/output)
Text Tokens (input/output)

Usage information is returned in usageMetadata messages:

{
  "usageMetadata": {
    "totalTokenCount": 100,
    "inputTokenCount": 50,
    "outputTokenCount": 50,
    "inputTokenDetails": {
      "textTokens": 30,
      "audioTokens": 20
    },
    "outputTokenDetails": {
      "textTokens": 25,
      "audioTokens": 25
    }
  }
}

Pricing

Model prices may change. Please refer to the latest prices shown in the model marketplace.

Gemini Live API is billed by token, tracking text and audio tokens separately:

Text Tokens: For input text content and output text transcriptions
Audio Tokens: For input audio and output audio content

The system returns detailed usage statistics in usageMetadata messages, including text and audio input/output token counts.

Technical Specifications

Audio Format

Input Audio:

Format: 16-bit PCM
Sample Rate: 16kHz
Byte Order: Little-endian
Encoding: Base64

Output Audio:

Format: 16-bit PCM
Sample Rate: 24kHz
Byte Order: Little-endian
Encoding: Base64

FAQ

How to select a voice?

Specify the voice name in speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName in the setup message. 30 preset voices are supported, see the voice configuration section above for the complete list. Default voice is Zephyr.

How to enable audio-to-text transcription?

Two conditions must be met:

Add outputAudioTranscription: {} field in the setup message
Include "TEXT" in generationConfig.responseModalities (e.g., ["AUDIO", "TEXT"])

Once enabled, the server will return audio text transcription in serverContent.outputTranscription.

How to enable Google Search?

Add tools: { googleSearch: {} } field in the setup message. Once enabled, the model can search for the latest web information when answering questions.

How to enable tool calls?

Add tool definitions in the setup message:

{
  "setup": {
    "tools": {
      "functionDeclarations": [
        {
          "name": "get_weather",
          "description": "Get the weather",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string"
              }
            }
          }
        }
      ]
    }
  }
}

How to interrupt model response?

Sending a new realtimeInput or clientContent message will interrupt the current response.

Is video input supported?

Yes, Gemini Live API supports video input. Video data (JPEG format, 1 FPS) can be included in clientContent.

How to get usage statistics?

The system sends usageMetadata messages during or after the response, containing detailed usage statistics.

How to configure voice recognition sensitivity?

Configure in realtimeInputConfig.automaticActivityDetection in the setup message:

{
  "realtimeInputConfig": {
    "automaticActivityDetection": {
      "startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
      "endOfSpeechSensitivity": "END_SENSITIVITY_HIGH",
      "prefixPaddingMs": 0,
      "silenceDurationMs": 0
    }
  }
}

Reference Documentation

wscat -c "wss://api.mixroute.ai/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=sk-xxxxxxxxxx"

{
  "setupComplete": {}
}

​Introduction

​Interface Description

​Authentication

​Supported Models

​Voice and Language Configuration

​Voice Configuration

​Language Configuration

​Usage Examples

​Configuration Examples

​Message Types

​Client Messages

​Server Messages

​Token Statistics

​Pricing

​Technical Specifications

​Audio Format

​FAQ

​Reference Documentation

Introduction

Interface Description

Authentication

Supported Models

Voice and Language Configuration

Voice Configuration

Language Configuration

Usage Examples

Configuration Examples

Message Types

Client Messages

Server Messages

Token Statistics

Pricing

Technical Specifications

Audio Format

FAQ

Reference Documentation