Introduction

Gemini Live API enables low-latency, real-time voice and video interactions with Gemini. It processes continuous audio, video, or text streams and returns natural-sounding voice responses.
Key Features:
  • ✅ Proactive Audio: Control when and in what contexts the model responds
  • ✅ Audio Transcription: Provides text transcription of user input and model output
  • ✅ Tool Use: Integration with function calls and Google Search
  • ✅ Empathetic Dialogue: Adjusts response style and tone based on user emotions
  • ✅ Interruption: Users can interrupt the model at any time for responsive interaction
  • ✅ Multilingual Support: Supports conversations in 24 languages
  • ✅ High Audio Quality: Natural, realistic voice in multiple languages

Interface Description

Endpoint: wss://console.mixroute.io/v1beta/models/{model}/liveStream
Features:
  • Supports all Gemini native features
  • Direct passthrough, no protocol conversion
  • Uses Gemini Live API native format
Example:
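// Node.js with the "ws" package; the browser WebSocket constructor does not accept custom headers
const WebSocket = require('ws');
const ws = new WebSocket('wss://console.mixroute.io/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream', {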
  headers: {
    'Authorization': 'Bearer sk-xxxx'
  }
});
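
The endpoint template and the Bearer header can be produced by two small helpers. This is a minimal sketch: `buildLiveUrl` and `authHeaders` are hypothetical names introduced here for illustration, and URL-encoding the model ID is an assumption rather than a documented requirement.

```javascript
// Base path taken from the endpoint shown above.
const LIVE_BASE = 'wss://console.mixroute.io/v1beta/models';

// Hypothetical helper: fills the {model} placeholder of the endpoint template.
function buildLiveUrl(model) {
  if (typeof model !== 'string' || model.trim() === '') {
    throw new TypeError('model must be a non-empty string');
  }
  // encodeURIComponent guards against characters that are invalid in a path segment.
  return `${LIVE_BASE}/${encodeURIComponent(model.trim())}/liveStream`;
}

// Hypothetical helper: builds the Authorization header from an API key.
function authHeaders(apiKey) {
  return { Authorization: `Bearer ${apiKey}` };
}
```

With the `ws` package, these could be passed as `new WebSocket(buildLiveUrl(model), { headers: authHeaders(key) })`.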

Authentication

Pass your API key as a Bearer token in the Authorization header, e.g., Authorization: Bearer sk-xxxxxxxxxx

Supported Models

The following models support Gemini Live API:
| Model ID | Availability | Use Case | Key Features |
| --- | --- | --- | --- |
| gemini-live-2.5-flash-native-audio | Generally Available | Recommended. Low-latency voice agent. Supports seamless multilingual switching and emotional tone. | Native audio, audio transcription, voice activity detection, empathetic dialogue, proactive audio, tool use |

Voice and Language Configuration

Voice Configuration

Gemini Live API supports 30 different preset voices, each with unique expressive characteristics:
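| Voice Name | Style | Voice Name | Style | Voice Name | Style |
| --- | --- | --- | --- | --- | --- |
| Zephyr | Bright | Puck | Cheerful | Charon | Informative |
| Kore | Firm | Fenrir | Excited | Leda | Youthful |
| Orus | Firm | Aoede | Light | Callirrhoe | Easy-going |
| Autonoe | Bright | Enceladus | Breathy | Iapetus | Clear |
| Umbriel | Relaxed | Algieba | Smooth | Despina | Flowing |
| Erinome | Clear | Algenib | Raspy | Rasalgethi | Informative |
| Laomedeia | Cheerful | Achernar | Soft | Alnilam | Strong |
| Schedar | Steady | Gacrux | Mature | Pulcherrima | Positive |
| Achird | Friendly | Zubenelgenubi | Casual | Vindemiatrix | Gentle |
| Sadachbia | Lively | Sadaltager | Scholarly | Sulafat | Warm |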
Default voice: Zephyr (Bright)
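
Because an unknown voice name is easy to mistype, it can help to check it on the client before sending the setup message. A minimal sketch, assuming client-side validation is desired; `speechConfigFor` is a hypothetical helper, and the name list is copied from the table above.

```javascript
// Voice names copied from the voice table in this section.
const PRESET_VOICES = new Set([
  'Zephyr', 'Puck', 'Charon', 'Kore', 'Fenrir', 'Leda',
  'Orus', 'Aoede', 'Callirrhoe', 'Autonoe', 'Enceladus', 'Iapetus',
  'Umbriel', 'Algieba', 'Despina', 'Erinome', 'Algenib', 'Rasalgethi',
  'Laomedeia', 'Achernar', 'Alnilam', 'Schedar', 'Gacrux', 'Pulcherrima',
  'Achird', 'Zubenelgenubi', 'Vindemiatrix', 'Sadachbia', 'Sadaltager', 'Sulafat',
]);
const DEFAULT_VOICE = 'Zephyr';

// Hypothetical helper: returns a speechConfig object for a known preset voice.
function speechConfigFor(voiceName = DEFAULT_VOICE) {
  if (!PRESET_VOICES.has(voiceName)) {
    throw new RangeError(`Unknown voice: ${voiceName}`);
  }
  return { voiceConfig: { prebuiltVoiceConfig: { voiceName } } };
}
```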

Language Configuration

Supports 24 languages, specified via BCP-47 language codes:
| Language | Code | Language | Code |
| --- | --- | --- | --- |
| Arabic (Egypt) | ar-EG | German (Germany) | de-DE |
| English (US) | en-US | Spanish (US) | es-US |
| French (France) | fr-FR | Hindi (India) | hi-IN |
| Indonesian | id-ID | Italian (Italy) | it-IT |
| Japanese (Japan) | ja-JP | Korean (Korea) | ko-KR |
| Portuguese (Brazil) | pt-BR | Russian (Russia) | ru-RU |
| Dutch (Netherlands) | nl-NL | Polish (Poland) | pl-PL |
| Thai (Thailand) | th-TH | Turkish (Turkey) | tr-TR |
| Vietnamese (Vietnam) | vi-VN | Romanian | ro-RO |
| Ukrainian | uk-UA | Bengali | bn-BD |
| English (India) | en-IN | Marathi (India) | mr-IN |
| Tamil (India) | ta-IN | Telugu (India) | te-IN |
| Chinese (Simplified) | zh-CN | | |
Default language: Automatically inferred from system instruction language
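
One way to combine an explicit language code with the inferred default is to set languageCode only when a preferred locale is in the supported list, and to omit it otherwise. A minimal sketch under that assumption; `pickLanguageCode` and `withLanguage` are hypothetical helpers, and the codes are copied from the table above.

```javascript
// Language codes copied from the language table in this section.
const LANGUAGE_CODES = new Set([
  'ar-EG', 'de-DE', 'en-US', 'es-US', 'fr-FR', 'hi-IN', 'id-ID', 'it-IT',
  'ja-JP', 'ko-KR', 'pt-BR', 'ru-RU', 'nl-NL', 'pl-PL', 'th-TH', 'tr-TR',
  'vi-VN', 'ro-RO', 'uk-UA', 'bn-BD', 'en-IN', 'mr-IN', 'ta-IN', 'te-IN',
  'zh-CN',
]);

// Returns the first supported code from `preferences`, or undefined if none match.
function pickLanguageCode(preferences) {
  return preferences.find((code) => LANGUAGE_CODES.has(code));
}

// Returns a copy of speechConfig with languageCode set only when a match exists,
// so the default (inference from the system instruction) still applies otherwise.
function withLanguage(speechConfig, preferences) {
  const languageCode = pickLanguageCode(preferences);
  return languageCode ? { ...speechConfig, languageCode } : { ...speechConfig };
}
```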

Usage Examples

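// Node.js with the "ws" package; the browser WebSocket constructor does not accept custom headers
const WebSocket = require('ws');
const ws = new WebSocket('wss://console.mixroute.io/v1beta/models/gemini-live-2.5-flash-native-audio/liveStream', {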
  headers: {
    'Authorization': 'Bearer sk-xxxx'
  }
});

ws.onopen = () => {
  console.log('WebSocket connected');
  
  // Send setup message
  ws.send(JSON.stringify({
    setup: {
      model: "gemini-live-2.5-flash-native-audio",
      generationConfig: {
        temperature: 0.7,
        responseModalities: ["AUDIO"],
        // speechConfig is nested in generationConfig, matching Configuration Examples
        speechConfig: {
          voiceConfig: {
            prebuiltVoiceConfig: {
              voiceName: "Puck"
            }
          }
        }
      },
      systemInstruction: {
        parts: [
          { text: "You are a helpful assistant. Speak naturally and conversationally." }
        ]
      }
    }
  }));
};

ws.onmessage = (event) => {
  const message = JSON.parse(event.data);
  console.log('Received:', message);
  
  if (message.serverContent) {
    // Handle output transcription (audio to text)
    if (message.serverContent.outputTranscription) {
      const text = message.serverContent.outputTranscription.text;
      if (text) {
        console.log('[Transcription]', text);
      }
    }
    
    if (message.serverContent.modelTurn) {
      // Handle model output
      message.serverContent.modelTurn.parts.forEach(part => {
        if (part.text) {
          console.log('Text:', part.text);
        }
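        // Match on the prefix: the MIME type may carry parameters such as ";rate=24000"
        if (part.inlineData && (part.inlineData.mimeType || "").startsWith("audio/pcm")) {
          // Handle audio data
          const audioData = part.inlineData.data;
          // audioData is base64-encoded 16-bit PCM (24kHz, little-endian)
        }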
      });
    }
    if (message.serverContent.turnComplete) {
      console.log('Turn complete');
    }
  }
  
  if (message.setupComplete) {
    console.log('Setup complete');
  }
};

// Send real-time audio input
function sendRealtimeAudio(audioBuffer) {
  // Buffer avoids the argument-count limit that String.fromCharCode(...bytes) hits on large chunks
  const base64Audio = Buffer.from(new Uint8Array(audioBuffer)).toString('base64');
  
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [
        {
          mimeType: "audio/pcm;rate=16000",
          data: base64Audio
        }
      ]
    }
  }));
}

// Send text message
function sendText(text) {
  ws.send(JSON.stringify({
    clientContent: {
      turns: [
        {
          role: "user",
          parts: [
            { text: text }
          ]
        }
      ],
      turnComplete: true
    }
  }));
}
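
Real-time input is usually streamed in small pieces rather than as one large buffer. The sketch below splits raw 16-bit, 16kHz PCM bytes into realtimeInput messages in the same shape as sendRealtimeAudio above. The 100 ms chunk size is an assumption chosen for illustration, not a documented requirement, and `toRealtimeMessages` is a hypothetical helper.

```javascript
const INPUT_SAMPLE_RATE = 16000; // Hz, per the input audio format
const BYTES_PER_SAMPLE = 2;      // 16-bit PCM

// Splits PCM bytes (Uint8Array) into realtimeInput messages of about `chunkMs` each.
function toRealtimeMessages(pcmBytes, chunkMs = 100) {
  const chunkBytes = Math.floor((INPUT_SAMPLE_RATE * chunkMs) / 1000) * BYTES_PER_SAMPLE;
  if (chunkBytes <= 0) throw new RangeError('chunkMs is too small');
  // View the same memory as a Buffer (no copy) to get base64 encoding.
  const bytes = Buffer.from(pcmBytes.buffer, pcmBytes.byteOffset, pcmBytes.byteLength);
  const messages = [];
  for (let offset = 0; offset < bytes.length; offset += chunkBytes) {
    const slice = bytes.subarray(offset, offset + chunkBytes);
    messages.push({
      realtimeInput: {
        mediaChunks: [
          { mimeType: 'audio/pcm;rate=16000', data: slice.toString('base64') },
        ],
      },
    });
  }
  return messages;
}
```

Each returned object can then be sent with `ws.send(JSON.stringify(message))`.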

Configuration Examples

{
  "setup": {
    "model": "gemini-live-2.5-flash-native-audio",
    "generationConfig": {
      "temperature": 0.7,
      "responseModalities": ["AUDIO"],
      "speechConfig": {
        "voiceConfig": {
          "prebuiltVoiceConfig": {
            "voiceName": "Zephyr"
          }
        },
        "languageCode": "en-US"
      }
    },
    "systemInstruction": {
      "parts": [
        {"text": "You are a friendly assistant. Answer questions naturally and conversationally."}
      ]
    }
  }
}
Configuration Notes:
  • responseModalities: Response modalities. Choose exactly one of the following:
    • ["AUDIO"] - Audio output only
    • ["AUDIO", "TEXT"] - Audio plus text transcription (recommended; returns both audio and text)
  • voiceName: Voice name, supports 30 preset voices (see voice configuration table above)
  • languageCode: Language code, supports 24 languages (see language configuration table above)
  • googleSearch: Enable Google Search functionality
  • proactiveAudio: Proactive audio, model can choose not to respond to irrelevant audio
  • empatheticMode: Empathetic dialogue, adjusts response style based on emotions
  • outputAudioTranscription: Enable audio-to-text output (requires "TEXT" in responseModalities to see transcription)
  • automaticActivityDetection: Voice activity detection configuration
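
The notes above can be folded into a small builder so that related options stay consistent, for example adding "TEXT" to responseModalities whenever transcription is requested. This is a sketch only: `buildSetup` and its option names are hypothetical, and it covers just the fields whose placement is shown elsewhere on this page (outputAudioTranscription is placed at the top level of setup, as the FAQ describes).

```javascript
// Hypothetical builder for the setup message; option names are illustrative.
function buildSetup({ model, voiceName = 'Zephyr', languageCode, systemText, transcription = false }) {
  if (!model) throw new TypeError('model is required');
  const speechConfig = { voiceConfig: { prebuiltVoiceConfig: { voiceName } } };
  if (languageCode) speechConfig.languageCode = languageCode;

  const setup = {
    model,
    generationConfig: {
      // Transcription requires "TEXT" alongside "AUDIO", per the notes above.
      responseModalities: transcription ? ['AUDIO', 'TEXT'] : ['AUDIO'],
      speechConfig,
    },
  };
  if (systemText) setup.systemInstruction = { parts: [{ text: systemText }] };
  if (transcription) setup.outputAudioTranscription = {};
  return { setup };
}
```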

Message Types

Client Messages

| Message Type | Description |
| --- | --- |
| setup | Session configuration |
| clientContent | Client content (text/audio) |
| realtimeInput | Real-time audio input |
| toolResponse | Tool response |

Server Messages

| Message Type | Description |
| --- | --- |
| setupComplete | Setup complete confirmation |
| serverContent | Server content (text/audio/transcription) |
| toolCall | Tool call |
| toolCallCancellation | Tool call cancellation |
| usageMetadata | Usage statistics |
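
Since each server message is identified by its top-level key, incoming frames can be routed with a small dispatcher. A minimal sketch: `dispatchServerMessage` is a hypothetical helper, and it checks every known key on the assumption that a single frame might carry more than one of them.

```javascript
// Top-level keys from the server message table above.
const SERVER_MESSAGE_TYPES = [
  'setupComplete', 'serverContent', 'toolCall', 'toolCallCancellation', 'usageMetadata',
];

// Calls handlers[type](payload) for each known key present in the message
// and returns the list of types that were found.
function dispatchServerMessage(message, handlers) {
  const found = [];
  for (const type of SERVER_MESSAGE_TYPES) {
    if (message[type] !== undefined) {
      if (typeof handlers[type] === 'function') handlers[type](message[type]);
      found.push(type);
    }
  }
  return found;
}
```

Inside ws.onmessage this could be called as `dispatchServerMessage(JSON.parse(event.data), { serverContent: handleContent })`, where `handleContent` is your own function.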

Token Statistics

The system tracks the following counts separately:
  • Total Token count
  • Audio Tokens (input/output)
  • Text Tokens (input/output)
Usage information is returned in usageMetadata messages:
{
  "usageMetadata": {
    "totalTokenCount": 100,
    "inputTokenCount": 50,
    "outputTokenCount": 50,
    "inputTokenDetails": {
      "textTokens": 30,
      "audioTokens": 20
    },
    "outputTokenDetails": {
      "textTokens": 25,
      "audioTokens": 25
    }
  }
}
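
To keep a running total for a session, the usageMetadata payloads can be summed as they arrive. This sketch assumes each message reports counts for its own portion of the session; if the counts turn out to be cumulative, keep only the latest message instead. `emptyUsage` and `addUsage` are hypothetical helpers.

```javascript
function emptyUsage() {
  return {
    totalTokenCount: 0,
    inputTokenCount: 0,
    outputTokenCount: 0,
    inputTokenDetails: { textTokens: 0, audioTokens: 0 },
    outputTokenDetails: { textTokens: 0, audioTokens: 0 },
  };
}

// Returns a new totals object; fields missing from `usage` count as zero.
function addUsage(totals, usage) {
  const num = (value) => (Number.isFinite(value) ? value : 0);
  const addDetails = (a, b = {}) => ({
    textTokens: a.textTokens + num(b.textTokens),
    audioTokens: a.audioTokens + num(b.audioTokens),
  });
  return {
    totalTokenCount: totals.totalTokenCount + num(usage.totalTokenCount),
    inputTokenCount: totals.inputTokenCount + num(usage.inputTokenCount),
    outputTokenCount: totals.outputTokenCount + num(usage.outputTokenCount),
    inputTokenDetails: addDetails(totals.inputTokenDetails, usage.inputTokenDetails),
    outputTokenDetails: addDetails(totals.outputTokenDetails, usage.outputTokenDetails),
  };
}
```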

Pricing

Model prices may change. Please refer to the latest prices shown in the model marketplace.
Gemini Live API is billed by token, tracking text and audio tokens separately:
  • Text Tokens: For input text content and output text transcriptions
  • Audio Tokens: For input audio and output audio content
The system returns detailed usage statistics in usageMetadata messages, including text and audio input/output token counts.
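
Because text and audio tokens are tracked separately, a cost estimate is a weighted sum over the four counts in usageMetadata. The rates below are placeholder numbers invented for this sketch and do not reflect real prices; `estimateCost` is a hypothetical helper.

```javascript
// HYPOTHETICAL rates, in currency units per one million tokens.
// Real prices must be taken from the model marketplace.
const EXAMPLE_RATES = { inputText: 1, inputAudio: 2, outputText: 3, outputAudio: 4 };

// Weighted sum of the four token counts, converted from per-million rates.
function estimateCost(usage, rates) {
  const inDetails = usage.inputTokenDetails || {};
  const outDetails = usage.outputTokenDetails || {};
  const weighted =
    (inDetails.textTokens || 0) * rates.inputText +
    (inDetails.audioTokens || 0) * rates.inputAudio +
    (outDetails.textTokens || 0) * rates.outputText +
    (outDetails.audioTokens || 0) * rates.outputAudio;
  return weighted / 1_000_000;
}
```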

Technical Specifications

Audio Format

Input Audio:
  • Format: 16-bit PCM
  • Sample Rate: 16kHz
  • Byte Order: Little-endian
  • Encoding: Base64
Output Audio:
  • Format: 16-bit PCM
  • Sample Rate: 24kHz
  • Byte Order: Little-endian
  • Encoding: Base64
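
The format above can be illustrated with a pair of converters between floating-point samples and base64 16-bit little-endian PCM. A minimal sketch for Node.js using Buffer; the function names are hypothetical, and scaling by 32767 with clamping is one common convention rather than something this API prescribes.

```javascript
// Converts samples in [-1, 1] to base64-encoded 16-bit little-endian PCM.
function floatToPcm16Base64(samples) {
  const buf = Buffer.alloc(samples.length * 2);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    buf.writeInt16LE(Math.round(clamped * 32767), i * 2);
  }
  return buf.toString('base64');
}

// Decodes base64 16-bit little-endian PCM into an Int16Array of samples.
function pcm16Base64ToInt16(base64) {
  const buf = Buffer.from(base64, 'base64');
  const out = new Int16Array(Math.floor(buf.length / 2));
  for (let i = 0; i < out.length; i++) out[i] = buf.readInt16LE(i * 2);
  return out;
}

// Duration of a mono sample run, e.g. at 16000 Hz for input or 24000 Hz for output.
function durationSeconds(sampleCount, sampleRate) {
  return sampleCount / sampleRate;
}
```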

FAQ

How do I choose a voice?
Specify the voice name in generationConfig.speechConfig.voiceConfig.prebuiltVoiceConfig.voiceName in the setup message. 30 preset voices are supported; see the Voice Configuration section above for the complete list. The default voice is Zephyr.
How do I enable output audio transcription?
Two conditions must be met:
  1. Add an outputAudioTranscription: {} field to the setup message
  2. Include "TEXT" in generationConfig.responseModalities (e.g., ["AUDIO", "TEXT"])
Once enabled, the server returns the text transcription of the model's audio in serverContent.outputTranscription.
How do I use function calling?
Add tool definitions in the setup message:
{
  "setup": {
    "tools": [
      {
        "functionDeclarations": [
          {
            "name": "get_weather",
            "description": "Get the weather",
            "parameters": {
              "type": "object",
              "properties": {
                "location": {
                  "type": "string"
                }
              }
            }
          }
        ]
      }
    ]
  }
}
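
Once a tool is declared, the server may send a toolCall message and expects a toolResponse in return. The sketch below assumes the field layout of the native Gemini Live format, which this page does not spell out: toolCall.functionCalls entries with id, name, and args, answered by toolResponse.functionResponses entries with id, name, and response. `buildToolResponse` and the stub handler are hypothetical.

```javascript
// Stub handler for illustration; a real one would query a weather service.
const toolHandlers = {
  get_weather: ({ location }) => ({ location, forecast: 'unknown' }),
};

// Builds a toolResponse message answering every call in a toolCall payload.
// Field names (functionCalls, functionResponses, id, args) are assumptions.
function buildToolResponse(toolCall, handlers) {
  const functionResponses = (toolCall.functionCalls || []).map((call) => {
    const known = Object.prototype.hasOwnProperty.call(handlers, call.name);
    const response = known
      ? handlers[call.name](call.args || {})
      : { error: `No handler for ${call.name}` };
    return { id: call.id, name: call.name, response };
  });
  return { toolResponse: { functionResponses } };
}
```

The result would be sent with `ws.send(JSON.stringify(buildToolResponse(message.toolCall, toolHandlers)))`.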
How do I interrupt the model's response?
Sending a new realtimeInput or clientContent message interrupts the current response.
Is video input supported?
Yes, Gemini Live API supports video input. Video data (JPEG format, 1 FPS) can be included in clientContent.
How do I get token usage?
The system sends usageMetadata messages during or after the response, containing detailed usage statistics.
How do I configure voice activity detection?
Configure realtimeInputConfig.automaticActivityDetection in the setup message:
{
  "realtimeInputConfig": {
    "automaticActivityDetection": {
      "startOfSpeechSensitivity": "START_SENSITIVITY_LOW",
      "endOfSpeechSensitivity": "END_SENSITIVITY_HIGH",
      "prefixPaddingMs": 0,
      "silenceDurationMs": 0
    }
  }
}
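
A small builder can catch invalid detection settings before they reach the server. This is a sketch under assumptions: the page only shows START_SENSITIVITY_LOW and END_SENSITIVITY_HIGH, so the opposite values (START_SENSITIVITY_HIGH, END_SENSITIVITY_LOW) are assumed counterparts, and `buildRealtimeInputConfig` is a hypothetical helper.

```javascript
// The *_HIGH / *_LOW counterparts not shown on this page are assumptions.
const START_SENSITIVITIES = ['START_SENSITIVITY_LOW', 'START_SENSITIVITY_HIGH'];
const END_SENSITIVITIES = ['END_SENSITIVITY_LOW', 'END_SENSITIVITY_HIGH'];

function buildRealtimeInputConfig(options = {}) {
  const {
    startOfSpeechSensitivity = 'START_SENSITIVITY_LOW',
    endOfSpeechSensitivity = 'END_SENSITIVITY_HIGH',
    prefixPaddingMs = 0,
    silenceDurationMs = 0,
  } = options;
  if (!START_SENSITIVITIES.includes(startOfSpeechSensitivity)) {
    throw new RangeError(`Invalid startOfSpeechSensitivity: ${startOfSpeechSensitivity}`);
  }
  if (!END_SENSITIVITIES.includes(endOfSpeechSensitivity)) {
    throw new RangeError(`Invalid endOfSpeechSensitivity: ${endOfSpeechSensitivity}`);
  }
  for (const [name, value] of Object.entries({ prefixPaddingMs, silenceDurationMs })) {
    if (!Number.isInteger(value) || value < 0) {
      throw new RangeError(`${name} must be a non-negative integer`);
    }
  }
  return {
    realtimeInputConfig: {
      automaticActivityDetection: {
        startOfSpeechSensitivity, endOfSpeechSensitivity, prefixPaddingMs, silenceDurationMs,
      },
    },
  };
}
```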

Reference Documentation

wscat -c "wss://console.mixroute.io/ws/google.ai.generativelanguage.v1beta.GenerativeService.BidiGenerateContent?key=sk-xxxxxxxxxx"
{
  "setupComplete": {}
}