
How to generate text from audio data in Gemini API

Author: LoRA

Gemini can analyze and understand audio input, which enables use cases such as:

1. Describing, summarizing, or answering questions about audio content.

2. Transcribing audio.

3. Analyzing specific segments of the audio.

Before calling the Gemini API, make sure you have installed your chosen SDK and configured your Gemini API key.

1. Provide audio input

You can provide audio data to Gemini in the following ways:

Upload an audio file first, then reference it in a generateContent request.

Pass inline audio data directly in the generateContent request.

1. Upload audio files

You can upload audio files using the Files API. Always use the Files API if the total request size (including files, text prompts, system instructions, etc.) exceeds 20 MB.

The following code uploads the audio file and then uses it in a call to generateContent.

Python

 from google import genai
client = genai.Client(api_key="GOOGLE_API_KEY")
myfile = client.files.upload(file="path/to/sample.mp3")
response = client.models.generate_content(
   model="gemini-2.0-flash", contents=["Describe this audio clip", myfile]
)
print(response.text)

JavaScript

 import {
 GoogleGenAI,
 createUserContent,
 createPartFromUri,
} from "@google/genai";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
async function main() {
 const myfile = await ai.files.upload({
   file: "path/to/sample.mp3",
   config: { mimeType: "audio/mp3" },
 });
 const response = await ai.models.generateContent({
   model: "gemini-2.0-flash",
   contents: createUserContent([
     createPartFromUri(myfile.uri, myfile.mimeType),
     "Describe this audio clip",
   ]),
 });
 console.log(response.text);
}
await main();

Go

 file, err := client.UploadFileFromPath(ctx, "path/to/sample.mp3", nil)
if err != nil {
   log.Fatal(err)
}
defer client.DeleteFile(ctx, file.Name)
model := client.GenerativeModel("gemini-2.0-flash")
resp, err := model.GenerateContent(ctx,
   genai.FileData{URI: file.URI},
   genai.Text("Describe this audio clip"))
if err != nil {
   log.Fatal(err)
}
printResponse(resp)

REST

 AUDIO_PATH="path/to/sample.mp3"
MIME_TYPE=$(file -b --mime-type "${AUDIO_PATH}")
NUM_BYTES=$(wc -c < "${AUDIO_PATH}")
DISPLAY_NAME=AUDIO
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload URL is in the response headers; dump them to a file.
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
  -D "${tmp_header_file}" \
  -H "X-Goog-Upload-Protocol: resumable" \
  -H "X-Goog-Upload-Command: start" \
  -H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
  -H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
  -H "Content-Type: application/json" \
  -d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
  -H "Content-Length: ${NUM_BYTES}" \
  -H "X-Goog-Upload-Offset: 0" \
  -H "X-Goog-Upload-Command: upload, finalize" \
  --data-binary "@${AUDIO_PATH}" 2> /dev/null > file_info.json
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
# Now generate content using that file.
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=${GOOGLE_API_KEY}" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{
      "contents": [{
        "parts":[
          {"text": "Describe this audio clip"},
          {"file_data":{"mime_type": "'${MIME_TYPE}'", "file_uri": '$file_uri'}}]
        }]
      }' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json

2. Pass audio data inline

Instead of uploading an audio file, you can pass inline audio data in the generateContent request:

Python

 from google import genai
from google.genai import types

client = genai.Client(api_key="GOOGLE_API_KEY")
with open('path/to/small-sample.mp3', 'rb') as f:
   audio_bytes = f.read()
response = client.models.generate_content(
 model='gemini-2.0-flash',
 contents=[
   'Describe this audio clip',
   types.Part.from_bytes(
     data=audio_bytes,
     mime_type='audio/mp3',
   )
 ]
)
print(response.text)

JavaScript

 import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const base64AudioFile = fs.readFileSync("path/to/small-sample.mp3", {
 encoding: "base64",
});
const contents = [
 { text: "Please summarize the audio." },
 {
   inlineData: {
     mimeType: "audio/mp3",
     data: base64AudioFile,
   },
 },
];
const response = await ai.models.generateContent({
 model: "gemini-2.0-flash",
 contents: contents,
});
console.log(response.text);

Go

 // Initialize a Gemini model appropriate for your use case.
model := client.GenerativeModel("gemini-2.0-flash")
bytes, err := os.ReadFile("path/to/small-sample.mp3")
if err != nil {
 log.Fatal(err)
}
prompt := []genai.Part{
 genai.Blob{MIMEType: "audio/mp3", Data: bytes},
 genai.Text("Please summarize the audio."),
}
// Generate content using the prompt.
resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
 log.Fatal(err)
}
// Handle the response of generated text
for _, c := range resp.Candidates {
 if c.Content != nil {
   fmt.Println(*c.Content)
 }
}

Note the following about inline audio data:

1. The total request size is capped at 20 MB, including text prompts, system instructions, and inline files. If your file would push the request past 20 MB, use the Files API to upload the audio file instead.

2. If you will reuse the same audio file across multiple requests, it is more efficient to upload it once with the Files API.
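As a rough sketch of that decision, the helper below picks inline data or the Files API based on file size. The function names and the 1 MB prompt headroom are assumptions for illustration, not part of the SDK:

```python
import os

# 20 MB cap on the whole request; leave headroom for the text prompt.
# Both constants are illustrative assumptions, not SDK values.
INLINE_LIMIT_BYTES = 20 * 1024 * 1024
PROMPT_HEADROOM_BYTES = 1 * 1024 * 1024

def should_use_files_api(audio_size_bytes: int) -> bool:
    """Return True if the audio is too large to embed inline."""
    return audio_size_bytes + PROMPT_HEADROOM_BYTES > INLINE_LIMIT_BYTES

def should_upload(audio_path: str) -> bool:
    """Same check, reading the file size from disk."""
    return should_use_files_api(os.path.getsize(audio_path))
```

A 5 MB clip fits comfortably inline, while a 25 MB recording would need to go through the Files API first.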

2. Obtain a transcript

To get a transcript of the audio data, simply ask for one in the prompt:

Python

 myfile = client.files.upload(file='path/to/sample.mp3')
prompt = 'Generate a transcript of the speech.'
response = client.models.generate_content(
 model='gemini-2.0-flash',
 contents=[prompt, myfile]
)
print(response.text)

JavaScript

 import {
 GoogleGenAI,
 createUserContent,
 createPartFromUri,
} from "@google/genai";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const myfile = await ai.files.upload({
 file: "path/to/sample.mp3",
 config: { mimeType: "audio/mpeg" },
});
const result = await ai.models.generateContent({
 model: "gemini-2.0-flash",
 contents: createUserContent([
   createPartFromUri(myfile.uri, myfile.mimeType),
   "Generate a transcript of the speech.",
 ]),
});
console.log("result.text=", result.text);

Go

 // Initialize a Gemini model appropriate for your use case.
model := client.GenerativeModel("gemini-2.0-flash")
// Create a prompt using text and the URI reference for the uploaded file.
prompt := []genai.Part{
 genai.FileData{URI: sampleAudio.URI},
 genai.Text("Generate a transcript of the speech."),
}
// Generate content using the prompt.
resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
 log.Fatal(err)
}
// Handle the response of generated text
for _, c := range resp.Candidates {
 if c.Content != nil {
   fmt.Println(*c.Content)
 }
}

3. Refer to timestamps

You can use timestamps of the form MM:SS to refer to specific parts of an audio file. For example, the following prompt requests a transcript that:

1. Starts 2 minutes 30 seconds from the beginning of the file.

2. Ends 3 minutes 29 seconds from the beginning of the file.

Python

 # Create a prompt containing timestamps.
prompt = "Provide a transcript of the speech from 02:30 to 03:29."

JavaScript

 // Create a prompt containing timestamps.
const prompt = "Provide a transcript of the speech from 02:30 to 03:29."

Go

 // Create a prompt containing timestamps.
prompt := []genai.Part{
   genai.FileData{URI: sampleAudio.URI},
   genai.Text("Provide a transcript of the speech from 02:30 to 03:29."),
}
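If you build such prompts programmatically, a small helper (illustrative only, not part of the SDK) can format second offsets in the MM:SS form the model expects:

```python
def mmss(seconds: int) -> str:
    """Format a second offset as MM:SS for use in a prompt."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"

# 150 s = 02:30 and 209 s = 03:29, the same range as the prompts above.
prompt = f"Provide a transcript of the speech from {mmss(150)} to {mmss(209)}."
```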

4. Count tokens

Call the countTokens method to get the number of tokens in an audio file. For example:

Python

 response = client.models.count_tokens(
 model='gemini-2.0-flash',
 contents=[myfile]
)
print(response)

JavaScript

 import {
 GoogleGenAI,
 createUserContent,
 createPartFromUri,
} from "@google/genai";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const myfile = await ai.files.upload({
 file: "path/to/sample.mp3",
 config: { mimeType: "audio/mpeg" },
});
const countTokensResponse = await ai.models.countTokens({
 model: "gemini-2.0-flash",
 contents: createUserContent([
   createPartFromUri(myfile.uri, myfile.mimeType),
 ]),
});
console.log(countTokensResponse.totalTokens);

Go

 tokens, err := model.CountTokens(ctx, genai.FileData{URI: sampleAudio.URI})
if err != nil {
   log.Fatal(err)
}
fmt.Printf("File %s is %d tokens\n", sampleAudio.DisplayName, tokens.TotalTokens)

5. Supported audio formats

Gemini supports the following audio format MIME types:

1. WAV - audio/wav

2. MP3 - audio/mp3

3. AIFF - audio/aiff

4. AAC - audio/aac

5. OGG Vorbis - audio/ogg

6. FLAC - audio/flac
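The JavaScript upload examples above pass an explicit mimeType. A small lookup, sketched here as a hypothetical helper built from the list above, can derive it from the file extension:

```python
import os

# Extension-to-MIME-type map built from the supported formats listed above.
AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}

def audio_mime_type(filename: str) -> str:
    """Return the MIME type for a supported audio file, or raise ValueError."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in AUDIO_MIME_TYPES:
        raise ValueError(f"Unsupported audio format: {ext}")
    return AUDIO_MIME_TYPES[ext]
```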

6. Technical details of audio

1. Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens.

2. Gemini can only infer responses to English-language speech.

3. Gemini can "understand" non-speech content, such as birdsong or sirens.

4. A single request supports at most 9.5 hours of audio. Gemini does not limit the number of audio files in a single request, but the combined duration of all audio files must not exceed 9.5 hours.

5. Gemini downsamples audio files to a 16 Kbps data resolution.

6. If an audio source contains multiple channels, Gemini combines them into a single channel.
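Combining the 32 tokens-per-second rate with the 9.5-hour cap gives a quick back-of-the-envelope estimate. This is a sketch only; use the countTokens method above for the authoritative count:

```python
TOKENS_PER_SECOND = 32          # per the rate stated above
MAX_AUDIO_SECONDS = 9.5 * 3600  # 9.5-hour cap per request

def estimate_audio_tokens(duration_seconds: float) -> int:
    """Rough token estimate for an audio clip of the given duration."""
    if duration_seconds > MAX_AUDIO_SECONDS:
        raise ValueError("audio exceeds the 9.5-hour per-request limit")
    return int(duration_seconds * TOKENS_PER_SECOND)

# One minute of audio -> 60 * 32 = 1,920 tokens, matching the figure above.
```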