
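How to Generate Text from Audio Data with the Gemini API

Author: LoRA  Date: April 22, 2025

Gemini can analyze and understand audio input, which enables the following use cases:

1. Describe, summarize, or answer questions about audio content.

2. Provide a transcription of the audio.

3. Analyze specific segments of the audio.

Before calling the Gemini API, make sure you have installed your chosen SDK and have a configured, ready-to-use Gemini API key.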

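I. Audio input

You can provide audio data to Gemini in the following ways:

Upload an audio file first, then make a request to generateContent.

Pass inline audio data in the request to generateContent.

1. Upload an audio file

You can upload audio files with the Files API. Always use the Files API when the total request size (including files, text prompts, system instructions, and so on) exceeds 20 MB.

The following code uploads an audio file and then uses it in a call to generateContent.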

Python

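from google import genai
# Create a client; replace the placeholder with your own API key.
client = genai.Client(api_key="GOOGLE_API_KEY")
# Upload the audio file through the Files API.
myfile = client.files.upload(file="path/to/sample.mp3")
# Reference the uploaded file alongside the text prompt.
response = client.models.generate_content(
    model="gemini-2.0-flash", contents=["Describe this audio clip", myfile]
)
print(response.text)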

JavaScript

import {
 GoogleGenAI,
 createUserContent,
 createPartFromUri,
} from "@google/genai";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
async function main() {
 const myfile = await ai.files.upload({
   file: "path/to/sample.mp3",
   config: { mimeType: "audio/mp3" },
 });
 const response = await ai.models.generateContent({
   model: "gemini-2.0-flash",
   contents: createUserContent([
     createPartFromUri(myfile.uri, myfile.mimeType),
     "Describe this audio clip",
   ]),
 });
 console.log(response.text);
}
await main();

Go

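// Assumes ctx and client were created earlier with the
// github.com/google/generative-ai-go/genai package.
file, err := client.UploadFileFromPath(ctx, "path/to/sample.mp3", nil)
if err != nil {
	log.Fatal(err)
}
// Remove the uploaded file once this function returns.
defer client.DeleteFile(ctx, file.Name)
model := client.GenerativeModel("gemini-2.0-flash")
resp, err := model.GenerateContent(ctx,
	genai.FileData{URI: file.URI},
	genai.Text("Describe this audio clip"))
if err != nil {
	log.Fatal(err)
}
// Print the content of each returned candidate.
for _, c := range resp.Candidates {
	if c.Content != nil {
		fmt.Println(*c.Content)
	}
}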

REST

AUDIO_PATH="path/to/sample.mp3"
MIME_TYPE=$(file -b --mime-type "${AUDIO_PATH}")
NUM_BYTES=$(wc -c < "${AUDIO_PATH}")
DISPLAY_NAME=AUDIO
tmp_header_file=upload-header.tmp
# Initial resumable request defining metadata.
# The upload URL is in the response headers, so dump them to a file.
curl "https://generativelanguage.googleapis.com/upload/v1beta/files?key=${GOOGLE_API_KEY}" \
  -D "${tmp_header_file}" \
  -H "X-Goog-Upload-Protocol: resumable" \
  -H "X-Goog-Upload-Command: start" \
  -H "X-Goog-Upload-Header-Content-Length: ${NUM_BYTES}" \
  -H "X-Goog-Upload-Header-Content-Type: ${MIME_TYPE}" \
  -H "Content-Type: application/json" \
  -d "{'file': {'display_name': '${DISPLAY_NAME}'}}" 2> /dev/null
# Extract the upload URL and strip the trailing carriage return.
upload_url=$(grep -i "x-goog-upload-url: " "${tmp_header_file}" | cut -d" " -f2 | tr -d "\r")
rm "${tmp_header_file}"
# Upload the actual bytes.
curl "${upload_url}" \
  -H "Content-Length: ${NUM_BYTES}" \
  -H "X-Goog-Upload-Offset: 0" \
  -H "X-Goog-Upload-Command: upload, finalize" \
  --data-binary "@${AUDIO_PATH}" 2> /dev/null > file_info.json
# jq keeps the surrounding double quotes, so file_uri is already a JSON string.
file_uri=$(jq ".file.uri" file_info.json)
echo file_uri=$file_uri
# Now generate content using that file.
# The single-quoted body is closed around each shell variable so that it expands.
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=$GOOGLE_API_KEY" \
    -H 'Content-Type: application/json' \
    -X POST \
    -d '{
      "contents": [{
        "parts":[
          {"text": "Describe this audio clip"},
          {"file_data":{"mime_type": "'"${MIME_TYPE}"'", "file_uri": '$file_uri'}}]
        }]
      }' 2> /dev/null > response.json
cat response.json
echo
jq ".candidates[].content.parts[].text" response.json

2. Pass audio data inline

Instead of uploading an audio file, you can pass inline audio data in the request to generateContent:

Python

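from google import genai
from google.genai import types
client = genai.Client(api_key="GOOGLE_API_KEY")
# Read the audio file as raw bytes.
with open('path/to/small-sample.mp3', 'rb') as f:
    audio_bytes = f.read()
# Send the bytes inline together with the text prompt.
response = client.models.generate_content(
    model='gemini-2.0-flash',
    contents=[
        'Describe this audio clip',
        types.Part.from_bytes(
            data=audio_bytes,
            mime_type='audio/mp3',
        )
    ]
)
print(response.text)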

JavaScript

import { GoogleGenAI } from "@google/genai";
import * as fs from "node:fs";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const base64AudioFile = fs.readFileSync("path/to/small-sample.mp3", {
 encoding: "base64",
});
const contents = [
 { text: "Please summarize the audio." },
 {
   inlineData: {
     mimeType: "audio/mp3",
     data: base64AudioFile,
   },
 },
];
const response = await ai.models.generateContent({
 model: "gemini-2.0-flash",
 contents: contents,
});
console.log(response.text);

Go

// Initialize a Gemini model appropriate for your use case.
model := client.GenerativeModel("gemini-2.0-flash")
// Read the audio file; the variable is not named "bytes" to avoid
// shadowing the standard library package of the same name.
audioBytes, err := os.ReadFile("path/to/small-sample.mp3")
if err != nil {
	log.Fatal(err)
}
prompt := []genai.Part{
	genai.Blob{MIMEType: "audio/mp3", Data: audioBytes},
	genai.Text("Please summarize the audio."),
}
// Generate content using the prompt.
resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
 log.Fatal(err)
}
// Handle the response of generated text
for _, c := range resp.Candidates {
 if c.Content != nil {
   fmt.Println(*c.Content)
 }
}

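Keep the following in mind about inline audio data:

1. The maximum request size is 20 MB, which includes text prompts, system instructions, and inline files. If a file would push the total request size over 20 MB, upload the audio file with the Files API and reference it in the request.

2. If you use an audio clip multiple times, uploading the audio file is more efficient.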
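
The two notes above can be turned into a small pre-flight check before building a request. The following is a minimal sketch, not part of any SDK: the helper name `choose_audio_transport` is hypothetical, and reading "20 MB" as 20 × 1024 × 1024 bytes of raw (unencoded) data is an assumption, since the exact accounting is not stated here.

```python
# Assumption: "20 MB" is read as 20 * 1024 * 1024 bytes of raw data.
MAX_INLINE_REQUEST_BYTES = 20 * 1024 * 1024


def choose_audio_transport(audio_size_bytes, other_request_bytes=0, reused=False):
    """Return "files_api" or "inline" for an audio payload.

    other_request_bytes covers text prompts, system instructions, and so on.
    reused marks audio that will be sent in more than one request.
    """
    if audio_size_bytes < 0 or other_request_bytes < 0:
        raise ValueError("sizes must be non-negative")
    # Audio that is used several times is better uploaded once.
    if reused:
        return "files_api"
    total = audio_size_bytes + other_request_bytes
    # Only requests that stay within the limit can carry inline audio.
    return "files_api" if total > MAX_INLINE_REQUEST_BYTES else "inline"


if __name__ == "__main__":
    print(choose_audio_transport(3 * 1024 * 1024))
    print(choose_audio_transport(25 * 1024 * 1024))
```

In practice the first argument could come from `os.path.getsize("path/to/sample.mp3")`. Leaving some headroom below the limit may be sensible, because inline data can grow when it is encoded for transport.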

II. Getting a transcript

To get a transcript of audio data, simply ask for it in the prompt:

Python

# Assumes `client` was created as in the upload example.
myfile = client.files.upload(file='path/to/sample.mp3')
prompt = 'Generate a transcript of the speech.'
response = client.models.generate_content(
 model='gemini-2.0-flash',
 contents=[prompt, myfile]
)
print(response.text)

JavaScript

import {
 GoogleGenAI,
 createUserContent,
 createPartFromUri,
} from "@google/genai";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const myfile = await ai.files.upload({
 file: "path/to/sample.mp3",
 config: { mimeType: "audio/mpeg" },
});
const result = await ai.models.generateContent({
 model: "gemini-2.0-flash",
 contents: createUserContent([
   createPartFromUri(myfile.uri, myfile.mimeType),
   "Generate a transcript of the speech.",
 ]),
});
console.log("result.text=", result.text);

Go

// Initialize a Gemini model appropriate for your use case.
model := client.GenerativeModel("gemini-2.0-flash")
// Create a prompt using text and the URI reference for the uploaded file.
// sampleAudio is assumed to be the file returned by client.UploadFileFromPath.
prompt := []genai.Part{
 genai.FileData{URI: sampleAudio.URI},
 genai.Text("Generate a transcript of the speech."),
}
// Generate content using the prompt.
resp, err := model.GenerateContent(ctx, prompt...)
if err != nil {
 log.Fatal(err)
}
// Handle the response of generated text
for _, c := range resp.Candidates {
 if c.Content != nil {
   fmt.Println(*c.Content)
 }
}

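III. Referring to timestamps

You can refer to specific sections of an audio file using timestamps of the form MM:SS. For example, the following prompt requests a transcript that:

1. Starts at 2 minutes 30 seconds from the beginning of the file.

2. Ends at 3 minutes 29 seconds from the beginning of the file.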

Python

# Create a prompt containing timestamps.
prompt = "Provide a transcript of the speech from 02:30 to 03:29."

JavaScript

// Create a prompt containing timestamps.
const prompt = "Provide a transcript of the speech from 02:30 to 03:29.";

Go

// Create a prompt containing timestamps.
prompt := []genai.Part{
   genai.FileData{URI: sampleAudio.URI},
   genai.Text("Provide a transcript of the speech from 02:30 to 03:29."),
}
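
When the segment boundaries are computed rather than typed by hand, the MM:SS strings can be generated. The sketch below is illustrative only: `format_timestamp` and `build_transcript_prompt` are hypothetical helper names, and the prompt wording simply mirrors the example above.

```python
def format_timestamp(total_seconds):
    """Format a non-negative number of whole seconds as MM:SS."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    minutes, seconds = divmod(int(total_seconds), 60)
    return f"{minutes:02d}:{seconds:02d}"


def build_transcript_prompt(start_seconds, end_seconds):
    """Build a transcript request for the span [start_seconds, end_seconds]."""
    if end_seconds <= start_seconds:
        raise ValueError("end_seconds must be greater than start_seconds")
    start = format_timestamp(start_seconds)
    end = format_timestamp(end_seconds)
    return f"Provide a transcript of the speech from {start} to {end}."


if __name__ == "__main__":
    # 150 s is 2 min 30 s; 209 s is 3 min 29 s.
    print(build_transcript_prompt(150, 209))
```

For offsets of an hour or more this helper keeps counting minutes (for example `75:00`); whether the model reads such values as intended is not covered here and would need to be checked.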

IV. Counting tokens

Call the countTokens method to get the number of tokens in an audio file. For example:

Python

# Assumes `client` and `myfile` come from the upload example.
response = client.models.count_tokens(
    model='gemini-2.0-flash',
    contents=[myfile]
)
print(response)

JavaScript

import {
 GoogleGenAI,
 createUserContent,
 createPartFromUri,
} from "@google/genai";
const ai = new GoogleGenAI({ apiKey: "GOOGLE_API_KEY" });
const myfile = await ai.files.upload({
 file: "path/to/sample.mp3",
 config: { mimeType: "audio/mpeg" },
});
const countTokensResponse = await ai.models.countTokens({
 model: "gemini-2.0-flash",
 contents: createUserContent([
   createPartFromUri(myfile.uri, myfile.mimeType),
 ]),
});
console.log(countTokensResponse.totalTokens);

Go

// sampleAudio is assumed to be the file returned by client.UploadFileFromPath.
tokens, err := model.CountTokens(ctx, genai.FileData{URI: sampleAudio.URI})
if err != nil {
	log.Fatal(err)
}
fmt.Printf("File %s is %d tokens\n", sampleAudio.DisplayName, tokens.TotalTokens)

V. Supported audio formats

Gemini supports the following audio format MIME types:

1. WAV - audio/wav

2. MP3 - audio/mp3

3. AIFF - audio/aiff

4. AAC - audio/aac

5. OGG Vorbis - audio/ogg

6. FLAC - audio/flac
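
A lookup table makes it easy to reject unsupported files before a request is sent. The sketch below copies the MIME types from the list above; the mapping from file extensions to those types and the helper name `audio_mime_type` are assumptions made for illustration, since only the formats and MIME types are listed here.

```python
from pathlib import Path

# MIME types copied from the list above; the extension keys are an assumption.
SUPPORTED_AUDIO_MIME_TYPES = {
    ".wav": "audio/wav",
    ".mp3": "audio/mp3",
    ".aiff": "audio/aiff",
    ".aac": "audio/aac",
    ".ogg": "audio/ogg",
    ".flac": "audio/flac",
}


def audio_mime_type(path):
    """Return the MIME type for a supported audio file, based on its extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in SUPPORTED_AUDIO_MIME_TYPES:
        raise ValueError(f"unsupported audio extension: {suffix!r}")
    return SUPPORTED_AUDIO_MIME_TYPES[suffix]


if __name__ == "__main__":
    print(audio_mime_type("path/to/sample.mp3"))
```

The returned string can be passed wherever the examples in this article set a MIME type by hand, such as the `mime_type` argument of `types.Part.from_bytes`.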

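VI. Technical details about audio

1. Gemini represents each second of audio as 32 tokens; for example, one minute of audio is represented as 1,920 tokens.

2. Gemini can only infer responses to English-language speech.

3. Gemini can "understand" non-speech content, such as birdsong or sirens.

4. The maximum supported length of audio data in a single prompt is 9.5 hours. Gemini does not limit the number of audio files in a single prompt; however, the combined length of all audio files in a single prompt must not exceed 9.5 hours.

5. Gemini downsamples audio files to a 16 Kbps data resolution.

6. If the audio source contains multiple channels, Gemini combines them into a single channel.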
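
The token rate and the duration limit above lend themselves to a quick estimate before any API call is made. This is a rough sketch using only the figures stated in the list (32 tokens per second, 9.5 hours per prompt); the function names are hypothetical, rounding fractional seconds up is an assumption, and the countTokens method remains the authoritative source for real counts.

```python
import math

TOKENS_PER_SECOND = 32                   # figure stated in the list above
MAX_AUDIO_SECONDS = int(9.5 * 60 * 60)   # 9.5 hours = 34,200 seconds


def estimate_audio_tokens(duration_seconds):
    """Estimate the token count for audio of the given length in seconds."""
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    # Assumption: partial seconds are rounded up to a whole token count.
    return math.ceil(duration_seconds * TOKENS_PER_SECOND)


def fits_in_one_prompt(durations_seconds):
    """Check that the combined length of all clips stays within 9.5 hours."""
    return sum(durations_seconds) <= MAX_AUDIO_SECONDS


if __name__ == "__main__":
    print(estimate_audio_tokens(60))        # one minute of audio
    print(fits_in_one_prompt([3600] * 10))  # ten one-hour clips
```

One minute of audio gives 60 × 32 = 1,920 tokens, matching the example in the list, while ten one-hour clips total 36,000 seconds and exceed the 34,200-second limit.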
