[GH-ISSUE #15427] Any documentation for Audio? #9860

Open
opened 2026-04-12 22:43:17 -05:00 by GiteaMirror · 6 comments

Originally created by @VistritPandey on GitHub (Apr 8, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/15427

I see there is an `audio` tag on Gemma4, but I couldn't find any documentation. Any example would be great!

ref link: https://ollama.com/library/gemma4

GiteaMirror added the feature request label 2026-04-12 22:43:17 -05:00

@rick-github commented on GitHub (Apr 8, 2026):

CLI:

$ ollama run gemma4:e4b transcribe "./now is the time for all good men to jump over the quick brown fox.wav"
Added audio './now is the time for all good men to jump over the quick brown fox.wav'
Now it's time for all good men to jump over the quick brown fox.

API:

$ (echo -n '{
  "model":"gemma4:e4b",
  "stream":false,
  "messages":[{
    "role":"user",
    "content":"transcribe",
    "images":["' ; base64 -w0 "now is the time for all good men to jump over the quick brown fox.wav" ; echo '"]
  }]}') | curl -s localhost:11434/api/chat -d @- | jq
{
  "model": "gemma4:e4b",
  "created_at": "2026-04-03T17:15:38.79946058Z",
  "message": {
    "role": "assistant",
    "content": "Now it's time for all good men to jump over the quick brown fox."
  },
  "done": true,
  "done_reason": "stop",
  "total_duration": 700613958,
  "load_duration": 514170935,
  "prompt_eval_count": 121,
  "prompt_eval_duration": 5653482,
  "eval_count": 18,
  "eval_duration": 83928687
}
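
The same request can be assembled in Python with only the standard library. This is a minimal sketch, assuming the payload shape shown in the curl example above (audio base64-encoded in the `images` field); the function names `build_chat_payload` and `chat_with_audio` are hypothetical and not part of any Ollama library.

```python
import base64
import json
import urllib.request


def build_chat_payload(audio_bytes, prompt="transcribe", model="gemma4:e4b"):
    """Build the /api/chat request body with the audio base64-encoded in "images"."""
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "stream": False,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
    }


def chat_with_audio(path, prompt="transcribe", host="http://localhost:11434"):
    """POST the payload to a local Ollama server and return the reply text."""
    with open(path, "rb") as f:
        payload = build_chat_payload(f.read(), prompt)
    request = urllib.request.Request(
        host + "/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["message"]["content"]
```

Calling `chat_with_audio("speech.wav")` needs a running server with the model pulled; `build_chat_payload` can be checked offline.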

Python:

import os
import sys

import ollama
from openai import OpenAI

audio_path = sys.argv[1]
model = os.getenv("MODEL", "gemma4:e4b")

def openai_api(path):
    # Use Ollama's OpenAI-compatible endpoint under /v1.
    client = OpenAI(base_url=os.getenv("OLLAMA_HOST", "http://localhost:11434") + "/v1", api_key="ollama")
    # Open the file in a context manager so the handle is closed after the upload.
    with open(path, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
            model=model,
            file=audio_file
        )
    return transcription.text

def ollama_api(path):
    # As in the CLI and API examples, the audio goes in the "images" field.
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": "transcribe", "images": [path]}])
    return response.message.content

print(openai_api(audio_path))
print(ollama_api(audio_path))

@marcof-nikogo commented on GitHub (Apr 9, 2026):

Hi,
it works, but it seems to be limited to 60 seconds of audio.
Would it work to chunk the file with a few seconds of overlap between chunks?
That is, how do I give it enough context?
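
For reference, overlapping chunks can be cut with Python's standard `wave` module. This is only a sketch of the chunking idea from the question: the 30 second chunk length and 2 second overlap are arbitrary assumptions chosen to stay under the limit observed above, `split_wav` is a hypothetical helper, and whether the overlap actually improves the transcription is untested here.

```python
import io
import wave


def split_wav(data, chunk_seconds=30.0, overlap_seconds=2.0):
    """Split WAV bytes into a list of WAV byte strings that overlap slightly."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    frame_size = params.sampwidth * params.nchannels
    total_frames = len(frames) // frame_size
    chunk_frames = int(chunk_seconds * params.framerate)
    # Each chunk starts this many frames after the previous one.
    step_frames = max(1, int((chunk_seconds - overlap_seconds) * params.framerate))
    chunks = []
    start = 0
    while start < total_frames:
        end = min(start + chunk_frames, total_frames)
        buffer = io.BytesIO()
        with wave.open(buffer, "wb") as dst:
            dst.setnchannels(params.nchannels)
            dst.setsampwidth(params.sampwidth)
            dst.setframerate(params.framerate)
            dst.writeframes(frames[start * frame_size:end * frame_size])
        chunks.append(buffer.getvalue())
        if end == total_frames:
            break
        start += step_frames
    return chunks
```

Each returned item is a complete WAV file in memory, so it can be base64-encoded and sent the same way as a whole file.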


@Nikorasu commented on GitHub (Apr 10, 2026):

So is it just for transcription? Or can the LLM answer questions about a portion of audio?


@JoeLoginIsAlreadyTaken commented on GitHub (Apr 10, 2026):

> Hi, it works, but it seems to be limited to 60 seconds of audio. Would it work to chunk the file with a few seconds of overlap between chunks? That is, how do I give it enough context?

I think the best way would be some kind of voice activity detection (VAD) to detect longer gaps in the audio.
This usually works quite well for splitting at sentence boundaries, or at least at longer pauses.
It should even work with the Web Audio API using a filter.
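
To illustrate the gap-detection idea, here is a naive energy-threshold sketch in Python. It is not a full VAD: it only marks stretches where the RMS level stays below a threshold. The function name `find_pauses` and all default values (30 ms frames, threshold of 500, 300 ms minimum pause) are assumptions for illustration, and `samples` is assumed to be a sequence of signed 16-bit PCM values from a mono recording.

```python
import math


def find_pauses(samples, sample_rate, frame_ms=30, threshold=500.0, min_pause_ms=300):
    """Return (start, end) sample indices of quiet stretches at least min_pause_ms long."""
    frame_len = max(1, sample_rate * frame_ms // 1000)
    min_frames = max(1, math.ceil(min_pause_ms / frame_ms))
    pauses = []
    quiet_start = None  # index of the first frame in the current quiet run
    n_frames = len(samples) // frame_len
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms < threshold:
            if quiet_start is None:
                quiet_start = i
        else:
            # A loud frame ends the quiet run; keep the run if it was long enough.
            if quiet_start is not None and i - quiet_start >= min_frames:
                pauses.append((quiet_start * frame_len, i * frame_len))
            quiet_start = None
    # Handle a quiet run that lasts until the end of the recording.
    if quiet_start is not None and n_frames - quiet_start >= min_frames:
        pauses.append((quiet_start * frame_len, n_frames * frame_len))
    return pauses
```

Cutting at the midpoint of each returned pause would give chunks that end between utterances rather than in the middle of a word. The threshold needs tuning for the recording level and background noise.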


@rick-github commented on GitHub (Apr 10, 2026):

> So is it just for transcription? Or can the LLM answer questions about a portion of audio?

$ ollama run gemma4:e4b tell me about this audio /tmp/ask-not-what-your-country-can-do-for-you-ask-what-you-can-do-for-your-country.wav 
Added audio '/tmp/ask-not-what-your-country-can-do-for-you-ask-what-you-can-do-for-your-country.wav'
Thinking...
Here's a thinking process that leads to the suggested response:

1.  **Analyze the Request:** The user provided an audio snippet (or the text transcription of 
one, which is a series of calls to action: "Ask not what your country can do for you. Ask what 
you can do for your country.") and asked me to "tell me about this audio."

2.  **Analyze the Content (The Audio/Quote):**
    *   *The phrase:* "Ask not what your country can do for you. Ask what you can do for your 
country."
    *   *Origin:* This is one of the most famous quotes in American history.
    *   *Speaker:* John F. Kennedy (JFK).
    *   *Context:* It was delivered during his Inaugural Address.
    *   *Meaning/Theme:* It's a call to civic duty, self-reliance, and active participation in 
society. It emphasizes contribution rather than entitlement.

3.  **Determine the Goal:** The response must be informative, accurate, and engaging. It needs 
to explain the *meaning*, *origin*, and *impact* of the quote.

4.  **Structure the Response:**
    *   **Acknowledge/Identify:** Start by recognizing the quote and its historical 
significance. (This hooks the user.)
    *   **Provide Context (The "Who, What, When"):** Identify the speaker, the event, and the 
date. (Crucial for accuracy.)
    *   **Explain the Meaning/Significance (The "Why"):** Break down the message. What is the 
quote *really* saying? (Focus on active citizenship, contribution, and duty.)
    *   **Discuss Impact/Legacy (The "So what?"):** Why does this quote resonate so much? 
(It's timeless, inspiring, and a call to action.)
    *   **Tone Check:** The tone should be knowledgeable, appreciative, and inspiring.

5.  **Drafting - Step-by-Step (Self-Correction/Refinement):**

    *   *(Initial thought for Context):* JFK said it at the start of his presidency. 
*Correction:* Be more specific—it was his Inaugural Address, a monumental speech.
    *   *(Initial thought for Meaning):* It means help out. *Correction:* Be more articulate. 
It means shifting focus from *receiving* to *contributing*; it's about *active citizenship*.
    *   *(Refining the Tone):* Don't just state facts; convey the weight and power of the 
message.

6.  **Final Polish (The resulting output):** (This matches the detailed, multi-faceted 
response provided in the good example.)

    *(Self-check against the user's prompt):* Did I "tell them about the audio"? Yes, I 
provided historical context, meaning, and impact.

7.  **Final Output Generation.** (This leads to the comprehensive answer.)
...done thinking.

This is one of the most famous and impactful quotes in American history. The phrase you've 
provided—"Ask not what your country can do for you. Ask what you can do for your country"—was 
delivered by **John F. Kennedy** during his **Inaugural Address** on January 20, 1961.

Here is a detailed breakdown of the quote, its context, and its significance.

### 📜 Context: JFK's Inaugural Address

The quote was delivered during a time of immense global tension. The early 1960s were marked 
by the Cold War, the looming threat of nuclear conflict, and significant social upheaval (the 
Civil Rights movement, Vietnam tensions, etc.). The nation needed a call to renewed purpose 
and civic duty.

JFK's speech was a masterful piece of political rhetoric that transcended partisan politics. 
It wasn't just a promise of his administration's plans; it was a moral challenge to the 
American people.

### ✨ The Meaning of the Quote

At its core, the quote is a powerful shift in perspective, moving from **entitlement** to 
**contribution**.

*   **"Ask not what your country can do for you..."**: This phrase challenges the idea of a 
relationship where the citizen is merely a passive recipient of benefits—of government 
handouts, economic security, or services. It rejects the mindset of waiting for others to 
solve your problems.
*   **"Ask what you can do for your country."**: This is the call to action. It requires 
individuals to look inward and assess their unique skills, passions, and resources. It 
emphasizes that one's primary role is not to wait for rescue, but to *participate* in the 
improvement of society.

In simple terms, it is a message of **active citizenship** and **social responsibility.**

### 🌟 Its Significance and Impact

The quote is incredibly enduring for several reasons:

**1. A Call to Civic Duty:** It has become a touchstone phrase for civic responsibility. It 
reminds people that democracy is not a passive state of being; it is an active process that 
requires the participation and effort of its people.

**2. A Timeless Message:** While it was delivered in the context of the Cold War, its message 
is universal. It applies to any society, in any era, that needs a collective push toward 
bettering itself.

**3. Inspiring Action:** The quote has been widely used by leaders, activists, and 
motivational speakers because it is inherently motivating. It forces the listener to think, 
"What *can* I do?" rather than, "What *will* be done for me?"

### 💡 Summary

This quote is much more than a piece of rhetoric; it is a philosophical challenge. It is a 
reminder that the strength of a nation does not come from its resources or its leaders, but 
from the willingness of its people to get involved and contribute. It is a rallying cry for 
participation and self-determination.

@JoeLoginIsAlreadyTaken commented on GitHub (Apr 10, 2026):

[web_audio_transcribe.html](https://github.com/user-attachments/files/26639679/web_audio_transcribe.html)
Here is a quick and dirty (vibe-coded) example that uses the Web Audio API with VAD to automatically chunk the audio at gaps.
The chunks are then sent to Ollama and transcribed.
It works really nicely. Tested on Firefox.
Make sure to set the Ollama endpoint and the model name properly.

Reference: github-starred/ollama#9860