[GH-ISSUE #11236] Trying to get Ollama models to work with Macbook Pro M4 Pro, but keep getting response: {"done_reason":"load"} #7401

Closed
opened 2026-04-12 19:29:23 -05:00 by GiteaMirror · 6 comments

Originally created by @Pbot64 on GitHub (Jun 29, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11236

What is the issue?

Hi,

I'm trying to get this project working https://github.com/asiff00/On-Device-Speech-to-Speech-Conversational-AI?tab=readme-ov-file on my MacBook Pro M4 Pro (OS Version 15.5 (24F74)).

I've tried phi3, mistral, and llama3.2, and I keep getting the same response: done_reason:"load".
I've used the CLI version and the macOS version 0.9.3, and I even tried the v0.9.4 pre-release for macOS (https://github.com/ollama/ollama/releases/tag/v0.9.4-rc2).

Here is my entire .env (besides my Hugging Face token):
VOICE_NAME=af_heart
SPEED=0.9
export ESPEAK_PATH=/usr/local/bin/espeak

# LLM settings

LM_STUDIO_URL=http://localhost:11434/v1
OLLAMA_URL=http://localhost:11434/api/generate
DEFAULT_SYSTEM_PROMPT=You are a friendly, helpful, and intelligent assistant. Begin your responses with phrases like 'Umm,' 'So,' or similar. Focus on the user query and reply directly to the user in the first person ('I'), responding promptly and naturally. Do not include any additional information or context in your responses.
MAX_TOKENS=512
NUM_THREADS=2
LLM_TEMPERATURE=0.9
LLM_STREAM=true
LLM_RETRY_DELAY=0.5
MAX_RETRIES=3

# Model names

VAD_MODEL=pyannote/segmentation-3.0
WHISPER_MODEL=openai/whisper-tiny.en
LLM_MODEL=mistral
TTS_MODEL=kokoro.pth

# VAD settings

VAD_MIN_DURATION_ON=0.1
VAD_MIN_DURATION_OFF=0.1

# Audio settings

CHUNK=256
FORMAT=pyaudio.paFloat32
CHANNELS=1
RATE=16000
OUTPUT_SAMPLE_RATE=24000
RECORD_DURATION=5
SILENCE_THRESHOLD=0.001
INTERRUPTION_THRESHOLD=0.01
MAX_SILENCE_DURATION=1
SPEECH_CHECK_TIMEOUT=0.1
SPEECH_CHECK_THRESHOLD=0.02
ROLLING_BUFFER_TIME=0.5
TARGET_SIZE = 25
PLAYBACK_DELAY = 0.001
FIRST_SENTENCE_SIZE = 2

I am stuck. Totally and completely stuck. I feel like I've tried everything to get this kokoro conversational project working, but I'm stuck at the LLM integration. Help would be very much appreciated! Copilot says it's a bug with Ollama trying to work with Apple Metal. Please...

Relevant log output

curl http://localhost:11434/api/generate -d '{
  "model": "phi3",    
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say hello!"}
  ],
  "stream": false
}'
{"model":"phi3","created_at":"2025-06-29T20:25:26.312542Z","response":"","done":true,"done_reason":"load"}%

OS

macOS

GPU

Apple

CPU

Apple

Ollama version

0.9.3

GiteaMirror added the bug label 2026-04-12 19:29:23 -05:00

@rick-github commented on GitHub (Jun 29, 2025):

If you want to use messages[], you need to use the /api/chat endpoint (https://github.com/ollama/ollama/blob/main/docs/api.md#generate-a-chat-completion).
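
As a rough illustration of the difference between the two endpoints: /api/generate takes a single `prompt` string, while /api/chat takes a `messages` list. The sketch below only builds the request bodies and sends nothing; the helper names `build_generate_payload` and `build_chat_payload` are made up for this example and are not part of Ollama or the project. A /api/generate request that carries no `prompt`, like the curl call in the report, appears to just load the model, which would be consistent with the empty `response` and `done_reason: "load"`.

```python
import json


def build_generate_payload(model, prompt, system=None, stream=False):
    """Body for POST /api/generate: one prompt string, optional system prompt."""
    payload = {"model": model, "prompt": prompt, "stream": stream}
    if system is not None:
        payload["system"] = system
    return payload


def build_chat_payload(model, messages, stream=False):
    """Body for POST /api/chat: a list of {"role", "content"} messages."""
    return {"model": model, "messages": list(messages), "stream": stream}


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say hello!"},
]

# Same conversation expressed for each endpoint.
chat_body = build_chat_payload("phi3", messages)
generate_body = build_generate_payload(
    "phi3", "Say hello!", system="You are a helpful assistant."
)

print(json.dumps(chat_body, indent=2))
print(json.dumps(generate_body, indent=2))
```

Either body can then be posted to its matching URL; what matters is that the URL and the body shape agree.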


@Pbot64 commented on GitHub (Jun 29, 2025):

Tried using that endpoint. It's still not working. I get this "cursh>", which is strange.


@rick-github commented on GitHub (Jun 29, 2025):

$ curl http://localhost:11434/api/chat -d '{
  "model": "phi3",    
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Say hello!"}
  ],
  "stream": false
}'
{"model":"phi3","created_at":"2025-06-29T21:48:31.650657348Z","message":{"role":"assistant","content":"Hello there! How can I help you today?"},"done_reason":"stop","done":true,"total_duration":39757122491,"load_duration":39529064468,"prompt_eval_count":25,"prompt_eval_duration":153800265,"eval_count":11,"eval_duration":73081385}

@Pbot64 commented on GitHub (Jun 29, 2025):

Appreciate the help. I was just serving the wrong model, but I still can't generate any sound. This is what it gives me:

eSpeak path set to: /opt/homebrew/bin/espeak
This is the kokoro.py you’re actually running

Initializing Whisper model...

Initializing Voice Activity Detection...
Registered checkpoint save hook for _speechbrain_save
Registered checkpoint load hook for _speechbrain_load
Registered checkpoint save hook for save
Registered checkpoint load hook for load
Registered checkpoint save hook for _save
Registered checkpoint load hook for _recover

=== Voice Chat Bot Initializing ===
Device being used: cpu

Initializing voice generator...
Loaded voice: af_heart

Warming up the LLM model...
[DEBUG] Checking Ollama health at http://localhost:11434
[DEBUG] Ollama health status: 200
[DEBUG] Calling get_ai_response for warmup...
[DEBUG] Warmup get_ai_response returned: <generator object get_ai_response at 0x1921a1770>

=== Voice Chat Bot Ready ===
The bot is now listening for speech.
Just start speaking, and I'll respond automatically!
You can interrupt me anytime by starting to speak.
[DEBUG] About to record audio...

Listening... (Press Ctrl+C to stop)

Potential speech detected...
Processing speech segment...
🔈 Captured audio: <class 'numpy.ndarray'> (93184,)
📊 Audio stats: -0.10005563 0.15666205 -5.5727873e-05
[DEBUG] Audio data is not None, proceeding to VAD...
[DEBUG] Detecting speech segments...
[DEBUG] speech_segments: tensor([ 8.9648e-05, 4.7134e-05, 1.0209e-04, ..., -6.8553e-04,
-6.7169e-04, -5.9594e-04])
[DEBUG] VAD found speech segments, proceeding to transcription...

Transcribing detected speech...
🎛️ Input features shape: torch.Size([1, 80, 3000])
🧠 Generated token IDs: tensor([[18435, 11, 314, 716, 257, 1692, 852, 13, 1867, 318,
534, 1438, 30]])
📝 Final transcription: Hello, I am a human being. What is your name?
🔊 Audio data shape: (65070,)
📝 Final transcription list: Hello, I am a human being. What is your name?
[DEBUG] Transcribed user_input: Hello, I am a human being. What is your name?
[DEBUG] Transcription not empty, sending to LLM...
🧠 Sending to LLM: Hello, I am a human being. What is your name?
You (voice): Hello, I am a human being. What is your name?

Thinking...
[DEBUG] Calling process_input from voice input...
[DEBUG] process_input called with user_input: Hello, I am a human being. What is your name?
[DEBUG] messages before append: [{'role': 'system', 'content': "You are a friendly, helpful, and intelligent assistant. Begin your responses with phrases like 'Umm,' 'So,' or similar. Focus on the user query and reply directly to the user in the first person ('I'), responding promptly and naturally. Do not include any additional information or context in your responses."}, {'role': 'user', 'content': ' Hello, I am a human being. What is your name?'}]
[DEBUG] messages after append: [{'role': 'system', 'content': "You are a friendly, helpful, and intelligent assistant. Begin your responses with phrases like 'Umm,' 'So,' or similar. Focus on the user query and reply directly to the user in the first person ('I'), responding promptly and naturally. Do not include any additional information or context in your responses."}, {'role': 'user', 'content': ' Hello, I am a human being. What is your name?'}, {'role': 'user', 'content': ' Hello, I am a human being. What is your name?'}]

Thinking...
[DEBUG] Calling get_ai_response with model: phi3, url: http://localhost:11434/api/chat
[DEBUG] messages_to_send: [{'role': 'system', 'content': "You are a friendly, helpful, and intelligent assistant. Begin your responses with phrases like 'Umm,' 'So,' or similar. Focus on the user query and reply directly to the user in the first person ('I'), responding promptly and naturally. Do not include any additional information or context in your responses."}, {'role': 'user', 'content': ' Hello, I am a human being. What is your name?'}]
[DEBUG] get_ai_response returned: <generator object get_ai_response at 0x193431e70>
[DEBUG][get_ai_response] Payload to LLM: {'model': 'phi3', 'messages': [{'role': 'system', 'content': "You are a friendly, helpful, and intelligent assistant. Begin your responses with phrases like 'Umm,' 'So,' or similar. Focus on the user query and reply directly to the user in the first person ('I'), responding promptly and naturally. Do not include any additional information or context in your responses."}, {'role': 'user', 'content': ' Hello, I am a human being. What is your name?'}], 'options': {'num_ctx': 1024, 'num_thread': 2}, 'stream': False}
[DEBUG][get_ai_response] Raw response status: 200
[DEBUG][get_ai_response] Non-streaming JSON: {'model': 'phi3', 'created_at': '2025-06-29T22:32:32.485973Z', 'message': {'role': 'assistant', 'content': 'Umm, my designation as an AI assistant doesn't have a personal name like humans do, but you can call me "Aiden" if that helps! How may I assist you today? So tell me about yourself without revealing any sensitive information.'}, 'done_reason': 'stop', 'done': True, 'total_duration': 1538489125, 'load_duration': 538388042, 'prompt_eval_count': 95, 'prompt_eval_duration': 286457208, 'eval_count': 55, 'eval_duration': 710626375}
[DEBUG][get_ai_response] No response text in non-streaming reply.

Audio Generation Complete - Processed: 0, Generated: 0, Failed: 0

Timing Chart:
Event | Time (s) | Δ+

User stopped speaking | 0.00 | 0.00
VAD started | 0.00 | 0.00
End-to-end response | 2.41 | 2.41
[DEBUG] process_input completed successfully.
[DEBUG] process_input returned: was_interrupted=False, speech_data=None
LLM and TTS response completed. was_interrupted = False
[DEBUG] About to record audio...

Listening... (Press Ctrl+C to stop)


@rick-github commented on GitHub (Jun 29, 2025):

[DEBUG][get_ai_response] Non-streaming JSON: {'model': 'phi3', 'created_at': '2025-06-29T22:32:32.485973Z', 'message': {'role': 'assistant', 'content': 'Umm, my designation as an AI assistant doesn't have a personal name like humans do, but you can call me "Aiden" if that helps! How may I assist you today? So tell me about yourself without revealing any sensitive information.'}, 'done_reason': 'stop', 'done': True, 'total_duration': 1538489125, 'load_duration': 538388042, 'prompt_eval_count': 95, 'prompt_eval_duration': 286457208, 'eval_count': 55, 'eval_duration': 710626375}
[DEBUG][get_ai_response] No response text in non-streaming reply.

Ollama is returning a response (the text is in message.content), but the project is not recognizing it. You probably need to follow up on https://github.com/asiff00/On-Device-Speech-to-Speech-Conversational-AI/issues/12. So far this doesn't look like an Ollama issue.
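
The project's parsing code is not shown in this thread, so the following is only a guess at the kind of mismatch involved: /api/generate puts the text in a top-level `response` field, while /api/chat nests it under `message.content`. A minimal sketch of a reader that accepts both shapes, using a hypothetical helper name `extract_reply_text`:

```python
def extract_reply_text(data):
    """Return the assistant text from an Ollama reply dict, or "" if absent."""
    # /api/chat: the text is nested under message.content.
    message = data.get("message")
    if isinstance(message, dict) and message.get("content"):
        return message["content"]
    # /api/generate: the text is in the top-level "response" field.
    return data.get("response") or ""


# Trimmed copies of the two replies quoted in this thread.
chat_reply = {
    "model": "phi3",
    "message": {
        "role": "assistant",
        "content": "Hello there! How can I help you today?",
    },
    "done_reason": "stop",
    "done": True,
}
generate_reply = {
    "model": "phi3",
    "response": "",
    "done": True,
    "done_reason": "load",
}

print(repr(extract_reply_text(chat_reply)))
print(repr(extract_reply_text(generate_reply)))
```

Code that only looks at `response` would report an empty reply for the chat-style JSON in the log above, even though the text is present.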


@Pbot64 commented on GitHub (Jun 29, 2025):

Now I'm getting this:
Warming up the LLM model...
[DEBUG] Checking Ollama health at http://localhost:11434
[DEBUG] Ollama health status: 200
[DEBUG] Calling get_ai_response for warmup...
[LLM ERROR] Failed to get response from Ollama: Extra data: line 2 column 1 (char 118)
[DEBUG] Warmup get_ai_response returned: None
Failed to initialized the AI model!
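
For reference, "Extra data: line 2 column 1" is the message Python's `json.loads` produces when the text holds more than one JSON document. One possible cause, not confirmed anywhere in this thread, is a streamed reply: Ollama streams newline-delimited JSON unless the request sets `"stream": false`. The sketch below reproduces the error with a made-up three-line body and then parses it line by line; `parse_ollama_body` and `join_chat_content` are hypothetical helpers, not project code.

```python
import json


def parse_ollama_body(text):
    """Parse a body that is either one compact JSON object or one object per line."""
    # Assumes each JSON object sits on its own line (no pretty-printing).
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def join_chat_content(chunks):
    """Concatenate message.content across streamed /api/chat chunks."""
    return "".join(c.get("message", {}).get("content", "") for c in chunks)


# Made-up streamed body in the shape /api/chat uses when streaming.
streamed = (
    '{"model":"phi3","message":{"role":"assistant","content":"Hel"},"done":false}\n'
    '{"model":"phi3","message":{"role":"assistant","content":"lo!"},"done":false}\n'
    '{"model":"phi3","message":{"role":"assistant","content":""},'
    '"done":true,"done_reason":"stop"}\n'
)

try:
    json.loads(streamed)  # treats the whole body as a single document
except json.JSONDecodeError as exc:
    print("single-document parse failed:", exc)

print(join_chat_content(parse_ollama_body(streamed)))
```

If this is what is happening, either sending `"stream": false` or reading the body one line at a time should avoid the error.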

Reference: github-starred/ollama#7401