[GH-ISSUE #1150] loading yi:6b-200k seems to take forever #47093

Closed
opened 2026-04-28 03:06:04 -05:00 by GiteaMirror · 12 comments

Originally created by @happy15 on GitHub (Nov 16, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1150

Originally assigned to: @BruceMacD on GitHub.

Hi there, I just upgraded to Ollama v0.1.9 and tried to run yi:6b-200k, and it seems to take forever to load after pulling the model. After that, if I try to run zephyr, it also takes forever to load.

Then I restarted Ollama and ran zephyr successfully.

My environment is a MacBook Pro 14 with an M1 Pro and 16 GB of RAM.

Is there anything I'm missing?

Thanks.

GiteaMirror added the bug label 2026-04-28 03:06:04 -05:00

@igorschlum commented on GitHub (Nov 16, 2023):

You have to check the memory usage on your computer. Have you tried restarting your Mac and launching only Ollama?
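
For example, a couple of quick ways to take that memory snapshot on macOS before loading a large model (generic suggestions, not commands from this thread; the exact output wording varies by macOS version):

```
top -l 1 | grep PhysMem   # one-shot sample, e.g. "PhysMem: 15G used (...), 1G unused"
vm_stat | head -n 5       # page-level counters; multiply by the reported page size
```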


@easp commented on GitHub (Nov 16, 2023):

I had trouble when trying to load that model on my 32GB system. From ~/.ollama/logs/server it appeared that it couldn't allocate enough memory for GPU use, but then rather than exiting with an error, it just hung. In its hung state it couldn't respond to requests to run other models.

Ultimately, though, you don't have enough memory. A 200k context takes a lot of memory; I remember it being ~10-12 GB. Also, I'm not sure any of the 20K+ context models are really working reliably yet.


@mchiang0610 commented on GitHub (Nov 16, 2023):

Thanks @easp, I'm able to reproduce on my end as well. Adding logs to this:

```
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 10922.67 MB
ggml_metal_init: maxTransferRate               = built-in GPU
llama_new_context_with_model: compute buffer total size = 12920.75 MB
llama_new_context_with_model: max tensor size =   205.08 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3318.16 MB, ( 3318.78 / 10922.67)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  8192.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  4308.03 MB, offs =   8589918208, (15818.81 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  8192.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  4722.66 MB, offs =   8589918208, (28733.47 / 10922.67), warning: current allocated size is greater than the recommended max working set size
ggml_metal_get_buffer: error: buffer is nil
2023/11/16 11:18:42 llama.go:438: error starting llama runner: timed out waiting for llama runner to start
[GIN] 2023/11/16 - 11:30:49 | 200 |   34.591458ms |       127.0.0.1 | GET      "/api/tags"
```
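
As a rough cross-check of where those two `kv` buffers come from, here is a back-of-the-envelope sketch. The model shape is assumed from Yi-6B's published config (32 layers, 4 KV heads with grouped-query attention, 128-dim heads) rather than from anything in the log:

```
# fp16 KV cache at the full 200k context:
# 2 (K and V) x layers x ctx x kv_heads x head_dim x 2 bytes/element
kv_bytes=$(( 2 * 32 * 200000 * 4 * 128 * 2 ))
echo "$kv_bytes bytes = $(( kv_bytes / 1024 / 1024 )) MiB"
# -> 13107200000 bytes = 12500 MiB, which lines up with the 8192.00 + 4308.03 MB
#    'kv' buffers above and already exceeds the 10922.67 MB
#    recommendedMaxWorkingSetSize before the compute buffers are even counted.
```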

@BruceMacD commented on GitHub (Nov 16, 2023):

Thanks for the report, this appears to happen on systems that don't have enough memory to load the large model context. I'm looking into handling this better.

For future readers: if you see this error, it means you don't have enough memory to run the model, as easp mentioned.
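
For anyone who wants to run this model on a machine with less memory, one workaround (a sketch, not an official recommendation) is to build a variant with a smaller context window, since the KV cache scales linearly with num_ctx:

```
# Hypothetical Modelfile: cap the context instead of allocating the full 200k.
FROM yi:6b-200k
PARAMETER num_ctx 4096
```

Then build and run it with `ollama create yi-6b-4k -f Modelfile` followed by `ollama run yi-6b-4k`. The name `yi-6b-4k` and the 4096 value are just examples; pick whatever context size fits your memory budget.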


@happy15 commented on GitHub (Nov 17, 2023):

> Thanks @easp, I'm able to reproduce on my end as well. Adding logs to this:
>
> ```
> ggml_metal_init: GPU name:   Apple M1 Pro
> ggml_metal_init: GPU family: MTLGPUFamilyApple7 (1007)
> ggml_metal_init: hasUnifiedMemory              = true
> ggml_metal_init: recommendedMaxWorkingSetSize  = 10922.67 MB
> ggml_metal_init: maxTransferRate               = built-in GPU
> llama_new_context_with_model: compute buffer total size = 12920.75 MB
> llama_new_context_with_model: max tensor size =   205.08 MB
> ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3318.16 MB, ( 3318.78 / 10922.67)
> ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  8192.00 MB, offs =            0
> ggml_metal_add_buffer: allocated 'kv              ' buffer, size =  4308.03 MB, offs =   8589918208, (15818.81 / 10922.67), warning: current allocated size is greater than the recommended max working set size
> ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  8192.00 MB, offs =            0
> ggml_metal_add_buffer: allocated 'alloc           ' buffer, size =  4722.66 MB, offs =   8589918208, (28733.47 / 10922.67), warning: current allocated size is greater than the recommended max working set size
> ggml_metal_get_buffer: error: buffer is nil
> 2023/11/16 11:18:42 llama.go:438: error starting llama runner: timed out waiting for llama runner to start
> [GIN] 2023/11/16 - 11:30:49 | 200 |   34.591458ms |       127.0.0.1 | GET      "/api/tags"
> ```

Thanks, I checked the log and saw the error.


@happy15 commented on GitHub (Nov 17, 2023):

> I had trouble when trying to load that model on my 32GB system. From ~/.ollama/logs/server it appeared that it couldn't allocate enough memory for GPU use, but then rather than exiting with an error, it just hung. In its hung state it couldn't respond to requests to run other models.
>
> Ultimately though, you don't have enough memory. 200k context takes a lot of memory. I remember it being ~10-12GB. Also, I'm not sure any of the 20K+ context models are really working reliably yet.

Thanks for the info.


@happy15 commented on GitHub (Nov 17, 2023):

Thanks, everyone, for the diagnosis and the info.


@happy15 commented on GitHub (Nov 17, 2023):

https://github.com/01-ai/Yi/issues/56#issuecomment-1800872082

Cross-linking a related discussion on the Yi model's official repo.

They mention installing flash attention to control memory usage. Has Ollama already done this?


@BruceMacD commented on GitHub (Nov 17, 2023):

Thanks for the heads-up, @happy15. The linked issue wouldn't apply to Ollama as far as I can tell; it seems to be related to a missing package in their Python project, which we don't use.


@easp commented on GitHub (Nov 21, 2023):

The underlying issue of not handling this situation gracefully still exists.

Should this be reopened, @BruceMacD?


@igorschlum commented on GitHub (Nov 21, 2023):

With Ollama 0.1.10 and Ollama 0.1.11 on a 32 GB M1 Pro Mac, I get this error:

```
(base) igor@macIgor ~ % ollama run yi:6b-200k
pulling manifest
pulling 0177ca5616b4... 100% ▕█████████████▏ (3.5/3.5 GB, 5.1 MB/s)
pulling 8773f7716220... 100% ▕████████████████████▏ (17/17 kB, 12 kB/s)
pulling b3736cdce03e... 100% ▕██████████████████████▏ (18/18 B, 9 B/s)
pulling c93fb84f6006... 100% ▕███████████████████▏ (381/381 B, 227 B/s)
verifying sha256 digest
writing manifest
removing any unused layers
success
⠧ Error: llama runner process has terminated
```
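
When the runner dies like this, the server log easp mentioned earlier usually says why. On macOS it can be inspected with something like the following (the exact log filename may differ by Ollama version):

```
tail -n 50 ~/.ollama/logs/server.log
```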


@BruceMacD commented on GitHub (Mar 11, 2024):

Thanks to everyone for helping add info here. We've changed how Ollama runs models since this issue was opened, and the LLM runner process will no longer terminate during loading, so I'm going to resolve this issue for now.


Reference: github-starred/ollama#47093