[GH-ISSUE #10111] llama3-gradient:1048k stuck at loading model #32392

Closed
opened 2026-04-22 13:36:14 -05:00 by GiteaMirror · 7 comments

Originally created by @AlbertoSinigaglia on GitHub (Apr 3, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10111

### What is the issue?

Hi, I was trying to use `llama3-gradient:1048k` from [here](https://ollama.com/library/llama3-gradient:1048k).

The model is downloaded correctly:

```
OLLAMA_HOST="localhost:6000" ollama pull llama3-gradient:1048k
pulling manifest 
pulling 011c3962dbd7... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.7 GB                         
pulling 4fa551d4f938... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  12 KB                         
pulling 8ab4849b038c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  254 B                         
pulling 577073ffcc6c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  110 B                         
pulling 6e00eacbd779... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏  483 B                         
verifying sha256 digest 
writing manifest 
success
```

However, when I try to run it, it stalls completely during loading:

```
OLLAMA_HOST="localhost:6000" ollama run llama3-gradient:1048k
⠙
```

There is no sign of the model on the GPU:

![Image](https://github.com/user-attachments/assets/c9b9a499-abff-43c4-8f6d-6a0f914b87e2)

### Relevant log output

The logs show no obvious error; only the following lines appear when I try to run the model:

```
[GIN] 2025/04/03 - 13:05:28 | 200 |      57.921µs |       127.0.0.1 | HEAD     "/"
[GIN] 2025/04/03 - 13:05:28 | 200 |   43.724053ms |       127.0.0.1 | POST     "/api/show"
```

### OS

Linux

### GPU

Nvidia

### CPU

AMD

### Ollama version

`ollama version is 0.6.3`

GiteaMirror added the bug label 2026-04-22 13:36:14 -05:00

@rick-github commented on GitHub (Apr 3, 2025):

Set `OLLAMA_DEBUG=1` in the server environment; the progress of the load will be logged.
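For example, a minimal sketch of enabling this on a hand-launched server (the non-default port just mirrors the reporter's `OLLAMA_HOST`; a systemd-managed install would instead add the variable via `systemctl edit ollama` and restart the service):

```shell
# Stop the running server, then relaunch it with debug logging enabled.
# Load progress (layer offload, buffer allocation) then shows up in the log.
OLLAMA_DEBUG=1 OLLAMA_HOST="localhost:6000" ollama serve
```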


@AlbertoSinigaglia commented on GitHub (Apr 3, 2025):

@rick-github is it trying to load it on the CPU?...

![Image](https://github.com/user-attachments/assets/b2bb5aca-419b-4164-8f82-df7f514b7f94)

I'm restarting ollama with that env var set to 1; give me a few minutes.

---

It actually makes sense... it's reserving 201 GB of RAM. Probably with the multi-GPU overhead it needs more than that, and the 260 GB of VRAM is not enough?


@rick-github commented on GitHub (Apr 3, 2025):

Have you set the context length to something large? If so, yes, the model will not fit on the GPU and the system is busy allocating hundreds of gigabytes of RAM to load the model.
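A rough back-of-envelope sketch of why: the KV cache grows linearly with the context length, and at 1M tokens it dwarfs the 4.7 GB of weights. The numbers below assume the published Llama 3 8B architecture (32 layers, 8 KV heads with GQA, head dim 128) and the default f16 cache; they are an estimate, not figures from this issue's logs:

```shell
# KV cache bytes = 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes/elem
layers=32 kv_heads=8 head_dim=128 bytes=2
for ctx in 8192 524288 1048576; do
  echo "$ctx tokens -> $(( 2 * layers * kv_heads * head_dim * ctx * bytes / 1024**3 )) GiB"
done
# Prints 1 GiB, 64 GiB, and 128 GiB: at a 1M context the f16 KV cache alone
# needs ~128 GiB, in the same ballpark as the ~201 GB allocation observed
# above (the rest goes to weights and compute buffers).
```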


@AlbertoSinigaglia commented on GitHub (Apr 3, 2025):

@rick-github haha, yes, I wanted to try the 1M context length, but the `O(n^2)` attention really isn't scalable at the end of the day lol


@rick-github commented on GitHub (Apr 3, 2025):

If you haven't already, you can try setting `OLLAMA_FLASH_ATTENTION=1` and experimenting with [`OLLAMA_KV_CACHE_TYPE`](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache).
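For instance (a sketch of the server launch; per the Ollama FAQ, the KV cache quantization only takes effect when flash attention is enabled, and accepts `f16`, `q8_0`, or `q4_0`):

```shell
# q8_0 roughly halves KV cache memory versus the default f16;
# q4_0 roughly quarters it, at some cost to output quality.
OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 \
  OLLAMA_HOST="localhost:6000" ollama serve
```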


@AlbertoSinigaglia commented on GitHub (Apr 3, 2025):

@rick-github indeed it was the memory; this is with a 512K context length... 1M is still far from approachable lol

![Image](https://github.com/user-attachments/assets/4a021362-ccec-41e8-bc50-2abd5015832c)

Thanks for reminding me that [OLLAMA_KV_CACHE_TYPE](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-set-the-quantization-type-for-the-kv-cache) exists; I forget about it every time. I'll try it again right away.


@AlbertoSinigaglia commented on GitHub (Apr 3, 2025):

`q8_0` quantization of the KV cache does indeed drop memory usage significantly (this is with a 512K context length). Thanks @rick-github, you are always amazing!

![Image](https://github.com/user-attachments/assets/0f3568d7-149e-4ed3-a731-4273738e002b)
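For anyone hitting the same wall: once the model loads, `ollama ps` is a quick way to confirm its memory footprint and whether any of it spilled to system RAM, without enabling debug logging:

```shell
OLLAMA_HOST="localhost:6000" ollama ps
# Lists each loaded model with its SIZE and a PROCESSOR column showing
# the CPU/GPU split (e.g. "100% GPU" when nothing spilled to RAM).
```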


Reference: github-starred/ollama#32392