[GH-ISSUE #8144] When models don't fit in VRAM, Issue alert/confirmation instead of running and freezing computer for hours #51709

Open
opened 2026-04-28 20:46:58 -05:00 by GiteaMirror · 18 comments

Originally created by @Mugane on GitHub (Dec 17, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/8144

Originally assigned to: @jessegross on GitHub.

What is the issue?

When a model is selected that does not fit in VRAM, it runs on the CPU. This is a ridiculous fallback that freezes the whole computer, it should just fail. Or actually use the GPU with shared memory instead of falling back to the CPU only.

OS

Windows 11 Pro

GPU

Nvidia

CPU

Intel

Ollama version

0.3.14


GiteaMirror added the needs more info and bug labels 2026-04-28 20:46:58 -05:00

@rick-github commented on GitHub (Dec 17, 2024):

ollama will load as much of the model into VRAM as will fit; it shouldn't fall back to CPU only. If the model does not fit in RAM+VRAM+swap, the model load will fail. It shouldn't make the machine unusable. Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.
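
For context, the safety valve the reporter is asking for would sit in front of exactly this logic: compare the estimated full memory requirement against free VRAM before starting a runner, and refuse (or ask for confirmation) instead of silently spilling to RAM. A minimal sketch of that check, with invented names (`MemoryEstimate`, `checkFit`, `requireGPU`) that are not ollama's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// MemoryEstimate is a hypothetical stand-in for the numbers ollama logs
// as memory.required.full and memory.available.
type MemoryEstimate struct {
	RequiredFull  uint64 // bytes for weights + KV cache + compute graph
	AvailableVRAM uint64 // bytes free across GPUs
}

var errWontFit = errors.New("model does not fit in VRAM; refusing CPU fallback")

// checkFit fails fast instead of partially offloading when requireGPU is set.
func checkFit(est MemoryEstimate, requireGPU bool) error {
	if est.RequiredFull <= est.AvailableVRAM {
		return nil // fits entirely on the GPU
	}
	if requireGPU {
		return fmt.Errorf("%w (need %.1f GiB, have %.1f GiB)", errWontFit,
			float64(est.RequiredFull)/(1<<30), float64(est.AvailableVRAM)/(1<<30))
	}
	return nil // current behavior: partial offload, spilling into system RAM
}

func main() {
	est := MemoryEstimate{RequiredFull: 47 << 30, AvailableVRAM: 12 << 30}
	if err := checkFit(est, true); err != nil {
		fmt.Println(err)
	}
}
```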

@pdevine commented on GitHub (Dec 18, 2024):

It definitely shouldn't be "freezing", but if you've run out of VRAM it will try to run inference on the CPU instead as @rick-github mentioned. Can you also give the output of ollama ps?

@YonTracks commented on GitHub (Dec 18, 2024):

I think I understand, extreme example. For me, if I run `ollama run llama3.3` or a 70b but my hardware is maxed (GPU with 12 GB VRAM and 32 GB RAM), then it still will run:

llm_load_print_meta: general.name = Llama 3.1 70B Instruct 2024 12
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOM token = 128008 '<|eom_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOG token = 128008 '<|eom_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 632 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
ggml_cuda_host_malloc: failed to allocate 31393.86 MiB of pinned memory: out of memory
llm_load_tensors: offloading 17 repeating layers to GPU
llm_load_tensors: offloaded 17/81 layers to GPU
llm_load_tensors: CPU model buffer size = 563.62 MiB
llm_load_tensors: CPU model buffer size = 31393.86 MiB
llm_load_tensors: CUDA0 model buffer size = 8585.63 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: no device found for buffer type CPU for async uploads
time=2024-12-18T14:08:46.357+10:00 level=DEBUG source=server.go:713 msg="model load progress 0.01"
time=2024-12-18T14:08:46.858+10:00 level=DEBUG source=server.go:713 msg="model load progress 0.04"

And if there's no way to stop the model load, it seems like freezing; if I have other models and things happening on the system, it can slow down and get laggy.

llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
time=2024-12-18T14:21:01.921+10:00 level=DEBUG source=server.go:713 msg="model load progress 1.00"
time=2024-12-18T14:21:02.295+10:00 level=DEBUG source=server.go:716 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_init:        CPU KV buffer size =   496.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   144.00 MiB
llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.52 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1088.45 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    20.01 MiB
llama_new_context_with_model: graph nodes  = 2247
llama_new_context_with_model: graph splits = 687 (with bs=512), 3 (with bs=1)
time=2024-12-18T14:21:02.566+10:00 level=INFO source=server.go:707 msg="llama runner started in 84.50 seconds" pid=7960
time=2024-12-18T14:21:02.570+10:00 level=DEBUG source=sched.go:467 msg="finished setting up runner" model=C:\Users\clint\.ollama\models\blobs\sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d
time=2024-12-18T14:21:02.594+10:00 level=DEBUG source=sched.go:471 msg="context for request finished"
[GIN] 2024/12/18 - 14:21:02 | 200 |         1m24s |       127.0.0.1 | POST     "/api/generate"
time=2024-12-18T14:21:02.594+10:00 level=DEBUG source=sched.go:344 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d duration=5m0s
time=2024-12-18T14:21:02.598+10:00 level=DEBUG source=sched.go:362 msg="after processing request finished event" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d refCount=0
time=2024-12-18T14:23:13.412+10:00 level=DEBUG source=sched.go:580 msg="evaluating already loaded" model=C:\Users\clint\.ollama\models\blobs\sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d
time=2024-12-18T14:23:13.424+10:00 level=DEBUG source=routes.go:1652 msg="chat request" images=0 prompt="<|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
time=2024-12-18T14:23:13.437+10:00 level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=11 used=0 remaining=11
[GIN] 2024/12/18 - 14:26:02 | 200 |         2m49s |       127.0.0.1 | POST     "/api/chat"
time=2024-12-18T14:26:02.802+10:00 level=DEBUG source=sched.go:412 msg="context for request finished"
time=2024-12-18T14:26:02.806+10:00 level=DEBUG source=sched.go:344 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d duration=5m0s
time=2024-12-18T14:26:02.808+10:00 level=DEBUG source=sched.go:362 msg="after processing request finished event" modelPath=C:\Users\clint\.ollama\models\blobs\sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d refCount=0

But I love that ollama lets it run; if it did not, or failed? I would just try to get it to run anyway, lol.
I think it's a good feature, but it can make ollama look bad?
Good luck.

llama3.3:latest a6eb4748fd29 46 GB 75%/25% CPU/GPU 4 minutes from now

@YonTracks commented on GitHub (Dec 18, 2024):

For the maintainers: I can't capture the logs, but I have seen a case where ollama detects the GPU but does not use it.
Ollama detects the GPU, thinks it is using it, and allocates resources on it along with the CPU, but it is only using the CPU, and then the system becomes unusable (laggy, frozen, etc.).
Hard to capture the logs at this time; possibly force that behavior for a test.
Good luck.

@YonTracks commented on GitHub (Dec 18, 2024):

memory.go.txt (https://github.com/user-attachments/files/18176597/memory.go.txt)

Here's my attempt so far:

time=2024-12-18T16:40:36.327+10:00 level=DEBUG source=gpu.go:406 msg="updating system memory data" before.total="31.9 GiB" before.free="26.6 GiB" before.free_swap="53.2 GiB" now.total="31.9 GiB" now.free="26.6 GiB" now.free_swap="53.2 GiB"
time=2024-12-18T16:40:36.341+10:00 level=DEBUG source=gpu.go:456 msg="updating cuda memory data" gpu=GPU-ff81c5c3-1705-5338-5a99-ebf6ae2dfea2 name="NVIDIA GeForce RTX 3060" overhead="115.5 MiB" before.total="12.0 GiB" before.free="11.0 GiB" now.total="12.0 GiB" now.free="11.0 GiB" now.used="924.9 MiB"
time=2024-12-18T16:40:36.342+10:00 level=INFO source=memory.go:364 msg="offload to cuda" warning="not all layers can be offloaded effciently This is might be a problem!" warning="layers.model is more than double layers.offload and memory.required.full exceeds memory.weights.total will be slow!" layers.requested=-1 layers.model=81 layers.offload=18 layers.split="" memory.available="[11.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="43.7 GiB" memory.required.partial="11.0 GiB" memory.required.kv="640.0 MiB" memory.required.allocations="[11.0 GiB]" memory.weights.total="38.9 GiB" memory.weights.repeating="38.1 GiB" memory.weights.nonrepeating="822.0 MiB" memory.graph.full="324.0 MiB" memory.graph.partial="1.1 GiB"
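
The warnings in that log (layers.offload=18 out of layers.model=81) carry exactly the signal a pre-flight alert could key off. A rough sketch of turning those two numbers into a user-facing decision; the thresholds and names here are invented for illustration, not ollama's actual code:

```go
package main

import "fmt"

// offloadDecision classifies the layers.offload / layers.model numbers
// that ollama already logs. Hypothetical, invented for this sketch.
func offloadDecision(layersOffload, layersModel int) string {
	switch {
	case layersOffload >= layersModel:
		return "ok: fully offloaded to GPU"
	case layersOffload*2 >= layersModel:
		return "warn: partial offload, expect reduced speed"
	default:
		// The logged warning's case: model layers are more than double the
		// offloaded layers, so inference is mostly CPU-bound and very slow.
		return "alert: mostly on CPU, ask the user to confirm"
	}
}

func main() {
	fmt.Println(offloadDecision(18, 81)) // the case from the log above
}
```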

@MarlNox commented on GitHub (Dec 18, 2024):

> What is the issue?
>
> When a model is selected that does not fit in VRAM, it runs on the CPU. This is a ridiculous fallback that freezes the whole computer, it should just fail. Or actually use the GPU with shared memory instead of falling back to the CPU only.
>
> OS: Windows 11 Pro
> GPU: Nvidia
> CPU: Intel
> Ollama version: 0.3.14

The model should offload layers to RAM/swap when it doesn't fully fit in VRAM; currently it falls back onto pure RAM/swap while the GPU seems to be unused.

@rick-github commented on GitHub (Dec 18, 2024):

ollama will load as much of the model into VRAM as will fit; it shouldn't fall back to CPU only. If you find that's not the case, server logs would help with debugging.

@vnicolici commented on GitHub (Dec 22, 2024):

This happened to me too. Of course, it's partly my fault for trying to run such large models, but I think it might be worth it to prevent this behavior by default (not load a model that is too large compared to the available RAM and VRAM), maybe with an option to override the default behavior if you really want to run your models out of the system page file.
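
An opt-in override along those lines could be as small as an environment-variable gate checked before loading. A sketch under that assumption; `OLLAMA_ALLOW_SWAP` is an invented name, not a real ollama setting, and the sizes come from the numbers in this comment and its logs:

```go
package main

import (
	"fmt"
	"os"
)

// allowSwapSpill reports whether the user explicitly opted in to loading
// models larger than free RAM + VRAM (i.e. running out of the page file).
// OLLAMA_ALLOW_SWAP is hypothetical, invented for this sketch.
func allowSwapSpill() bool {
	return os.Getenv("OLLAMA_ALLOW_SWAP") == "1"
}

func main() {
	required := uint64(83) << 30                 // mixtral:8x22b loaded size per ollama ps
	available := uint64(49)<<30 + uint64(24)<<30 // roughly: free RAM + total VRAM

	if required > available && !allowSwapSpill() {
		fmt.Println("refusing: model exceeds RAM+VRAM; set OLLAMA_ALLOW_SWAP=1 to override")
		os.Exit(1)
	}
	fmt.Println("loading...")
}
```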

My Windows 11 computer has 24 GB VRAM, 64 GB RAM and a 79 GB paging file, and I believe those 79 GB were allocated the last time I wanted to run a large model. In particular, I wanted to run the mixtral:8x22b model. I thought it was a 22b model, which should run quite well on my system. What I didn't realize is that the 8x22 means 8 times 22, so in fact it was a 176b model that is 79 GB on disk:

mixtral:8x22b                e8479ee1cb51    79 GB     2 days ago

In any case, aside from completely making my computer unusable for about 5 minutes, running the model actually worked:

C:\Users\vladn\.ollama> ollama run --verbose mixtral:8x22b
>>> why did the chicken cross the road?
 That's a classic joke! The chicken crossed the road to get to the other
side. This phrase is an example of anti-humor, where the interest and
humor come from the surprise of a simple or literal explanation to a
question that typically has a humorous punchline.

total duration:       3m34.032943s
load duration:        56.6599ms
prompt eval count:    12 token(s)
prompt eval duration: 33.889s
prompt eval rate:     0.35 tokens/s
eval count:           58 token(s)
eval duration:        3m0.042s
eval rate:            0.32 tokens/s
>>>
PS C:\Users\vladn> ollama ps
NAME             ID              SIZE     PROCESSOR          UNTIL
mixtral:8x22b    e8479ee1cb51    83 GB    77%/23% CPU/GPU    4 minutes from now

Regarding not using VRAM, this happens too, even with smaller models, if you run a model with a large context window, for example mistral-nemo-12b with a 250000 num_ctx parameter. Reducing it to 125000 makes it use the GPU RAM too:

C:\Users\vladn\.ollama>copy demo.ModelFile con
FROM mistral-nemo:12b

PARAMETER num_ctx 250000
        1 file(s) copied.

C:\Users\vladn\.ollama>ollama create -f demo.ModelFile demo
transferring model data
using existing layer sha256:b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
using existing layer sha256:f023d1ce0e55d0dcdeaf70ad81555c2a20822ed607a7abd8de3c3131360f5f0a
using existing layer sha256:43070e2d4e532684de521b885f385d0841030efa2b1a20bafb76133a5e1379c1
creating new layer sha256:1ea54a6eec1baa8beeea5c4e7ce7ddc73084d06d310c1433cdad367110730832
creating new layer sha256:0c25e00f1add8344c696a5e183feaef33284f3576fd72c073c913b23f623d4ee
writing manifest
success

C:\Users\vladn\.ollama>ollama run demo
>>>

C:\Users\vladn\.ollama>ollama ps
NAME           ID              SIZE     PROCESSOR    UNTIL
demo:latest    0fd7ecd717a7    47 GB    100% CPU     4 minutes from now

C:\Users\vladn\.ollama>copy demo.ModelFile con
FROM mistral-nemo:12b

PARAMETER num_ctx 125000
        1 file(s) copied.

C:\Users\vladn\.ollama>ollama create -f demo.ModelFile demo
transferring model data
using existing layer sha256:b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94
using existing layer sha256:f023d1ce0e55d0dcdeaf70ad81555c2a20822ed607a7abd8de3c3131360f5f0a
using existing layer sha256:43070e2d4e532684de521b885f385d0841030efa2b1a20bafb76133a5e1379c1
creating new layer sha256:20044ba05f395301dddc9b49594c93ae1eb09440d24d8e57033a172897af1608
creating new layer sha256:64172a7cee69e379612a22d4c11ba436da0483b4af2cfe2f0d9b8826d7219d4c
writing manifest
success

C:\Users\vladn\.ollama>ollama run demo
>>>

C:\Users\vladn\.ollama>ollama ps
NAME           ID              SIZE     PROCESSOR          UNTIL
demo:latest    198c31d8d8f3    37 GB    49%/51% CPU/GPU    4 minutes from now

And the log when it's using 100% CPU and 0% GPU:

[GIN] 2024/12/22 - 09:07:34 | 200 |     63.8553ms |       127.0.0.1 | POST     "/api/create"
[GIN] 2024/12/22 - 09:07:37 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/12/22 - 09:07:37 | 200 |     18.4045ms |       127.0.0.1 | POST     "/api/show"
time=2024-12-22T09:07:39.354+02:00 level=INFO source=server.go:104 msg="system memory" total="63.8 GiB" free="49.5 GiB" free_swap="92.8 GiB"
time=2024-12-22T09:07:39.355+02:00 level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=41 layers.offload=0 layers.split="" memory.available="[18.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="44.4 GiB" memory.required.partial="0 B" memory.required.kv="38.1 GiB" memory.required.allocations="[0 B]" memory.weights.total="43.9 GiB" memory.weights.repeating="43.3 GiB" memory.weights.nonrepeating="525.0 MiB" memory.graph.full="15.8 GiB" memory.graph.partial="17.0 GiB"
time=2024-12-22T09:07:39.364+02:00 level=INFO source=server.go:376 msg="starting llama server" cmd="C:\\Users\\vladn\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe runner --model C:\\Users\\vladn\\.ollama\\models\\blobs\\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 --ctx-size 250000 --batch-size 512 --threads 8 --no-mmap --parallel 1 --port 51129"
time=2024-12-22T09:07:39.366+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-12-22T09:07:39.366+02:00 level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
time=2024-12-22T09:07:39.367+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
time=2024-12-22T09:07:39.375+02:00 level=INFO source=runner.go:945 msg="starting go runner"
time=2024-12-22T09:07:39.377+02:00 level=INFO source=runner.go:946 msg=system info="CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(clang)" threads=8
time=2024-12-22T09:07:39.377+02:00 level=INFO source=.:0 msg="Server listening on 127.0.0.1:51129"
llama_model_loader: loaded meta data with 35 key-value pairs and 363 tensors from C:\Users\vladn\.ollama\models\blobs\sha256-b559938ab7a0392fc9ea9675b82280f2a15669ec3e0e0fc491c9cb0a7681cf94 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Mistral Nemo Instruct 2407
llama_model_loader: - kv   3:                            general.version str              = 2407
llama_model_loader: - kv   4:                           general.finetune str              = Instruct
llama_model_loader: - kv   5:                           general.basename str              = Mistral-Nemo
llama_model_loader: - kv   6:                         general.size_label str              = 12B
llama_model_loader: - kv   7:                            general.license str              = apache-2.0
llama_model_loader: - kv   8:                          general.languages arr[str,9]       = ["en", "fr", "de", "es", "it", "pt", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 40
llama_model_loader: - kv  10:                       llama.context_length u32              = 1024000
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                 llama.attention.key_length u32              = 128
llama_model_loader: - kv  18:               llama.attention.value_length u32              = 128
llama_model_loader: - kv  19:                          general.file_type u32              = 2
llama_model_loader: - kv  20:                           llama.vocab_size u32              = 131072
llama_model_loader: - kv  21:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  22:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = tekken
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,131072]  = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,131072]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,269443]  = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ ®..
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  29:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  30:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {%- if messages[0]['role'] == 'system...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_0:  281 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 1000
time=2024-12-22T09:07:39.618+02:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: token to piece cache size = 0.8498 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 131072
llm_load_print_meta: n_merges         = 269443
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 1024000
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 1024000
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 12.25 B
llm_load_print_meta: model size       = 6.58 GiB (4.61 BPW) 
llm_load_print_meta: general.name     = Mistral Nemo Instruct 2407
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 1196 'Ä'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 150
llm_load_tensors:          CPU model buffer size =   886.58 MiB
llm_load_tensors:  CPU_AARCH64 model buffer size =  5850.00 MiB
llama_new_context_with_model: n_seq_max     = 1
llama_new_context_with_model: n_ctx         = 250016
llama_new_context_with_model: n_ctx_per_seq = 250016
llama_new_context_with_model: n_batch       = 512
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (250016) < n_ctx_train (1024000) -- the full capacity of the model will not be utilized
llama_kv_cache_init:        CPU KV buffer size = 39065.00 MiB
llama_new_context_with_model: KV self size  = 39065.00 MiB, K (f16): 19532.50 MiB, V (f16): 19532.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.52 MiB
llama_new_context_with_model:        CPU compute buffer size = 16150.32 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1
time=2024-12-22T09:07:48.885+02:00 level=INFO source=server.go:594 msg="llama runner started in 9.52 seconds"
[GIN] 2024/12/22 - 09:07:48 | 200 |   11.1710193s |       127.0.0.1 | POST     "/api/generate"
[GIN] 2024/12/22 - 09:07:52 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/12/22 - 09:07:52 | 200 |            0s |       127.0.0.1 | GET      "/api/ps"

@rick-github commented on GitHub (Dec 22, 2024):

Yes, a context window of 250000 tokens requires a KV cache of 38.1G, which won't fit on the GPU. If the KV cache can't fit in the GPU, then the entire model will run in RAM.
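
The 38.1G figure can be reproduced from the GGUF metadata in the log above (llama.block_count = 40, n_embd_k_gqa = n_embd_v_gqa = 1024, an f16 cache at 2 bytes per element, and the requested 250000 context padded to 250016):

```go
package main

import "fmt"

func main() {
	const (
		nLayer   = 40     // llama.block_count
		nCtx     = 250016 // n_ctx after padding the requested 250000
		nEmbdGQA = 1024   // n_embd_k_gqa == n_embd_v_gqa
		bytesF16 = 2      // default f16 KV cache element size
	)
	// K and V caches are the same size, hence the leading factor of 2.
	kvBytes := uint64(2) * nLayer * nCtx * nEmbdGQA * bytesF16
	fmt.Printf("KV cache: %.2f MiB = %.1f GiB\n",
		float64(kvBytes)/(1<<20), float64(kvBytes)/(1<<30))
	// Output: KV cache: 39065.00 MiB = 38.1 GiB,
	// matching llama_kv_cache_init in the log above.
}
```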

@YonTracks commented on GitHub (Dec 22, 2024):

Seems like all normal, good behavior to me, other than needing a good warning for when it's going to be slow.

But what to look for with the bug is:
something like `49%/51% CPU/GPU` (as in `demo:latest 198c31d8d8f3 37 GB 49%/51% CPU/GPU 4 minutes from now`),
and I think the logs were even normal, but monitoring the system, the GPU was not being used. The reason was orphaned processes still holding the VRAM (for some reason ollama silently restarted / crashed), and the problem only gets worse until you restart the system. Possibly the embedding issue with many, many requests and timeouts, and then it loops with the closing / expiring runner loop, wrong context tokens, invalid, other reasons also, just sayin'.

Expiring runner, invalid ref count and other related issues, but as I keep saying, if ollama silently restarts it's hard to capture the logs.

Holidays, so not a good time, but it will be solved soon enough.
Good luck.

@rick-github commented on GitHub (Dec 22, 2024):

No, if there are orphaned ollama processes they won't show up in the ollama ps output.

@YonTracks commented on GitHub (Dec 22, 2024):

Correct, they don't, but the new one does, with incorrect info? It thinks it has the GPU, so it's slow as..., but all seems good, until you find the orphaned processes or restart the system.
For Windows, that is? And goroutines, fun as... not. They seem to lock things and create loops and lagging etc...
But good and powerful, I bet.
I think anyway (the go runner bit), don't get me wrong lol.

I know I'm a bad communicator, sorry. I am helping because the issue has been around for a while now, and the updates have made things way better: logging seems better, embeddings seem way better with the context tokens being corrected somewhere, so better with multiple things. But now, because ollama is working well, some things / issues are starting to show that may make ollama seem worse, and I believe it's the go runner update, or somewhere around there, go runner related, plus the silent crashing bit (harder to make happen; the context tokens being wrong was a way to force it).

I hope I'm clear enough.
super cheers, love ollama lots.

@rick-github commented on GitHub (Dec 22, 2024):

If they are orphaned processes, ollama by definition doesn't know about them, and they will not show up in ollama ps. They will use resources and will be visible in nvidia-smi or the process task list (ps or the Windows equivalent).
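
On an NVIDIA box, one quick way to run that check is to ask nvidia-smi which processes currently hold GPU memory; any ollama_llama_server still listed after the main ollama process has restarted is an orphan pinning VRAM. A small diagnostic sketch that just shells out (not part of ollama):

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// List every process currently holding GPU memory.
	out, err := exec.Command("nvidia-smi",
		"--query-compute-apps=pid,process_name,used_gpu_memory",
		"--format=csv").CombinedOutput()
	if err != nil {
		fmt.Println("nvidia-smi failed:", err)
		return
	}
	// Look for ollama_llama_server entries that no longer correspond to a
	// live runner; those are the orphans still holding VRAM.
	fmt.Print(string(out))
}
```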

@YonTracks commented on GitHub (Dec 22, 2024):

> If they are orphaned processes, ollama by definition doesn't know about them, and they will not show up in ollama ps. They will use resources and will be visible in nvidia-smi or the process task list (ps or the Windows equivalent).

Here's how I did it for Windows so far, but I think the issue is the goroutine bit, idk, newbie lol.
I think that's why I need to change from viewing the log in one tab to a new tab and back sometimes to get the updated logs? (Is that the type of issue here, blocking, locking?)
And it seems the pid tracking gets lost somewhere because of it, further back in the code, and a break or return somewhere is doing something lol, and from dev mode to normal also, srry, bad communicator lol. https://github.com/ollama/ollama/commit/9d960d3683659ad2098fd51b861346d547ff0145

@Mugane commented on GitHub (Jan 23, 2025):

Sorry I didn't circle back on this with details people asked for yet - I actually replaced my computer with one with a much bigger GPU and am still working out other kinks. But, to clarify, it does not technically make the computer unusable, just locks up ollama (can't cancel or run other models until it has painstakingly completed falling back to CPU/RAM). And this may be desirable, but a warning/option to stop when this happens would be really useful. This somewhat relates to the fact that in open-webui model sizes are not displayed in the chat interface, so it is difficult to assess if a model is too big or not, unless you only download models that fit (there are reasons to not do so). I'm not sure which project should prioritize this check, perhaps both. On the Ollama side perhaps send a response with the warning and then wait 5s before falling back, to allow time to cancel?
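
The warn-then-wait idea sketches out naturally with a context and a timer: stream the warning, then give the client a grace window to cancel before committing to the slow load. Everything below is invented for illustration, not ollama's actual scheduler:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// loadWithGrace streams a warning, then waits out a grace period before
// committing to a slow partial-CPU load, so the client can still cancel.
// The names and the flow are hypothetical.
func loadWithGrace(ctx context.Context, warn func(string)) error {
	warn("model exceeds VRAM; falling back to CPU in 5s (cancel to abort)")
	select {
	case <-time.After(5 * time.Second):
		return nil // no cancellation: proceed with the slow fallback load
	case <-ctx.Done():
		return ctx.Err() // client cancelled in time; never start the load
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() { // simulate a user cancelling one second after the warning
		time.Sleep(time.Second)
		cancel()
	}()
	err := loadWithGrace(ctx, func(msg string) { fmt.Println("WARN:", msg) })
	fmt.Println("result:", err) // result: context canceled
}
```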

Also I understand now that the "shared" GPU memory is basically a hoax, it is completely unusable for ML/inference due to bus speeds. A pcie-attached RAM (some are working on it) would be the solution, or a chipset that puts everything closer together (Mac, but I think I would rather shoot myself in the face than buy one).

@rick-github commented on GitHub (Jan 23, 2025):

Your point about the poor model visibility from open-webui is a good one, it was annoying me yesterday. Until a safety valve is added to ollama, there are operating system mechanisms that can be used to limit RAM usage, for example prlimit (https://man7.org/linux/man-pages/man1/prlimit.1.html) for Linux and process governor (https://github.com/lowleveldesign/process-governor) for Windows. There's likely something similar for macOS.

@Mugane commented on GitHub (Apr 2, 2025):

@rick-github OS mechanisms are not useful because that would be a blanket on/off solution, and the choice is one that needs to be made per individual chat request (e.g. to use the slow model when the small fast ones that fit in VRAM do not provide a satisfactory response). We really need better visibility of model size in the front-end though. I created a feature request, but they moved it to a discussion (https://github.com/open-webui/open-webui/discussions/7924) and nobody has even looked at it...

@rick-github commented on GitHub (Apr 2, 2025):

OS mechanisms are for protecting the machine, so you're right in that they're not appropriate for fine-grained control.

Open-webui displays the model size (parameters and byte size) in the model selector but it's not really a good measure of how it's going to affect the machine. The actual resources consumed by the model depend on more than just the size of the model weights. Context size, parallelism, flash attention, KV cache quantization, sliding window configuration, number of devices, other clients of GPU VRAM, etc all affect the resource allocation, and many of these are not surfaced in a way that a client can access for decisions on model selection.

In theory you could adjust the response from /api/tags to return the estimated loaded size of the model, but that would require touching every model to determine the model parameters. Caching would help but for large model collections it might be a performance drag when the client refreshes the model list.
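
The caching idea could be as simple as memoizing the estimate per model digest, since an estimate only changes when the model or its requested parameters change. A hypothetical sketch of that memoization, not ollama's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// estimateCache memoizes per-digest loaded-size estimates so a client
// refreshing the model list doesn't force re-deriving them every time.
type estimateCache struct {
	mu sync.Mutex
	m  map[string]uint64 // model digest -> estimated loaded bytes
}

func (c *estimateCache) get(digest string, compute func() uint64) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.m[digest]; ok {
		return v // cached: no need to touch the model files again
	}
	v := compute() // expensive: read metadata, sum weights + KV + graph
	c.m[digest] = v
	return v
}

func main() {
	c := &estimateCache{m: make(map[string]uint64)}
	size := c.get("sha256-b559938a...", func() uint64 { return 47 << 30 })
	fmt.Printf("estimated loaded size: %d GiB\n", size>>30)
}
```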

Reference: github-starred/ollama#51709