[GH-ISSUE #9005] The custom model of deepseek-r1-70b is very slow. #5852

Closed
opened 2026-04-12 17:11:32 -05:00 by GiteaMirror · 7 comments

Originally created by @goactiongo on GitHub (Feb 11, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9005

What is the issue?

Linux, Ollama 0.5.4

I have 4 A30 cards.

When I tested deepseek-r1:70b through the OpenAI-compatible API, the model's response speed was acceptable and it used GPU resources.
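
For reference, a minimal sketch of this kind of OpenAI-compatible request against a local Ollama server (the base_url is Ollama's default; the prompt and key are illustrative, not taken from the report):

```python
# Minimal OpenAI-compatible chat request to a local Ollama server.
# Ollama ignores the api_key value, but the client library requires one.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1:70b",
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
)
print(response.choices[0].message.content)
```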
Log file as follows:
deepseek-r1.txt

2月 11 14:49:00 gpu ollama[7229]: llm_load_print_meta: max token length = 256
2月 11 14:49:00 gpu ollama[7229]: llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors: offloading 80 repeating layers to GPU
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors: offloading output layer to GPU
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors: offloaded 81/81 layers to GPU
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors:   CPU_Mapped model buffer size =   563.62 MiB
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors:        CUDA0 model buffer size = 10425.88 MiB
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors:        CUDA1 model buffer size =  9612.94 MiB
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors:        CUDA2 model buffer size =  9612.94 MiB
2月 11 14:49:03 gpu ollama[7229]: llm_load_tensors:        CUDA3 model buffer size = 10327.73 MiB
...omitted...
2月 11 14:49:09 gpu ollama[7229]: time=2025-02-11T14:49:09.978+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server not responding"
2月 11 14:49:10 gpu ollama[7229]: time=2025-02-11T14:49:10.265+08:00 level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2月 11 14:49:10 gpu ollama[7229]: time=2025-02-11T14:49:10.265+08:00 level=DEBUG source=server.go:600 msg="model load progress 1.00"
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: n_seq_max     = 1
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: n_ctx         = 2048
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: n_ctx_per_seq = 2048
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: n_batch       = 512
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: n_ubatch      = 512
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: flash_attn    = 1
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: freq_base     = 500000.0
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: freq_scale    = 1
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
2月 11 14:49:10 gpu ollama[7229]: llama_kv_cache_init:      CUDA0 KV buffer size =   168.00 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_kv_cache_init:      CUDA1 KV buffer size =   160.00 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_kv_cache_init:      CUDA2 KV buffer size =   160.00 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_kv_cache_init:      CUDA3 KV buffer size =   152.00 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.52 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model:      CUDA0 compute buffer size =   242.01 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model:      CUDA1 compute buffer size =   232.01 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model:      CUDA2 compute buffer size =   232.01 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model:      CUDA3 compute buffer size =   338.52 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model:  CUDA_Host compute buffer size =    32.02 MiB
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: graph nodes  = 2247
2月 11 14:49:10 gpu ollama[7229]: llama_new_context_with_model: graph splits = 5
2月 11 14:49:10 gpu ollama[7229]: time=2025-02-11T14:49:10.516+08:00 level=INFO source=server.go:594 msg="llama runner started in 12.57 seconds"
2月 11 14:49:10 gpu ollama[7229]: time=2025-02-11T14:49:10.516+08:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-4cd576d9aa16961244012223abf01445567b061f1814b57dfef699e4cf8df339

However, when I created a new model deepseek-r1:70b_121234 with the settings `PARAMETER num_predict -1` and `PARAMETER num_ctx 121234`, the model took a long time to respond during the same API test and produced output very slowly, at roughly one character per second. It also used almost no GPU resources.
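
A sketch of the Modelfile that would produce such a custom model, assuming the stock deepseek-r1:70b tag as the base:

```
FROM deepseek-r1:70b
PARAMETER num_predict -1
PARAMETER num_ctx 121234
```

built with something like `ollama create deepseek-r1:70b_121234 -f Modelfile`.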
Log file as follows:

NAME                      ID              SIZE      PROCESSOR          UNTIL
deepseek-r1:70b           0c1615a8ca32    51 GB     100% GPU           4 minutes from now
deepseek-r1:70b_121234    ffb79c778740    156 GB    41%/59% CPU/GPU    4 minutes from now

deepseek-31-withCTX.txt

2月 11 14:44:31 gpu ollama[7229]: llm_load_print_meta: max token length = 256
2月 11 14:44:31 gpu ollama[7229]: llm_load_tensors: tensor 'token_embd.weight' (q4_K) (and 612 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors: offloading 19 repeating layers to GPU
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors: offloaded 19/81 layers to GPU
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors:          CPU model buffer size =   563.62 MiB
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors:    CUDA_Host model buffer size = 30473.73 MiB
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors:        CUDA0 model buffer size =   460.06 MiB
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors:        CUDA1 model buffer size =  2878.00 MiB
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors:        CUDA2 model buffer size =  3054.44 MiB
2月 11 14:44:42 gpu ollama[7229]: llm_load_tensors:        CUDA3 model buffer size =  3113.25 MiB
2月 11 14:44:42 gpu ollama[7229]: load_all_data: no device found for buffer type CPU for async uploads
.....omitted....
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: n_seq_max     = 1
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: n_ctx         = 121344
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: n_ctx_per_seq = 121344
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: n_batch       = 512
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: n_ubatch      = 512
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: flash_attn    = 1
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: freq_base     = 500000.0
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: freq_scale    = 1
2月 11 14:44:50 gpu ollama[7229]: llama_new_context_with_model: n_ctx_per_seq (121344) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
2月 11 14:44:50 gpu ollama[7229]: time=2025-02-11T14:44:50.558+08:00 level=DEBUG source=server.go:600 msg="model load progress 1.00"
2月 11 14:44:50 gpu ollama[7229]: time=2025-02-11T14:44:50.810+08:00 level=DEBUG source=server.go:603 msg="model load completed, waiting for server to become available" status="llm server loading model"
2月 11 14:44:58 gpu ollama[7229]: llama_kv_cache_init:        CPU KV buffer size = 28914.00 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_kv_cache_init:      CUDA0 KV buffer size =   474.00 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_kv_cache_init:      CUDA1 KV buffer size =  2844.00 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_kv_cache_init:      CUDA2 KV buffer size =  2844.00 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_kv_cache_init:      CUDA3 KV buffer size =  2844.00 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model: KV self size  = 37920.00 MiB, K (f16): 18960.00 MiB, V (f16): 18960.00 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model:        CPU  output buffer size =     0.52 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model:      CUDA0 compute buffer size =  1088.45 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model:      CUDA1 compute buffer size =   262.50 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model:      CUDA2 compute buffer size =   262.50 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model:      CUDA3 compute buffer size =   262.50 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model:  CUDA_Host compute buffer size =   253.01 MiB
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model: graph nodes  = 2247
2月 11 14:44:58 gpu ollama[7229]: llama_new_context_with_model: graph splits = 679 (with bs=512), 6 (with bs=1)
2月 11 14:44:59 gpu ollama[7229]: time=2025-02-11T14:44:59.101+08:00 level=INFO source=server.go:594 msg="llama runner started in 28.91 seconds"
2月 11 14:44:59 gpu ollama[7229]: time=2025-02-11T14:44:59.101+08:00 level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-4cd576d9aa16961244012223abf01445567b061f1814b57dfef699e4cf8df339

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.4

GiteaMirror added the bug label 2026-04-12 17:11:32 -05:00

@rick-github commented on GitHub (Feb 11, 2025):

You have increased the size of the context buffer with the result that not all of the model fits in the GPU, so 41% has spilled to system RAM where the CPU does the processing. The CPU is slower than the GPU, so token generation is slower. GPU utilization is lower because it is waiting for the CPU.

Also, with `num_predict:-1` and `num_ctx: 121234` you are making the model more likely to fail from k-shift errors (https://github.com/ollama/ollama/issues/5975) with long prompts and long results.


@goactiongo commented on GitHub (Feb 11, 2025):

I'm confused, so how do I resolve the k-shift issue?


@rick-github commented on GitHub (Feb 11, 2025):

K-shift failure occurs when the context buffer fills up during token generation and the inference engine wants to shift the buffer to make room for more tokens. Deepseek doesn't support shifting, so the context buffer has to be big enough to contain the sum of the number of input tokens and output tokens. So, if you expect a user input of up to 5,000 tokens, and you want to generate up to 2,000 tokens in response, you set `num_ctx:7000` and `num_predict:2000`.
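
As a concrete sketch of that sizing (base tag and custom name illustrative):

```
FROM deepseek-r1:70b
PARAMETER num_ctx 7000
PARAMETER num_predict 2000
```

After `ollama create deepseek-r1:70b-7k -f Modelfile`, the context buffer can hold a 5,000-token prompt plus a 2,000-token response.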


@goactiongo commented on GitHub (Feb 11, 2025):

Thank you for your reply. I understand what you mean, but my requirement is for DeepSeek to summarize longer documents, which might be quite lengthy. Given that DeepSeek can support a context length of 131,072, does this mean that as long as num_ctx + num_predict < 131,072, it would be acceptable (as in the example I mentioned above)? Additionally, in the issue mentioned above, why did changing num_ctx and num_predict result in spilling over to system RAM?

<!-- gh-comment-id:2650409325 --> @goactiongo commented on GitHub (Feb 11, 2025): Thank you for your reply. I understand what you mean, but my requirement is for DeepSeek to summarize longer documents, which might be quite lengthy. Given that DeepSeek can support a context length of 131,072, does this mean that as long as num_ctx + num_predict < 131,072, it would be acceptable (as in the example I mentioned above)? Additionally, in the issue mentioned above, why did changing num_ctx and num_predict result in spilling over to system RAM?
Author
Owner

@rick-github commented on GitHub (Feb 11, 2025):

> Given that DeepSeek can support a context length of 131,072, does this mean that as long as num_ctx + num_predict < 131,072

No. num_input_tokens + num_output_tokens < num_ctx is the constraint that needs to be satisfied. You can set `num_ctx:131072` and `num_predict:31072` and as long as the input text does not exceed 100000 tokens, it will be fine.
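
A rough pre-flight check along those lines (the 4-characters-per-token ratio is a crude heuristic, not the model's real tokenizer):

```python
# Estimate whether a document fits the context window before sending it.
# NUM_CTX and NUM_PREDICT mirror the values discussed above.
NUM_CTX = 131072
NUM_PREDICT = 31072

def fits_in_context(text: str) -> bool:
    est_input_tokens = len(text) / 4  # crude chars-per-token estimate
    return est_input_tokens + NUM_PREDICT < NUM_CTX

with open("long_document.txt") as f:
    doc = f.read()
print("fits" if fits_in_context(doc) else "too long: truncate or raise num_ctx")
```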

> why did changing num_ctx and num_predict result in spilling over to system RAM?

`num_ctx` is the size of the context buffer that holds the input and output tokens. Memory needs to be allocated to hold this buffer. When `num_ctx` is increased, the amount of VRAM needed to hold the buffer increases. If the size of the VRAM used by the context buffer increases, there is less VRAM for model weights. If the model weights don't fit in VRAM, they are loaded into system RAM.
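
A back-of-envelope calculation of that context-buffer cost, assuming the geometry implied by the logs (80 layers, grouped-query attention with 8 KV heads of dimension 128, f16 K and V):

```python
# KV-cache size grows linearly with num_ctx: K and V each hold
# n_layers * n_ctx * n_kv_heads * head_dim f16 values.
def kv_cache_mib(n_ctx, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_val=2):
    total_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_val
    return total_bytes / 1024**2

print(kv_cache_mib(2048))    # 640.0   -> matches "KV self size =   640.00 MiB"
print(kv_cache_mib(121344))  # 37920.0 -> matches "KV self size = 37920.00 MiB"
```

That roughly 37 GiB of KV cache competes with the roughly 40 GiB of model weights for VRAM, which is why only 19 of 81 layers were offloaded in the second log.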


@goactiongo commented on GitHub (Feb 11, 2025):

Thanks, guys.


@rick-github commented on GitHub (Feb 11, 2025):

Sorry, you can ignore all I wrote about k-shift. You are using the 70b model, which is not the deepseek architecture; it is a finetuned llama model. llama does not have a k-shift restriction. There are so many tickets filed about deepseek-r1 that are caused by k-shift that I overlooked the 70b tag.

Reference: github-starred/ollama#5852