[GH-ISSUE #4212] Long-context models don't split memory correctly, leading to OOM error #28384

Open
opened 2026-04-22 06:32:36 -05:00 by GiteaMirror · 0 comments
Owner

Originally created by @kungfu-eric on GitHub (May 6, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4212

Originally assigned to: @mxyng on GitHub.

What is the issue?

Using Mixtral with the default 2048-token context, memory splits across the 2x GPUs at ~12 GB each. When extending the context to 12k, it puts everything on one GPU, using 29 GB. Ideally it would split equally as before, so I could push to a higher 16k context without OOM. Hardware: 2x 48 GB A6000. Possibly related to https://github.com/ollama/ollama/issues/1341
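For reproduction, the extended context would typically be requested via the `num_ctx` option on `/api/chat` (or `PARAMETER num_ctx` in a Modelfile). A minimal sketch of such a request body, assuming a model tagged `mixtral` (the exact model tag and prompt are placeholders, not taken from this report):

```python
import json

# Hypothetical body for POST http://localhost:11434/api/chat.
# "options.num_ctx" is the Ollama option that extends the context
# window; 12288 matches the ~12k context mentioned above.
payload = {
    "model": "mixtral",
    "messages": [{"role": "user", "content": "..."}],
    "options": {"num_ctx": 12288},
}
print(json.dumps(payload))
```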

[GIN] 2024/05/05 - 23:38:16 | 200 |  24.89660366s |      172.17.0.1 | POST     "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42754,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42755,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":42756,"tid":"139643039125504","timestamp":1714977496}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977496}
{"function":"launch_slot_with_data","level":"INFO","line":830,"msg":"slot is processing task","slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1809,"msg":"slot progression","n_past":15,"n_past_se":0,"n_prompt_tokens_processed":15377,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"update_slots","level":"INFO","line":1836,"msg":"kv cache rm [p0, end)","p0":15,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977496}
{"function":"print_timings","level":"INFO","line":269,"msg":"prompt eval time     =   24744.71 ms / 15377 tokens (    1.61 ms per token,   621.43 tokens per second)","n_prompt_tokens_processed":15377,"n_tokens_second":621.4256758605363,"slot_id":0,"t_prompt_processing":24744.713,"t_token":1.6092029004357156,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":283,"msg":"generation eval time =   13787.47 ms /   550 runs   (   25.07 ms per token,    39.89 tokens per second)","n_decoded":550,"n_tokens_second":39.891292601180645,"slot_id":0,"t_token":25.06812727272727,"t_token_generation":13787.47,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"print_timings","level":"INFO","line":293,"msg":"          total time =   38532.18 ms","slot_id":0,"t_prompt_processing":24744.713,"t_token_generation":13787.47,"t_total":38532.183,"task_id":42757,"tid":"139643039125504","timestamp":1714977535}
{"function":"update_slots","level":"INFO","line":1640,"msg":"slot released","n_cache_tokens":15942,"n_ctx":16384,"n_past":15941,"n_system_tokens":0,"slot_id":0,"task_id":42757,"tid":"139643039125504","timestamp":1714977535,"truncated":false}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/completion","remote_addr":"127.0.0.1","remote_port":56082,"status":200,"tid":"139637473294080","timestamp":1714977535}
[GIN] 2024/05/05 - 23:38:55 | 200 | 38.672720153s |      172.17.0.1 | POST     "/api/chat"
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43310,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43311,"tid":"139643039125504","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"GET","msg":"request","params":{},"path":"/health","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"log_server_request","level":"INFO","line":2734,"method":"POST","msg":"request","params":{},"path":"/tokenize","remote_addr":"127.0.0.1","remote_port":35540,"status":200,"tid":"139637464901376","timestamp":1714977535}
{"function":"process_single_task","level":"INFO","line":1506,"msg":"slot data","n_idle_slots":1,"n_processing_slots":0,"task_id":43312,"tid":"139643039125504","timestamp":17149775
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8x7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 46.70 B
llm_load_print_meta: model size       = 24.62 GiB (4.53 BPW)
llm_load_print_meta: general.name     = mistralai
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.42 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:  CUDA_Host buffer size = 25215.87 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1145.00 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 1200621568
llama_new_context_with_model: failed to allocate compute buffers
llama_init_from_gpt_params: error: failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'
{"function":"load_model","level":"ERR","line":410,"model":"/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b","msg":"unable to load model","tid":"140631930466304","timestamp":1714999945}
time=2024-05-06T05:52:25.670-07:00 level=ERROR source=sched.go:333 msg="error loading llama server" error="llama runner process no longer running: 1 error:failed to create context with model '/root/.ollama/models/blobs/sha256-e9e56e8bb5f0fcd4860675e6837a8f6a94e659f5fa7dce6a1076279336320f2b'"
[GIN] 2024/05/06 - 05:52:25 | 500 | 20.037722871s |      172.17.0.1 | POST     "/api/chat"
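The 2048 MiB KV-cache figure in the log is consistent with a back-of-envelope check. A sketch of that arithmetic, where the layer/head counts are Mixtral 8x7B's published architecture (assumed here, not read from this log):

```python
# KV cache bytes = (K and V) * layers * tokens * kv_heads * head_dim * bytes/elem
n_layers, n_kv_heads, head_dim = 32, 8, 128   # Mixtral 8x7B architecture (assumed)
n_ctx = 16384                                 # llama_new_context_with_model: n_ctx = 16384
bytes_per_elem = 2                            # f16 cache, per "K (f16) ... V (f16)"

kv_bytes = 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem
print(kv_bytes / 2**20)  # → 2048.0, matching "KV self size  = 2048.00 MiB"
```

Note that the allocation that actually fails is not the KV cache (which landed in `CUDA_Host` memory, since 0/33 layers were offloaded) but a 1145 MiB compute buffer on device 0, suggesting the scheduler's per-GPU memory estimate, rather than total VRAM, is what's off.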

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.1.33

GiteaMirror added the gpu, nvidia, bug labels 2026-04-22 06:32:36 -05:00
Reference: github-starred/ollama#28384