[GH-ISSUE #8447] error "cudaMalloc failed: out of memory"; can't configure Ollama for valid CPU/GPU offloading #5432

Closed
opened 2026-04-12 16:40:06 -05:00 by GiteaMirror · 12 comments

Originally created by @SlavikCA on GitHub (Jan 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8447

What is the issue?

System:

  • Ubuntu 22 server with Docker
  • 640GB RAM
  • Nvidia RTX 3090 with 24GB VRAM
  • 2x Intel Xeon Gold 5218

Docker compose:

services:
  ollama:
    image: ollama/ollama:0.5.6
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 11434:11434
    tty: true
    environment:
      - 'OLLAMA_FLASH_ATTENTION=1'  # only GPUs with compute capability 7+ support flash attention
      - 'OLLAMA_KV_CACHE_TYPE=q8_0' # Quantization type for the K/V cache (default: f16)
      - 'OLLAMA_NUM_PARALLEL=1'     # The maximum number of parallel requests each model will process at the same time
      - 'OLLAMA_GPU_OVERHEAD=8G'    # Reserve a portion of VRAM per GPU (bytes)

Steps:

  • download Deepseek V3 model (default q4_K_M quantization)
  • run the query (a command-line sketch follows below)
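
A minimal command-line sketch of those two steps, assuming the model is pulled from the Ollama library under the deepseek-v3 tag (the tag is an assumption; the "ollama" service name comes from the compose file above, so adjust both to the actual setup):

docker compose exec ollama ollama pull deepseek-v3
docker compose exec ollama ollama run deepseek-v3 "why the sky is blue?"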

Error:

ollama | ggml_cuda_init: found 1 CUDA devices:
ollama | Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ollama | time=2025-01-16T04:29:50.986Z level=INFO source=runner.go:937 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=16
ollama | time=2025-01-16T04:29:50.987Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:45667"
ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23891 MiB free
ollama | llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517 (version GGUF V3 (latest))
...
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 35533.34 MiB on device 0: cudaMalloc failed: out of memory
ollama | llama_model_load: error loading model: unable to allocate CUDA0 buffer
ollama | llama_load_model_from_file: failed to load model
ollama | panic: unable to load model: /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517
ollama |
ollama | goroutine 7 [running]:
ollama | github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc0000be1b0, {0x5, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc000026200, 0x0}, ...)
ollama | github.com/ollama/ollama/llama/runner/runner.go:852 +0x3ad
ollama | created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
ollama | github.com/ollama/ollama/llama/runner/runner.go:970 +0xd0d
ollama | time=2025-01-16T04:30:11.539Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama | time=2025-01-16T04:30:11.790Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer\nllama_load_model_from_file: failed to load model"

I tried OLLAMA_GPU_OVERHEAD set to 8G, and also without OLLAMA_GPU_OVERHEAD: same result either way. It looks like OLLAMA_GPU_OVERHEAD doesn't do anything.

I can run the model WITHOUT the GPU, and it works. But I can't offload anything onto the GPU.

With the GPU, according to the logs above, it tried to allocate 35 GB of VRAM when I only have 24 GB.

Is there a way to configure Ollama to better calculate VRAM allocation?

Ollama version

0.5.6

GiteaMirror added the bug label 2026-04-12 16:40:06 -05:00

@rick-github commented on GitHub (Jan 16, 2025):

Full logs will help in debugging. What's the query you are sending? Have you modified any settings like num_ctx or num_gpu?


@SlavikCA commented on GitHub (Jan 16, 2025):

Query: why the sky is blue?

All parameters, including num_ctx and num_gpu, are set to their defaults.

Full log:

ollama | 2025/01/16 04:29:39 config.go:215: WARN invalid environment variable, using default key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | 2025/01/16 04:29:39 config.go:215: WARN invalid environment variable, using default key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | 2025/01/16 04:29:39 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
ollama | time=2025-01-16T04:29:39.657Z level=INFO source=images.go:432 msg="total blobs: 42"
ollama | time=2025-01-16T04:29:39.658Z level=INFO source=images.go:439 msg="total unused blobs removed: 0"
ollama | [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
ollama |
ollama | [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
ollama | - using env: export GIN_MODE=release
ollama | - using code: gin.SetMode(gin.ReleaseMode)
ollama |
ollama | [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
ollama | [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
ollama | [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
ollama | [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
ollama | [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
ollama | [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
ollama | [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
ollama | [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
ollama | [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
ollama | [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
ollama | [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
ollama | [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
ollama | [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
ollama | [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
ollama | [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
ollama | time=2025-01-16T04:29:39.659Z level=INFO source=routes.go:1238 msg="Listening on [::]:11434 (version 0.5.6-0-g2539f2d-dirty)"
ollama | time=2025-01-16T04:29:39.661Z level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx cpu]"
ollama | time=2025-01-16T04:29:39.661Z level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
ollama | time=2025-01-16T04:29:39.960Z level=INFO source=types.go:131 msg="inference compute" id=GPU-832fc4ab-1e74-2a7f-773b-27cbd204bebf library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
ollama | time=2025-01-16T04:29:50.196Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.395Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.741Z level=INFO source=server.go:104 msg="system memory" total="628.5 GiB" free="623.2 GiB" free_swap="0 B"
ollama | time=2025-01-16T04:29:50.742Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.909Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.909Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=62 layers.offload=5 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="415.7 GiB" memory.required.partial="17.6 GiB" memory.required.kv="9.5 GiB" memory.required.allocations="[17.6 GiB]" memory.weights.total="385.0 GiB" memory.weights.repeating="384.3 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="654.0 MiB" memory.graph.partial="1019.5 MiB"
ollama | time=2025-01-16T04:29:50.909Z level=WARN source=server.go:216 msg="flash attention enabled but not supported by model"
ollama | time=2025-01-16T04:29:50.909Z level=WARN source=server.go:234 msg="quantized kv cache requested but flash attention disabled" type=q8_0
ollama | time=2025-01-16T04:29:50.909Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517 --ctx-size 2048 --batch-size 512 --n-gpu-layers 5 --threads 16 --parallel 1 --port 45667"
ollama | time=2025-01-16T04:29:50.910Z level=INFO source=sched.go:449 msg="loaded runners" count=1
ollama | time=2025-01-16T04:29:50.910Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
ollama | time=2025-01-16T04:29:50.910Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama | time=2025-01-16T04:29:50.970Z level=INFO source=runner.go:936 msg="starting go runner"
ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama | ggml_cuda_init: found 1 CUDA devices:
ollama | Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ollama | time=2025-01-16T04:29:50.986Z level=INFO source=runner.go:937 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=16
ollama | time=2025-01-16T04:29:50.987Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:45667"
ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23891 MiB free
ollama | llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517 (version GGUF V3 (latest))
ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama | llama_model_loader: - kv 0: general.architecture str = deepseek2
ollama | llama_model_loader: - kv 1: general.type str = model
ollama | llama_model_loader: - kv 2: general.size_label str = 256x20B
ollama | llama_model_loader: - kv 3: deepseek2.block_count u32 = 61
ollama | llama_model_loader: - kv 4: deepseek2.context_length u32 = 163840
ollama | llama_model_loader: - kv 5: deepseek2.embedding_length u32 = 7168
ollama | llama_model_loader: - kv 6: deepseek2.feed_forward_length u32 = 18432
ollama | llama_model_loader: - kv 7: deepseek2.attention.head_count u32 = 128
ollama | llama_model_loader: - kv 8: deepseek2.attention.head_count_kv u32 = 128
ollama | llama_model_loader: - kv 9: deepseek2.rope.freq_base f32 = 10000.000000
ollama | llama_model_loader: - kv 10: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
ollama | llama_model_loader: - kv 11: deepseek2.expert_used_count u32 = 8
ollama | llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 3
ollama | llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 129280
ollama | llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
ollama | llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
ollama | llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
ollama | llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
ollama | llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 2048
ollama | llama_model_loader: - kv 19: deepseek2.expert_count u32 = 256
ollama | llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 1
ollama | llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 2.500000
ollama | llama_model_loader: - kv 22: deepseek2.expert_weights_norm bool = true
ollama | llama_model_loader: - kv 23: deepseek2.expert_gating_func u32 = 2
ollama | llama_model_loader: - kv 24: deepseek2.rope.dimension_count u32 = 64
ollama | llama_model_loader: - kv 25: deepseek2.rope.scaling.type str = yarn
ollama | llama_model_loader: - kv 26: deepseek2.rope.scaling.factor f32 = 40.000000
ollama | llama_model_loader: - kv 27: deepseek2.rope.scaling.original_context_length u32 = 4096
ollama | llama_model_loader: - kv 28: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
ollama | llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
ollama | llama_model_loader: - kv 30: tokenizer.ggml.pre str = deepseek-v3
ollama | time=2025-01-16T04:29:51.163Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
ollama | llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
ollama | llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama | llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
ollama | llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 0
ollama | llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 1
ollama | llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 1
ollama | llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = true
ollama | llama_model_loader: - kv 38: tokenizer.ggml.add_eos_token bool = false
ollama | llama_model_loader: - kv 39: tokenizer.chat_template str = {% if not add_generation_prompt is de...
ollama | llama_model_loader: - kv 40: general.quantization_version u32 = 2
ollama | llama_model_loader: - kv 41: general.file_type u32 = 15
ollama | llama_model_loader: - type f32: 361 tensors
ollama | llama_model_loader: - type q4_K: 606 tensors
ollama | llama_model_loader: - type q6_K: 58 tensors
ollama | llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
ollama | llm_load_vocab: special tokens cache size = 818
ollama | llm_load_vocab: token to piece cache size = 0.8223 MB
ollama | llm_load_print_meta: format = GGUF V3 (latest)
ollama | llm_load_print_meta: arch = deepseek2
ollama | llm_load_print_meta: vocab type = BPE
ollama | llm_load_print_meta: n_vocab = 129280
ollama | llm_load_print_meta: n_merges = 127741
ollama | llm_load_print_meta: vocab_only = 0
ollama | llm_load_print_meta: n_ctx_train = 163840
ollama | llm_load_print_meta: n_embd = 7168
ollama | llm_load_print_meta: n_layer = 61
ollama | llm_load_print_meta: n_head = 128
ollama | llm_load_print_meta: n_head_kv = 128
ollama | llm_load_print_meta: n_rot = 64
ollama | llm_load_print_meta: n_swa = 0
ollama | llm_load_print_meta: n_embd_head_k = 192
ollama | llm_load_print_meta: n_embd_head_v = 128
ollama | llm_load_print_meta: n_gqa = 1
ollama | llm_load_print_meta: n_embd_k_gqa = 24576
ollama | llm_load_print_meta: n_embd_v_gqa = 16384
ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama | llm_load_print_meta: n_ff = 18432
ollama | llm_load_print_meta: n_expert = 256
ollama | llm_load_print_meta: n_expert_used = 8
ollama | llm_load_print_meta: causal attn = 1
ollama | llm_load_print_meta: pooling type = 0
ollama | llm_load_print_meta: rope type = 0
ollama | llm_load_print_meta: rope scaling = yarn
ollama | llm_load_print_meta: freq_base_train = 10000.0
ollama | llm_load_print_meta: freq_scale_train = 0.025
ollama | llm_load_print_meta: n_ctx_orig_yarn = 4096
ollama | llm_load_print_meta: rope_finetuned = unknown
ollama | llm_load_print_meta: ssm_d_conv = 0
ollama | llm_load_print_meta: ssm_d_inner = 0
ollama | llm_load_print_meta: ssm_d_state = 0
ollama | llm_load_print_meta: ssm_dt_rank = 0
ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
ollama | llm_load_print_meta: model type = 671B
ollama | llm_load_print_meta: model ftype = Q4_K - Medium
ollama | llm_load_print_meta: model params = 671.03 B
ollama | llm_load_print_meta: model size = 376.65 GiB (4.82 BPW)
ollama | llm_load_print_meta: general.name = n/a
ollama | llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
ollama | llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: LF token = 131 'Ä'
ollama | llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>'
ollama | llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>'
ollama | llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>'
ollama | llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: max token length = 256
ollama | llm_load_print_meta: n_layer_dense_lead = 3
ollama | llm_load_print_meta: n_lora_q = 1536
ollama | llm_load_print_meta: n_lora_kv = 512
ollama | llm_load_print_meta: n_ff_exp = 2048
ollama | llm_load_print_meta: n_expert_shared = 1
ollama | llm_load_print_meta: expert_weights_scale = 2.5
ollama | llm_load_print_meta: expert_weights_norm = 1
ollama | llm_load_print_meta: expert_gating_func = sigmoid
ollama | llm_load_print_meta: rope_yarn_log_mul = 0.1000
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 35533.34 MiB on device 0: cudaMalloc failed: out of memory
ollama | llama_model_load: error loading model: unable to allocate CUDA0 buffer
ollama | llama_load_model_from_file: failed to load model
ollama | panic: unable to load model: /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517
ollama |
ollama | goroutine 7 [running]:
ollama | github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc0000be1b0, {0x5, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc000026200, 0x0}, ...)
ollama | github.com/ollama/ollama/llama/runner/runner.go:852 +0x3ad
ollama | created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
ollama | github.com/ollama/ollama/llama/runner/runner.go:970 +0xd0d
ollama | time=2025-01-16T04:30:11.539Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama | time=2025-01-16T04:30:11.790Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer\nllama_load_model_from_file: failed to load model"
ollama | [GIN] 2025/01/16 - 04:30:11 | 500 | 21.870863244s | 192.168.0.75 | POST "/api/chat"
ollama | time=2025-01-16T04:30:16.971Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.180716149 model=/root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517


@rick-github commented on GitHub (Jan 16, 2025):

It's unclear why llama.cpp is overallocating. The deepseek class of models has always had problematic allocations; the usual way to deal with it is to reduce the number of layers offloaded. Setting OLLAMA_GPU_OVERHEAD is one way to do that, but it takes a value in bytes, not a human-readable string. So try OLLAMA_GPU_OVERHEAD=8589934592.
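
For reference, a minimal sketch of how that byte value would look in the compose file from the issue description (8589934592 bytes = 8 GiB; everything else stays as-is):

    environment:
      - 'OLLAMA_GPU_OVERHEAD=8589934592'  # 8 GiB expressed in bytes, not '8G'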


@SlavikCA commented on GitHub (Jan 16, 2025):

OK, I see my mistake, based on this warning:

invalid environment variable, using default key=OLLAMA_GPU_OVERHEAD value=8G default=0

So I tried entering the value in bytes instead of with a G suffix:

  - 'OLLAMA_GPU_OVERHEAD=8032385536'   

ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 28426.68 MiB on device 0: cudaMalloc failed: out of memory

  - 'OLLAMA_GPU_OVERHEAD=16032385536'   

ollama | llm_load_tensors: offloading 3 repeating layers to GPU
ollama | llm_load_tensors: offloaded 3/62 layers to GPU
ollama | llm_load_tensors: CPU_Mapped model buffer size = 364369.62 MiB
ollama | llm_load_tensors: CUDA0 model buffer size = 21320.01 MiB
ollama | llama_new_context_with_model: n_seq_max = 1
ollama | llama_new_context_with_model: n_ctx = 2048
ollama | llama_new_context_with_model: n_ctx_per_seq = 2048
ollama | llama_new_context_with_model: n_batch = 512
ollama | llama_new_context_with_model: n_ubatch = 512
ollama | llama_new_context_with_model: flash_attn = 0
ollama | llama_new_context_with_model: freq_base = 10000.0
ollama | llama_new_context_with_model: freq_scale = 0.025
ollama | llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
ollama | llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
ollama | llama_kv_cache_init: CPU KV buffer size = 9280.00 MiB
ollama | llama_kv_cache_init: CUDA0 KV buffer size = 480.00 MiB
ollama | llama_new_context_with_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
ollama | llama_new_context_with_model: CPU output buffer size = 0.52 MiB
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5030.00 MiB on device 0: cudaMalloc failed: out of memory
ollama | ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 5274339328
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2952.66 MiB on device 0: cudaMalloc failed: out of memory
ollama | ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 3096088576
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8686.04 MiB on device 0: cudaMalloc failed: out of memory
ollama | ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9107972096
ollama | llama_new_context_with_model: failed to allocate compute buffers
ollama | panic: unable to create llama context
ollama |
ollama | goroutine 7 [running]:
ollama | github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc0000be1b0, {0x3, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc000026200, 0x0}, ...)
ollama | github.com/ollama/ollama/llama/runner/runner.go:858 +0x39c
ollama | created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
ollama | github.com/ollama/ollama/llama/runner/runner.go:970 +0xd0d
ollama | time=2025-01-16T05:14:12.717Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server not responding"
ollama | time=2025-01-16T05:14:18.388Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9107972096\nllama_new_context_with_model: failed to allocate compute buffers"
ollama | [GIN] 2025/01/16 - 05:14:18 | 500 | 31.279653601s | 192.168.0.75 | POST "/api/chat"
ollama | time=2025-01-16T05:14:23.428Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.039409053 model=/root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517


@rick-github commented on GitHub (Jan 16, 2025):

Also note that flash attention is not supported in deepseek architecture models.


@rick-github commented on GitHub (Jan 16, 2025):

OK, llama.cpp is overallocating because ollama sucks at calculating memory requirements for deepseek. The model is 376 GiB and has 61 layers, so a layer is ~6 GiB on average. In its first attempt it wanted to offload 5 layers and use a 9.5 GiB KV cache; ~39 GiB is obviously not going to fit in 24 GiB. In the second attempt it looks like the KV cache got allocated, but the model weights still don't fit, so you will have to increase OLLAMA_GPU_OVERHEAD or set num_gpu.
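
Spelling out that back-of-envelope estimate with the figures from the logs above (all approximate):

376.65 GiB model / 61 layers   ≈ 6.2 GiB per layer
5 offloaded layers × 6.2 GiB   ≈ 31 GiB of weights
31 GiB + 9.5 GiB KV cache      ≈ 40 GiB requested, on a GPU with 23.3 GiB free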


@SlavikCA commented on GitHub (Jan 16, 2025):

With
- 'OLLAMA_GPU_OVERHEAD=22032385536'

I was finally able to offload 1 layer to the GPU.

ollama | llm_load_tensors: offloading 1 repeating layers to GPU
ollama | llm_load_tensors: offloaded 1/62 layers to GPU
ollama | llm_load_tensors: CPU_Mapped model buffer size = 378582.96 MiB
ollama | llm_load_tensors: CUDA0 model buffer size = 7106.67 MiB
ollama | llama_new_context_with_model: n_seq_max = 1
ollama | llama_new_context_with_model: n_ctx = 2048
ollama | llama_new_context_with_model: n_ctx_per_seq = 2048
ollama | llama_new_context_with_model: n_batch = 512
ollama | llama_new_context_with_model: n_ubatch = 512
ollama | llama_new_context_with_model: flash_attn = 0
ollama | llama_new_context_with_model: freq_base = 10000.0
ollama | llama_new_context_with_model: freq_scale = 0.025
ollama | llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
ollama | llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
ollama | llama_kv_cache_init: CPU KV buffer size = 9600.00 MiB
ollama | llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
ollama | llama_new_context_with_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
ollama | llama_new_context_with_model: CPU output buffer size = 0.52 MiB
ollama | llama_new_context_with_model: CUDA0 compute buffer size = 5030.00 MiB
ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 84.01 MiB
ollama | llama_new_context_with_model: graph nodes = 5025
ollama | llama_new_context_with_model: graph splits = 1129 (with bs=512), 3 (with bs=1)
ollama | time=2025-01-16T05:23:16.058Z level=INFO source=server.go:594 msg="llama runner started in 21.89 seconds"
ollama | [GIN] 2025/01/16 - 05:25:50 | 200 | 2m56s | 192.168.0.75 | POST "/api/chat"

Not much help from the GPU for this model.
Anyway, here is some feedback from me; maybe it can help with optimizing the "calculating memory requirements for deepseek" part.


@SlavikCA commented on GitHub (Jan 16, 2025):

The problem with figuring out the OLLAMA_GPU_OVERHEAD value is that the value needed for one model (DeepSeek V3) differs significantly from what other models (for example, qwen2.5) need.

So it sounds like OLLAMA_GPU_OVERHEAD should be a per-model parameter rather than a global one.

Otherwise, I currently need to restart Ollama whenever I want to switch from one model to another.


@rick-github commented on GitHub (Jan 16, 2025):

Since you now know how many layers you can offload, you can set num_gpu in the Modelfile. But ideally ollama should get the memory calculations right. It's been an issue for a while and I haven't had the cycles to look at it; I'll see if I can poke around in the near future.
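
A minimal Modelfile sketch of that suggestion (the FROM tag is a placeholder for whatever name the DeepSeek V3 model has locally; num_gpu 1 is the layer count that fit in the test above):

FROM deepseek-v3
PARAMETER num_gpu 1

It can then be built with something like ollama create deepseek-v3-gpu1 -f Modelfile, after which requests to that model name use the fixed layer count without any environment variable changes.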


@SlavikCA commented on GitHub (Jan 16, 2025):

I was thinking that num_gpu defines the number of GPUs used:

Set the number of GPU devices used for computation. This option controls how many GPU devices (if available) are used to process incoming requests. Increasing this value can significantly improve performance for models that are optimized for GPU acceleration but may also consume more power and GPU resources.

But in another place it has a different meaning:
https://github.com/ollama/ollama/blob/a420a453b4783841e3e79c248ef0fe9548df6914/cmd/interactive.go#L106

The number of layers to send to the GPU

Perhaps a different name could be used for that parameter?


@rick-github commented on GitHub (Jan 16, 2025):

open-webui's description is incorrect. num_gpu is well established; changing the name would break clients and Modelfiles.


@SlavikCA commented on GitHub (Jan 19, 2025):

Closing the issue, as I found that VRAM usage can be managed with num_gpu.

This issue is still unresolved:

llama.cpp is overallocating because ollama sucks at calculating memory requirements for deepseek

But I do not understand it, and it should probably be a separate issue.
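
For anyone landing here later: num_gpu can also be passed per request via the options field of the API, which avoids both a server restart and a server-wide environment variable. A sketch (the model name is a placeholder for the local tag):

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-v3",
  "messages": [{"role": "user", "content": "why the sky is blue?"}],
  "options": {"num_gpu": 1}
}'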

Reference: github-starred/ollama#5432