[GH-ISSUE #8447] error "cudaMalloc failed: out of memory"; can't configure Ollama for valid CPU/GPU offloading #5432

Closed
opened 2026-04-12 16:40:06 -05:00 by GiteaMirror · 12 comments

Originally created by @SlavikCA on GitHub (Jan 16, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/8447

What is the issue?

System:

  • Ubuntu 22 server with Docker
  • 640GB RAM
  • Nvidia RTX 3090 with 24GB VRAM
  • 2x Intel Xeon Gold 5218

Docker compose:

services:
  ollama:
    image: ollama/ollama:0.5.6
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - 11434:11434
    tty: true
    environment:
      - 'OLLAMA_FLASH_ATTENTION=1'  # only GPUs with compute capability 7+ support flash attention
      - 'OLLAMA_KV_CACHE_TYPE=q8_0' # Quantization type for the K/V cache (default: f16)
      - 'OLLAMA_NUM_PARALLEL=1'     # The maximum number of parallel requests each model will process at the same time
      - 'OLLAMA_GPU_OVERHEAD=8G'    # Reserve a portion of VRAM per GPU (bytes)

Steps:

  • download Deepseek V3 model (default q4_K_M quantization)
  • run the query (a command-line sketch follows below)
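
A minimal command-line sketch of those two steps, assuming the model is pulled from the Ollama library under the deepseek-v3 tag (the tag is an assumption; the "ollama" service name comes from the compose file above, so adjust both to the actual setup):

docker compose exec ollama ollama pull deepseek-v3
docker compose exec ollama ollama run deepseek-v3 "why the sky is blue?"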

Error:

ollama | ggml_cuda_init: found 1 CUDA devices:
ollama | Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ollama | time=2025-01-16T04:29:50.986Z level=INFO source=runner.go:937 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=16
ollama | time=2025-01-16T04:29:50.987Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:45667"
ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23891 MiB free
ollama | llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517 (version GGUF V3 (latest))
...
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 35533.34 MiB on device 0: cudaMalloc failed: out of memory
ollama | llama_model_load: error loading model: unable to allocate CUDA0 buffer
ollama | llama_load_model_from_file: failed to load model
ollama | panic: unable to load model: /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517
ollama |
ollama | goroutine 7 [running]:
ollama | github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc0000be1b0, {0x5, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc000026200, 0x0}, ...)
ollama | github.com/ollama/ollama/llama/runner/runner.go:852 +0x3ad
ollama | created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
ollama | github.com/ollama/ollama/llama/runner/runner.go:970 +0xd0d
ollama | time=2025-01-16T04:30:11.539Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama | time=2025-01-16T04:30:11.790Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer\nllama_load_model_from_file: failed to load model"

I tried OLLAMA_GPU_OVERHEAD set to 8G, and also without OLLAMA_GPU_OVERHEAD: same result either way. It looks like OLLAMA_GPU_OVERHEAD doesn't do anything.

I can run the model WITHOUT the GPU, and it works. But I can't offload anything onto the GPU.

With the GPU, according to the logs above, it tried to allocate 35 GB of VRAM when I only have 24 GB.

Is there a way to configure Ollama to better calculate VRAM allocation?

Ollama version

0.5.6

GiteaMirror added the bug label 2026-04-12 16:40:06 -05:00

@rick-github commented on GitHub (Jan 16, 2025):

Full logs will help in debugging. What's the query you are sending? Have you modified any settings like num_ctx or num_gpu?


@SlavikCA commented on GitHub (Jan 16, 2025):

Query: why the sky is blue?

All parameters, including num_ctx and num_gpu, are set to their defaults.

Full log:

ollama | 2025/01/16 04:29:39 config.go:215: WARN invalid environment variable, using default key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | 2025/01/16 04:29:39 config.go:215: WARN invalid environment variable, using default key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | 2025/01/16 04:29:39 routes.go:1187: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
ollama | time=2025-01-16T04:29:39.657Z level=INFO source=images.go:432 msg="total blobs: 42"
ollama | time=2025-01-16T04:29:39.658Z level=INFO source=images.go:439 msg="total unused blobs removed: 0"
ollama | [GIN-debug] [WARNING] Creating an Engine instance with the Logger and Recovery middleware already attached.
ollama |
ollama | [GIN-debug] [WARNING] Running in "debug" mode. Switch to "release" mode in production.
ollama | - using env: export GIN_MODE=release
ollama | - using code: gin.SetMode(gin.ReleaseMode)
ollama |
ollama | [GIN-debug] POST /api/pull --> github.com/ollama/ollama/server.(*Server).PullHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/generate --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/chat --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/embed --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/embeddings --> github.com/ollama/ollama/server.(*Server).EmbeddingsHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/create --> github.com/ollama/ollama/server.(*Server).CreateHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/push --> github.com/ollama/ollama/server.(*Server).PushHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/copy --> github.com/ollama/ollama/server.(*Server).CopyHandler-fm (5 handlers)
ollama | [GIN-debug] DELETE /api/delete --> github.com/ollama/ollama/server.(*Server).DeleteHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/show --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (5 handlers)
ollama | [GIN-debug] POST /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).CreateBlobHandler-fm (5 handlers)
ollama | [GIN-debug] HEAD /api/blobs/:digest --> github.com/ollama/ollama/server.(*Server).HeadBlobHandler-fm (5 handlers)
ollama | [GIN-debug] GET /api/ps --> github.com/ollama/ollama/server.(*Server).PsHandler-fm (5 handlers)
ollama | [GIN-debug] POST /v1/chat/completions --> github.com/ollama/ollama/server.(*Server).ChatHandler-fm (6 handlers)
ollama | [GIN-debug] POST /v1/completions --> github.com/ollama/ollama/server.(*Server).GenerateHandler-fm (6 handlers)
ollama | [GIN-debug] POST /v1/embeddings --> github.com/ollama/ollama/server.(*Server).EmbedHandler-fm (6 handlers)
ollama | [GIN-debug] GET /v1/models --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (6 handlers)
ollama | [GIN-debug] GET /v1/models/:model --> github.com/ollama/ollama/server.(*Server).ShowHandler-fm (6 handlers)
ollama | [GIN-debug] GET / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
ollama | [GIN-debug] GET /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
ollama | [GIN-debug] GET /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
ollama | [GIN-debug] HEAD / --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func1 (5 handlers)
ollama | [GIN-debug] HEAD /api/tags --> github.com/ollama/ollama/server.(*Server).ListHandler-fm (5 handlers)
ollama | [GIN-debug] HEAD /api/version --> github.com/ollama/ollama/server.(*Server).GenerateRoutes.func2 (5 handlers)
ollama | time=2025-01-16T04:29:39.659Z level=INFO source=routes.go:1238 msg="Listening on [::]:11434 (version 0.5.6-0-g2539f2d-dirty)"
ollama | time=2025-01-16T04:29:39.661Z level=INFO source=routes.go:1267 msg="Dynamic LLM libraries" runners="[cpu_avx cpu_avx2 cuda_v11_avx cuda_v12_avx cpu]"
ollama | time=2025-01-16T04:29:39.661Z level=INFO source=gpu.go:226 msg="looking for compatible GPUs"
ollama | time=2025-01-16T04:29:39.960Z level=INFO source=types.go:131 msg="inference compute" id=GPU-832fc4ab-1e74-2a7f-773b-27cbd204bebf library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3090" total="23.6 GiB" available="23.3 GiB"
ollama | time=2025-01-16T04:29:50.196Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.395Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.741Z level=INFO source=server.go:104 msg="system memory" total="628.5 GiB" free="623.2 GiB" free_swap="0 B"
ollama | time=2025-01-16T04:29:50.742Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.909Z level=WARN source=config.go:215 msg="invalid environment variable, using default" key=OLLAMA_GPU_OVERHEAD value=8G default=0
ollama | time=2025-01-16T04:29:50.909Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=62 layers.offload=5 layers.split="" memory.available="[23.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="415.7 GiB" memory.required.partial="17.6 GiB" memory.required.kv="9.5 GiB" memory.required.allocations="[17.6 GiB]" memory.weights.total="385.0 GiB" memory.weights.repeating="384.3 GiB" memory.weights.nonrepeating="725.0 MiB" memory.graph.full="654.0 MiB" memory.graph.partial="1019.5 MiB"
ollama | time=2025-01-16T04:29:50.909Z level=WARN source=server.go:216 msg="flash attention enabled but not supported by model"
ollama | time=2025-01-16T04:29:50.909Z level=WARN source=server.go:234 msg="quantized kv cache requested but flash attention disabled" type=q8_0
ollama | time=2025-01-16T04:29:50.909Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517 --ctx-size 2048 --batch-size 512 --n-gpu-layers 5 --threads 16 --parallel 1 --port 45667"
ollama | time=2025-01-16T04:29:50.910Z level=INFO source=sched.go:449 msg="loaded runners" count=1
ollama | time=2025-01-16T04:29:50.910Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
ollama | time=2025-01-16T04:29:50.910Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama | time=2025-01-16T04:29:50.970Z level=INFO source=runner.go:936 msg="starting go runner"
ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ollama | ggml_cuda_init: found 1 CUDA devices:
ollama | Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
ollama | time=2025-01-16T04:29:50.986Z level=INFO source=runner.go:937 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=16
ollama | time=2025-01-16T04:29:50.987Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:45667"
ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 3090) - 23891 MiB free
ollama | llama_model_loader: loaded meta data with 42 key-value pairs and 1025 tensors from /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517 (version GGUF V3 (latest))
ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
ollama | llama_model_loader: - kv 0: general.architecture str = deepseek2
ollama | llama_model_loader: - kv 1: general.type str = model
ollama | llama_model_loader: - kv 2: general.size_label str = 256x20B
ollama | llama_model_loader: - kv 3: deepseek2.block_count u32 = 61
ollama | llama_model_loader: - kv 4: deepseek2.context_length u32 = 163840
ollama | llama_model_loader: - kv 5: deepseek2.embedding_length u32 = 7168
ollama | llama_model_loader: - kv 6: deepseek2.feed_forward_length u32 = 18432
ollama | llama_model_loader: - kv 7: deepseek2.attention.head_count u32 = 128
ollama | llama_model_loader: - kv 8: deepseek2.attention.head_count_kv u32 = 128
ollama | llama_model_loader: - kv 9: deepseek2.rope.freq_base f32 = 10000.000000
ollama | llama_model_loader: - kv 10: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
ollama | llama_model_loader: - kv 11: deepseek2.expert_used_count u32 = 8
ollama | llama_model_loader: - kv 12: deepseek2.leading_dense_block_count u32 = 3
ollama | llama_model_loader: - kv 13: deepseek2.vocab_size u32 = 129280
ollama | llama_model_loader: - kv 14: deepseek2.attention.q_lora_rank u32 = 1536
ollama | llama_model_loader: - kv 15: deepseek2.attention.kv_lora_rank u32 = 512
ollama | llama_model_loader: - kv 16: deepseek2.attention.key_length u32 = 192
ollama | llama_model_loader: - kv 17: deepseek2.attention.value_length u32 = 128
ollama | llama_model_loader: - kv 18: deepseek2.expert_feed_forward_length u32 = 2048
ollama | llama_model_loader: - kv 19: deepseek2.expert_count u32 = 256
ollama | llama_model_loader: - kv 20: deepseek2.expert_shared_count u32 = 1
ollama | llama_model_loader: - kv 21: deepseek2.expert_weights_scale f32 = 2.500000
ollama | llama_model_loader: - kv 22: deepseek2.expert_weights_norm bool = true
ollama | llama_model_loader: - kv 23: deepseek2.expert_gating_func u32 = 2
ollama | llama_model_loader: - kv 24: deepseek2.rope.dimension_count u32 = 64
ollama | llama_model_loader: - kv 25: deepseek2.rope.scaling.type str = yarn
ollama | llama_model_loader: - kv 26: deepseek2.rope.scaling.factor f32 = 40.000000
ollama | llama_model_loader: - kv 27: deepseek2.rope.scaling.original_context_length u32 = 4096
ollama | llama_model_loader: - kv 28: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
ollama | llama_model_loader: - kv 29: tokenizer.ggml.model str = gpt2
ollama | llama_model_loader: - kv 30: tokenizer.ggml.pre str = deepseek-v3
ollama | time=2025-01-16T04:29:51.163Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
ollama | llama_model_loader: - kv 31: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<�...
ollama | llama_model_loader: - kv 32: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
ollama | llama_model_loader: - kv 33: tokenizer.ggml.merges arr[str,127741] = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
ollama | llama_model_loader: - kv 34: tokenizer.ggml.bos_token_id u32 = 0
ollama | llama_model_loader: - kv 35: tokenizer.ggml.eos_token_id u32 = 1
ollama | llama_model_loader: - kv 36: tokenizer.ggml.padding_token_id u32 = 1
ollama | llama_model_loader: - kv 37: tokenizer.ggml.add_bos_token bool = true
ollama | llama_model_loader: - kv 38: tokenizer.ggml.add_eos_token bool = false
ollama | llama_model_loader: - kv 39: tokenizer.chat_template str = {% if not add_generation_prompt is de...
ollama | llama_model_loader: - kv 40: general.quantization_version u32 = 2
ollama | llama_model_loader: - kv 41: general.file_type u32 = 15
ollama | llama_model_loader: - type f32: 361 tensors
ollama | llama_model_loader: - type q4_K: 606 tensors
ollama | llama_model_loader: - type q6_K: 58 tensors
ollama | llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
ollama | llm_load_vocab: special tokens cache size = 818
ollama | llm_load_vocab: token to piece cache size = 0.8223 MB
ollama | llm_load_print_meta: format = GGUF V3 (latest)
ollama | llm_load_print_meta: arch = deepseek2
ollama | llm_load_print_meta: vocab type = BPE
ollama | llm_load_print_meta: n_vocab = 129280
ollama | llm_load_print_meta: n_merges = 127741
ollama | llm_load_print_meta: vocab_only = 0
ollama | llm_load_print_meta: n_ctx_train = 163840
ollama | llm_load_print_meta: n_embd = 7168
ollama | llm_load_print_meta: n_layer = 61
ollama | llm_load_print_meta: n_head = 128
ollama | llm_load_print_meta: n_head_kv = 128
ollama | llm_load_print_meta: n_rot = 64
ollama | llm_load_print_meta: n_swa = 0
ollama | llm_load_print_meta: n_embd_head_k = 192
ollama | llm_load_print_meta: n_embd_head_v = 128
ollama | llm_load_print_meta: n_gqa = 1
ollama | llm_load_print_meta: n_embd_k_gqa = 24576
ollama | llm_load_print_meta: n_embd_v_gqa = 16384
ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
ollama | llm_load_print_meta: n_ff = 18432
ollama | llm_load_print_meta: n_expert = 256
ollama | llm_load_print_meta: n_expert_used = 8
ollama | llm_load_print_meta: causal attn = 1
ollama | llm_load_print_meta: pooling type = 0
ollama | llm_load_print_meta: rope type = 0
ollama | llm_load_print_meta: rope scaling = yarn
ollama | llm_load_print_meta: freq_base_train = 10000.0
ollama | llm_load_print_meta: freq_scale_train = 0.025
ollama | llm_load_print_meta: n_ctx_orig_yarn = 4096
ollama | llm_load_print_meta: rope_finetuned = unknown
ollama | llm_load_print_meta: ssm_d_conv = 0
ollama | llm_load_print_meta: ssm_d_inner = 0
ollama | llm_load_print_meta: ssm_d_state = 0
ollama | llm_load_print_meta: ssm_dt_rank = 0
ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
ollama | llm_load_print_meta: model type = 671B
ollama | llm_load_print_meta: model ftype = Q4_K - Medium
ollama | llm_load_print_meta: model params = 671.03 B
ollama | llm_load_print_meta: model size = 376.65 GiB (4.82 BPW)
ollama | llm_load_print_meta: general.name = n/a
ollama | llm_load_print_meta: BOS token = 0 '<|begin▁of▁sentence|>'
ollama | llm_load_print_meta: EOS token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: EOT token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: PAD token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: LF token = 131 'Ä'
ollama | llm_load_print_meta: FIM PRE token = 128801 '<|fim▁begin|>'
ollama | llm_load_print_meta: FIM SUF token = 128800 '<|fim▁hole|>'
ollama | llm_load_print_meta: FIM MID token = 128802 '<|fim▁end|>'
ollama | llm_load_print_meta: EOG token = 1 '<|end▁of▁sentence|>'
ollama | llm_load_print_meta: max token length = 256
ollama | llm_load_print_meta: n_layer_dense_lead = 3
ollama | llm_load_print_meta: n_lora_q = 1536
ollama | llm_load_print_meta: n_lora_kv = 512
ollama | llm_load_print_meta: n_ff_exp = 2048
ollama | llm_load_print_meta: n_expert_shared = 1
ollama | llm_load_print_meta: expert_weights_scale = 2.5
ollama | llm_load_print_meta: expert_weights_norm = 1
ollama | llm_load_print_meta: expert_gating_func = sigmoid
ollama | llm_load_print_meta: rope_yarn_log_mul = 0.1000
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 35533.34 MiB on device 0: cudaMalloc failed: out of memory
ollama | llama_model_load: error loading model: unable to allocate CUDA0 buffer
ollama | llama_load_model_from_file: failed to load model
ollama | panic: unable to load model: /root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517
ollama |
ollama | goroutine 7 [running]:
ollama | github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc0000be1b0, {0x5, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc000026200, 0x0}, ...)
ollama | github.com/ollama/ollama/llama/runner/runner.go:852 +0x3ad
ollama | created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
ollama | github.com/ollama/ollama/llama/runner/runner.go:970 +0xd0d
ollama | time=2025-01-16T04:30:11.539Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
ollama | time=2025-01-16T04:30:11.790Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer\nllama_load_model_from_file: failed to load model"
ollama | [GIN] 2025/01/16 - 04:30:11 | 500 | 21.870863244s | 192.168.0.75 | POST "/api/chat"
ollama | time=2025-01-16T04:30:16.971Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.180716149 model=/root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517


@rick-github commented on GitHub (Jan 16, 2025):

It's unclear why llama.cpp is overallocating. The deepseek class of models has always had problematic allocations; the usual way to deal with it is to reduce the number of layers offloaded. Setting OLLAMA_GPU_OVERHEAD is one way to do that, but it takes a value in bytes, not a human-readable string. So try OLLAMA_GPU_OVERHEAD=8589934592.
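
For reference, a minimal sketch of how that byte value would look in the compose file from the issue description (8589934592 bytes = 8 GiB; everything else stays as-is):

    environment:
      - 'OLLAMA_GPU_OVERHEAD=8589934592'  # 8 GiB expressed in bytes, not '8G'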


@SlavikCA commented on GitHub (Jan 16, 2025):

OK, I see my mistake, based on this warning:

invalid environment variable, using default key=OLLAMA_GPU_OVERHEAD value=8G default=0

So I tried entering the value in bytes instead of with a G suffix:

  - 'OLLAMA_GPU_OVERHEAD=8032385536'   

ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 28426.68 MiB on device 0: cudaMalloc failed: out of memory

  - 'OLLAMA_GPU_OVERHEAD=16032385536'   

ollama | llm_load_tensors: offloading 3 repeating layers to GPU
ollama | llm_load_tensors: offloaded 3/62 layers to GPU
ollama | llm_load_tensors: CPU_Mapped model buffer size = 364369.62 MiB
ollama | llm_load_tensors: CUDA0 model buffer size = 21320.01 MiB
ollama | llama_new_context_with_model: n_seq_max = 1
ollama | llama_new_context_with_model: n_ctx = 2048
ollama | llama_new_context_with_model: n_ctx_per_seq = 2048
ollama | llama_new_context_with_model: n_batch = 512
ollama | llama_new_context_with_model: n_ubatch = 512
ollama | llama_new_context_with_model: flash_attn = 0
ollama | llama_new_context_with_model: freq_base = 10000.0
ollama | llama_new_context_with_model: freq_scale = 0.025
ollama | llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
ollama | llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
ollama | llama_kv_cache_init: CPU KV buffer size = 9280.00 MiB
ollama | llama_kv_cache_init: CUDA0 KV buffer size = 480.00 MiB
ollama | llama_new_context_with_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
ollama | llama_new_context_with_model: CPU output buffer size = 0.52 MiB
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 5030.00 MiB on device 0: cudaMalloc failed: out of memory
ollama | ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 5274339328
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2952.66 MiB on device 0: cudaMalloc failed: out of memory
ollama | ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 3096088576
ollama | ggml_backend_cuda_buffer_type_alloc_buffer: allocating 8686.04 MiB on device 0: cudaMalloc failed: out of memory
ollama | ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9107972096
ollama | llama_new_context_with_model: failed to allocate compute buffers
ollama | panic: unable to create llama context
ollama |
ollama | goroutine 7 [running]:
ollama | github.com/ollama/ollama/llama/runner.(*Server).loadModel(0xc0000be1b0, {0x3, 0x0, 0x1, 0x0, {0x0, 0x0, 0x0}, 0xc000026200, 0x0}, ...)
ollama | github.com/ollama/ollama/llama/runner/runner.go:858 +0x39c
ollama | created by github.com/ollama/ollama/llama/runner.Execute in goroutine 1
ollama | github.com/ollama/ollama/llama/runner/runner.go:970 +0xd0d
ollama | time=2025-01-16T05:14:12.717Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server not responding"
ollama | time=2025-01-16T05:14:18.388Z level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: cudaMalloc failed: out of memory\nggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 9107972096\nllama_new_context_with_model: failed to allocate compute buffers"
ollama | [GIN] 2025/01/16 - 05:14:18 | 500 | 31.279653601s | 192.168.0.75 | POST "/api/chat"
ollama | time=2025-01-16T05:14:23.428Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.039409053 model=/root/.ollama/models/blobs/sha256-d83c18fb2a2ccca56d641920f01e6fe533dfb3fcf4b77c007c931497cd24a517


@rick-github commented on GitHub (Jan 16, 2025):

Also note that flash attention is not supported in deepseek architecture models.


@rick-github commented on GitHub (Jan 16, 2025):

OK, llama.cpp is overallocating because ollama sucks at calculating memory requirements for deepseek. The model is 376 GiB and has 61 layers, so a layer is ~6 GiB on average. In its first attempt it wanted to offload 5 layers and use a 9.5 GiB KV cache; ~39 GiB is obviously not going to fit in 24 GiB. In the second attempt it looks like the KV cache got allocated, but the model weights still don't fit, so you will have to increase OLLAMA_GPU_OVERHEAD or set num_gpu.
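
Spelling out that back-of-envelope estimate with the figures from the logs above (all approximate):

376.65 GiB model / 61 layers   ≈ 6.2 GiB per layer
5 offloaded layers × 6.2 GiB   ≈ 31 GiB of weights
31 GiB + 9.5 GiB KV cache      ≈ 40 GiB requested, on a GPU with 23.3 GiB free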


@SlavikCA commented on GitHub (Jan 16, 2025):

With
- 'OLLAMA_GPU_OVERHEAD=22032385536'

I was finally able to offload 1 layer to the GPU.

ollama | llm_load_tensors: offloading 1 repeating layers to GPU
ollama | llm_load_tensors: offloaded 1/62 layers to GPU
ollama | llm_load_tensors: CPU_Mapped model buffer size = 378582.96 MiB
ollama | llm_load_tensors: CUDA0 model buffer size = 7106.67 MiB
ollama | llama_new_context_with_model: n_seq_max = 1
ollama | llama_new_context_with_model: n_ctx = 2048
ollama | llama_new_context_with_model: n_ctx_per_seq = 2048
ollama | llama_new_context_with_model: n_batch = 512
ollama | llama_new_context_with_model: n_ubatch = 512
ollama | llama_new_context_with_model: flash_attn = 0
ollama | llama_new_context_with_model: freq_base = 10000.0
ollama | llama_new_context_with_model: freq_scale = 0.025
ollama | llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
ollama | llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
ollama | llama_kv_cache_init: CPU KV buffer size = 9600.00 MiB
ollama | llama_kv_cache_init: CUDA0 KV buffer size = 160.00 MiB
ollama | llama_new_context_with_model: KV self size = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
ollama | llama_new_context_with_model: CPU output buffer size = 0.52 MiB
ollama | llama_new_context_with_model: CUDA0 compute buffer size = 5030.00 MiB
ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 84.01 MiB
ollama | llama_new_context_with_model: graph nodes = 5025
ollama | llama_new_context_with_model: graph splits = 1129 (with bs=512), 3 (with bs=1)
ollama | time=2025-01-16T05:23:16.058Z level=INFO source=server.go:594 msg="llama runner started in 21.89 seconds"
ollama | [GIN] 2025/01/16 - 05:25:50 | 200 | 2m56s | 192.168.0.75 | POST "/api/chat"

Not much help from the GPU for this model.
Anyway, here is some feedback from me; maybe it can help with optimizing the "calculating memory requirements for deepseek" part.


@SlavikCA commented on GitHub (Jan 16, 2025):

The problem with figuring out the OLLAMA_GPU_OVERHEAD value is that the value needed for one model (DeepSeek V3) differs significantly from what other models (for example, qwen2.5) need.

So it sounds like OLLAMA_GPU_OVERHEAD should be a per-model parameter rather than a global one.

Otherwise, I currently need to restart Ollama whenever I want to switch from one model to another.


@rick-github commented on GitHub (Jan 16, 2025):

Since you now know how many layers you can offload, you can set num_gpu in the Modelfile. But ideally ollama should get the memory calculations right. It's been an issue for a while and I haven't had the cycles to look at it; I'll see if I can poke around in the near future.
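
A minimal Modelfile sketch of that suggestion (the FROM tag is a placeholder for whatever name the DeepSeek V3 model has locally; num_gpu 1 is the layer count that fit in the test above):

FROM deepseek-v3
PARAMETER num_gpu 1

It can then be built with something like ollama create deepseek-v3-gpu1 -f Modelfile, after which requests to that model name use the fixed layer count without any environment variable changes.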


@SlavikCA commented on GitHub (Jan 16, 2025):

I was thinking that num_gpu defines the number of GPUs used:

Set the number of GPU devices used for computation. This option controls how many GPU devices (if available) are used to process incoming requests. Increasing this value can significantly improve performance for models that are optimized for GPU acceleration but may also consume more power and GPU resources.

But in another place it has a different meaning:
https://github.com/ollama/ollama/blob/a420a453b4783841e3e79c248ef0fe9548df6914/cmd/interactive.go#L106

The number of layers to send to the GPU

Perhaps a different name could be used for that parameter?


@rick-github commented on GitHub (Jan 16, 2025):

open-webui's description is incorrect. num_gpu is well established; changing the name would break clients and Modelfiles.


@SlavikCA commented on GitHub (Jan 19, 2025):

Closing the issue, as I found that VRAM usage can be managed with num_gpu.

This issue is still unresolved:

llama.cpp is overallocating because ollama sucks at calculating memory requirements for deepseek

But I do not understand it, and it should probably be a separate issue.
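
For anyone landing here later: num_gpu can also be passed per request via the options field of the API, which avoids both a server restart and a server-wide environment variable. A sketch (the model name is a placeholder for the local tag):

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-v3",
  "messages": [{"role": "user", "content": "why the sky is blue?"}],
  "options": {"num_gpu": 1}
}'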

Reference: github-starred/ollama#5432