[GH-ISSUE #9898] A CUDA error occurs when accessing the REST API, but using the "ollama run" command works fine #68536

Closed
opened 2026-05-04 14:21:42 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @SuYueQiuLiang on GitHub (Mar 20, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9898

What is the issue?

After upgrading Ollama from 0.5.12 to 0.5.13 (or any later version, including the latest), I can no longer access the REST API, but "ollama run" works fine.
I noticed that 0.5.13 updated the compiled build to support NVIDIA Blackwell; could the two be related?
The issue is confusing, though, since 'ollama run' keeps working without problems.
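
For reference, the two request paths can be compared directly against the same server. The sketch below is not from the original report; it assumes Ollama's default listen address of 127.0.0.1:11434 and the gemma2 tag. /api/generate and /api/chat are the standard Ollama REST routes, and the log below shows a 200 on /api/generate from local CLI traffic but 500s on /api/chat from a remote client.

```python
# Sketch for comparing the two REST routes against one server.
# Assumptions: the default Ollama address 127.0.0.1:11434 and the "gemma2" tag.
import json
import urllib.error
import urllib.request

BASE = "http://127.0.0.1:11434"

def post(path, payload):
    """POST a JSON payload and return (status, body), including 5xx responses."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:  # e.g. the 500s seen in the log below
        return err.code, err.read().decode("utf-8", "replace")

# /api/generate is the route that returns 200 in the log; /api/chat is the one
# that comes back 500 when called from the remote REST client.
print(post("/api/generate", {"model": "gemma2",
                             "prompt": "why is the sky blue",
                             "stream": False}))
print(post("/api/chat", {"model": "gemma2",
                         "messages": [{"role": "user", "content": "why is the sky blue"}],
                         "stream": False}))
```

If both calls fail the same way, the server itself is the problem; if only /api/chat fails, the client path and payload are worth comparing with what the CLI sends.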

Relevant log output

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P100-16Q, compute capability 6.0, VMM: no
load_backend: loaded CUDA backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
time=2025-03-17T13:27:38.345+08:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-03-17T13:27:38.686+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CUDA0 size="7.6 GiB"
time=2025-03-17T13:27:38.686+08:00 level=INFO source=ggml.go:289 msg="model weights" buffer=CPU size="787.5 MiB"
time=2025-03-17T13:29:07.552+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CUDA0 buffer_type=CUDA0
time=2025-03-17T13:29:07.552+08:00 level=INFO source=ggml.go:356 msg="compute graph" backend=CPU buffer_type=CUDA_Host
time=2025-03-17T13:29:07.556+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-17T13:29:07.564+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.add_eot_token default=false
time=2025-03-17T13:29:07.571+08:00 level=WARN source=ggml.go:149 msg="key not found" key=tokenizer.ggml.pretokenizer default="(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
time=2025-03-17T13:29:07.585+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
time=2025-03-17T13:29:07.585+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.local.freq_base default=10000
time=2025-03-17T13:29:07.585+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.global.freq_base default=1e+06
time=2025-03-17T13:29:07.585+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.rope.freq_scale default=1
time=2025-03-17T13:29:07.585+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma3.mm_tokens_per_image default=256
time=2025-03-17T13:29:07.648+08:00 level=INFO source=server.go:624 msg="llama runner started in 98.18 seconds"
[GIN] 2025/03/17 - 13:29:07 | 200 |         1m38s |       127.0.0.1 | POST     "/api/generate"
time=2025-03-17T13:29:07.778+08:00 level=INFO source=sched.go:508 msg="updated VRAM based on existing loaded models" gpu=GPU-594bc79c-2687-11b2-b8fc-06941bda5d3f library=cuda total="16.0 GiB" available="3.3 GiB"
time=2025-03-17T13:29:07.778+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma2.vision.block_count default=0
time=2025-03-17T13:29:07.779+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma2.vision.block_count default=0
time=2025-03-17T13:29:07.780+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma2.vision.block_count default=0
time=2025-03-17T13:29:07.781+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma2.vision.block_count default=0
time=2025-03-17T13:29:12.796+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0141034 model=C:\Users\Administrator\.ollama\models\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3
time=2025-03-17T13:29:12.933+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma2.vision.block_count default=0
time=2025-03-17T13:29:12.938+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma2.vision.block_count default=0
time=2025-03-17T13:29:12.939+08:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Administrator\.ollama\models\blobs\sha256-891bd9a80644aedea8f018896b1c1af396603ebfb5e7bb96da4fdd2d867c21ac gpu=GPU-594bc79c-2687-11b2-b8fc-06941bda5d3f parallel=1 available=15680241664 required="13.5 GiB"
time=2025-03-17T13:29:12.961+08:00 level=INFO source=server.go:105 msg="system memory" total="32.0 GiB" free="27.7 GiB" free_swap="37.0 GiB"
time=2025-03-17T13:29:12.961+08:00 level=WARN source=ggml.go:149 msg="key not found" key=gemma2.vision.block_count default=0
time=2025-03-17T13:29:12.961+08:00 level=INFO source=server.go:138 msg=offload library=cuda layers.requested=-1 layers.model=47 layers.offload=47 layers.split="" memory.available="[14.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="13.5 GiB" memory.required.partial="13.5 GiB" memory.required.kv="736.0 MiB" memory.required.allocations="[13.5 GiB]" memory.weights.total="10.7 GiB" memory.weights.repeating="10.7 GiB" memory.weights.nonrepeating="922.9 MiB" memory.graph.full="509.0 MiB" memory.graph.partial="1.4 GiB"
time=2025-03-17T13:29:13.048+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2658925 model=C:\Users\Administrator\.ollama\models\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3
llama_model_loader: loaded meta data with 33 key-value pairs and 508 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-891bd9a80644aedea8f018896b1c1af396603ebfb5e7bb96da4fdd2d867c21ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = gemma-2-27b-it
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 4608
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 46
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 36864
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 32
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 128
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 128
llama_model_loader: - kv  11:                          general.file_type u32              = 27
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/gemma-2-27b-it-GGUF/gemma...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 322
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  185 tensors
llama_model_loader: - type q4_K:   97 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  225 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ3_S mix - 3.66 bpw
print_info: file size   = 11.59 GiB (3.66 BPW) 
time=2025-03-17T13:29:13.296+08:00 level=WARN source=sched.go:647 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5145427 model=C:\Users\Administrator\.ollama\models\blobs\sha256-adca500fad9b54c565ae672184e0c9eb690eb6014ba63f8ec13849d4f73a32d3
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 217
load: token to piece cache size = 1.6014 MB
print_info: arch             = gemma2
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 27.23 B
print_info: general.name     = gemma-2-27b-it
print_info: vocab type       = SPM
print_info: n_vocab          = 256000
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 107 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 227 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 107 '<end_of_turn>'
print_info: max token length = 48
llama_model_load: vocab only - skipping tensors
time=2025-03-17T13:29:13.561+08:00 level=INFO source=server.go:405 msg="starting llama server" cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-891bd9a80644aedea8f018896b1c1af396603ebfb5e7bb96da4fdd2d867c21ac --ctx-size 2048 --batch-size 512 --n-gpu-layers 47 --threads 8 --no-mmap --parallel 1 --port 4998"
time=2025-03-17T13:29:13.599+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-17T13:29:13.599+08:00 level=INFO source=server.go:585 msg="waiting for llama runner to start responding"
time=2025-03-17T13:29:13.600+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"
time=2025-03-17T13:29:13.635+08:00 level=INFO source=runner.go:931 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID P100-16Q, compute capability 6.0, VMM: no
load_backend: loaded CUDA backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\Administrator\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-sandybridge.dll
time=2025-03-17T13:29:13.887+08:00 level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-03-17T13:29:13.888+08:00 level=INFO source=runner.go:991 msg="Server listening on 127.0.0.1:4998"
time=2025-03-17T13:29:14.103+08:00 level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (GRID P100-16Q) - 14901 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 508 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-891bd9a80644aedea8f018896b1c1af396603ebfb5e7bb96da4fdd2d867c21ac (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma2
llama_model_loader: - kv   1:                               general.name str              = gemma-2-27b-it
llama_model_loader: - kv   2:                      gemma2.context_length u32              = 8192
llama_model_loader: - kv   3:                    gemma2.embedding_length u32              = 4608
llama_model_loader: - kv   4:                         gemma2.block_count u32              = 46
llama_model_loader: - kv   5:                 gemma2.feed_forward_length u32              = 36864
llama_model_loader: - kv   6:                gemma2.attention.head_count u32              = 32
llama_model_loader: - kv   7:             gemma2.attention.head_count_kv u32              = 16
llama_model_loader: - kv   8:    gemma2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv   9:                gemma2.attention.key_length u32              = 128
llama_model_loader: - kv  10:              gemma2.attention.value_length u32              = 128
llama_model_loader: - kv  11:                          general.file_type u32              = 27
llama_model_loader: - kv  12:              gemma2.attn_logit_softcapping f32              = 50.000000
llama_model_loader: - kv  13:             gemma2.final_logit_softcapping f32              = 30.000000
llama_model_loader: - kv  14:            gemma2.attention.sliding_window u32              = 4096
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  18:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  19:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  20:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  21:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  22:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  23:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  24:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  25:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  27:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/gemma-2-27b-it-GGUF/gemma...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 322
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 128
llama_model_loader: - type  f32:  185 tensors
llama_model_loader: - type q4_K:   97 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq3_s:  225 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = IQ3_S mix - 3.66 bpw
print_info: file size   = 11.59 GiB (3.66 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 217
load: token to piece cache size = 1.6014 MB
print_info: arch             = gemma2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 8192
print_info: n_embd           = 4608
print_info: n_layer          = 46
print_info: n_head           = 32
print_info: n_head_kv        = 16
print_info: n_rot            = 128
print_info: n_swa            = 4096
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 2
print_info: n_embd_k_gqa     = 2048
print_info: n_embd_v_gqa     = 2048
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 36864
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 8192
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 27B
print_info: model params     = 27.23 B
print_info: general.name     = gemma-2-27b-it
print_info: vocab type       = SPM
print_info: n_vocab          = 256000
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 107 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 227 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 107 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: offloading 46 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 47/47 layers to GPU
load_tensors:        CUDA0 model buffer size = 11872.07 MiB
load_tensors:          CPU model buffer size =   922.85 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (8192) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 46, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   736.00 MiB
llama_init_from_model: KV self size  =  736.00 MiB, K (f16):  368.00 MiB, V (f16):  368.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.99 MiB
llama_init_from_model:      CUDA0 compute buffer size =   509.00 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    17.01 MiB
llama_init_from_model: graph nodes  = 1850
llama_init_from_model: graph splits = 2
time=2025-03-17T13:30:33.377+08:00 level=INFO source=server.go:624 msg="llama runner started in 79.78 seconds"
CUDA error: the requested functionality is not supported
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:1832
  cublasGemmBatchedEx(ctx.cublas_handle(), CUBLAS_OP_T, CUBLAS_OP_N, ne01, ne11, ne10, alpha, (const void **) (ptrs_src.get() + 0*ne23), CUDA_R_16F, nb01/nb00, (const void **) (ptrs_src.get() + 1*ne23), CUDA_R_16F, nb11/nb10, beta, ( void **) (ptrs_dst.get() + 0*ne23), cu_data_type, ne01, ne23, cu_compute_type, CUBLAS_GEMM_DEFAULT_TENSOR_OP)
C:\a\ollama\ollama\ml\backend\ggml\ggml\src\ggml-cuda\ggml-cuda.cu:73: CUDA error
[GIN] 2025/03/17 - 13:30:35 | 500 |         1m36s |   192.168.1.102 | POST     "/api/chat"
[GIN] 2025/03/17 - 13:30:35 | 500 |         2m19s |   192.168.1.102 | POST     "/api/chat"
time=2025-03-17T13:30:35.671+08:00 level=ERROR source=server.go:449 msg="llama runner terminated" error="exit status 0xc0000409"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.5.13+

GiteaMirror added the bug and needs more info labels 2026-05-04 14:21:50 -05:00
Author
Owner

@rick-github commented on GitHub (Mar 20, 2025):

What request are you sending via the API?

Author
Owner

@SuYueQiuLiang commented on GitHub (Mar 21, 2025):

What request are you sending via the API?

{ "model": "gemma2", "messages": [ { "role": "system", "content": "you are a salty pirate" }, { "role": "user", "content": "why is the sky blue" } ], "stream": false }
generated by post man test api

Author
Owner

@rick-github commented on GitHub (Mar 21, 2025):

Where did you get the model from?

Author
Owner

@SuYueQiuLiang commented on GitHub (Mar 24, 2025):

Where did you get the model from?

I have tried multiple models, including Gemma2, Gemma3, Qwen, and QWQ. Most of them are from the Ollama library; only Gemma2 was downloaded from Hugging Face. Except for Gemma3, the other models work well in the old Ollama version. Gemma3 requires a newer Ollama version, so I can't run it there.

Author
Owner

@rick-github commented on GitHub (Mar 24, 2025):

ollama run is just a wrapper for the REST API, so both should work or both should fail. Are you running multiple servers?
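
One way to check for that (a sketch, not something from the thread) is to query the /api/version endpoint on every address in play; different version strings, or one address answering while another does not, would point to two separate servers.

```python
# Sketch: query /api/version on each candidate address to see whether the CLI
# and the REST client could be talking to different ollama servers.
import json
import os
import sys
import urllib.request

# First candidate is what `ollama run` would use (OLLAMA_HOST or the default);
# pass the address the REST client targets as an extra argument, e.g.
#   python check_servers.py http://my-ollama-host:11434   (hypothetical host)
candidates = [os.environ.get("OLLAMA_HOST", "http://127.0.0.1:11434"), *sys.argv[1:]]

for base in candidates:
    if "://" not in base:
        base = "http://" + base
    url = base.rstrip("/") + "/api/version"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(base, "->", json.load(resp))   # e.g. {'version': '0.5.13'}
    except OSError as err:
        print(base, "-> unreachable:", err)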

Reference: github-starred/ollama#68536