[GH-ISSUE #4442] Error: llama runner process has terminated: exit status 0xc0000409 #2774

New Issue

GiteaMirror · 2026-04-12T13:05:50-05:00

GiteaMirror commented

2026-04-12 13:05:50 -05:00

Originally created by @hcr707305003 on GitHub (May 15, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4442

Originally assigned to: @dhiltgen on GitHub.

What is the issue?

when i run quantified model on v0.1.37,is errors out Error: llama runner process has terminated: exit status 0xc0000409

first step:

>>> ollama create test_q8_0 -f building_qwen_7b_gguf.Modelfile
transferring model data
using existing layer sha256:82ed01bbf8fff66078cc84849c959d77e6ee78400cf176513b22055f3848bd09
creating new layer sha256:58353639a7c4b7529da8c5c8a63e81c426f206bab10cf82e4b9e427f15a466f8
creating new layer sha256:1da117d6723df114af0d948b614cae0aa684875e2775ca9607d23e2e0769651d
creating new layer sha256:9297f08dd6c6435240b5cddc93261e8a159aa0fecf010de4568ec2df2417bdb2
creating new layer sha256:c1f908392f9e4c55a2a12ddd1035ef02f552216301e1ab2cf545aa70b4f93b67
writing manifest
success

secord step:

>>> ollama run test_q8_0
Error: llama runner process has terminated: exit status 0xc0000409

OS

Windows

GPU

Intel

CPU

Intel

Ollama version

v0.1.37

Originally created by @hcr707305003 on GitHub (May 15, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/4442 Originally assigned to: @dhiltgen on GitHub. ### What is the issue? when i run quantified model on v0.1.37,is errors out `Error: llama runner process has terminated: exit status 0xc0000409` first step: ```shell >>> ollama create test_q8_0 -f building_qwen_7b_gguf.Modelfile transferring model data using existing layer sha256:82ed01bbf8fff66078cc84849c959d77e6ee78400cf176513b22055f3848bd09 creating new layer sha256:58353639a7c4b7529da8c5c8a63e81c426f206bab10cf82e4b9e427f15a466f8 creating new layer sha256:1da117d6723df114af0d948b614cae0aa684875e2775ca9607d23e2e0769651d creating new layer sha256:9297f08dd6c6435240b5cddc93261e8a159aa0fecf010de4568ec2df2417bdb2 creating new layer sha256:c1f908392f9e4c55a2a12ddd1035ef02f552216301e1ab2cf545aa70b4f93b67 writing manifest success ``` secord step: ```shell >>> ollama run test_q8_0 Error: llama runner process has terminated: exit status 0xc0000409 ``` ### OS Windows ### GPU Intel ### CPU Intel ### Ollama version v0.1.37

GiteaMirror added the bug windows labels 2026-04-12 13:05:50 -05:00

GiteaMirror closed this issue

2026-04-12 13:05:50 -05:00

GiteaMirror commented

2026-04-12 13:05:52 -05:00

@hcr707305003 commented on GitHub (May 15, 2024):

this is my building_qwen_7b_gguf.Modelfile

FROM test_quantize-q8_0.gguf

# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
PARAMETER top_k 20

TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""

# set the system message
SYSTEM """
You are a helpful assistant.
"""

@hcr707305003 commented on GitHub (May 15, 2024): this is my building_qwen_7b_gguf.Modelfile ```shell FROM test_quantize-q8_0.gguf # set the temperature to 1 [higher is more creative, lower is more coherent] PARAMETER temperature 0.7 PARAMETER top_p 0.8 PARAMETER repeat_penalty 1.05 PARAMETER top_k 20 TEMPLATE """{{ if and .First .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}<|im_start|>user {{ .Prompt }}<|im_end|> <|im_start|>assistant {{ .Response }}""" # set the system message SYSTEM """ You are a helpful assistant. """ ```

GiteaMirror commented

2026-04-12 13:05:53 -05:00

@lgw-0 commented on GitHub (May 15, 2024):

I'm similar to you. I'm running a fine-tuned model.
convert step as follow:

python convert-hf-to-gguf.py /content/LLaMA-Factory/p1 --outfile /content/drive/MyDrive/model/qwen1_5-1.8b-chat-fp16.gguf

step:

ollama run qwen_p
Error: llama runner process has terminated: exit status 0xc0000409

logs:
time=2024-05-15T17:57:25.717+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
time=2024-05-15T17:57:26.113+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server error"
time=2024-05-15T17:57:26.371+08:00 level=ERROR source=sched.go:339 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "
[GIN] 2024/05/15 - 17:57:26 | 500 | 3.2125605s | 127.0.0.1 | POST "/api/chat"

Each time I try to load these models, I get the same error.
Could anyone provide a fix?
Thank you in advance :)

@lgw-0 commented on GitHub (May 15, 2024): I'm similar to you. I'm running a fine-tuned model. convert step as follow: python convert-hf-to-gguf.py /content/LLaMA-Factory/p1 --outfile /content/drive/MyDrive/model/qwen1_5-1.8b-chat-fp16.gguf step: >ollama run qwen_p Error: llama runner process has terminated: exit status 0xc0000409 logs: time=2024-05-15T17:57:25.717+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server loading model" llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2' llama_load_model_from_file: exception loading model time=2024-05-15T17:57:26.113+08:00 level=INFO source=server.go:524 msg="waiting for server to become available" status="llm server error" time=2024-05-15T17:57:26.371+08:00 level=ERROR source=sched.go:339 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 " [GIN] 2024/05/15 - 17:57:26 | 500 | 3.2125605s | 127.0.0.1 | POST "/api/chat" Each time I try to load these models, I get the same error. Could anyone provide a fix? Thank you in advance :)

GiteaMirror commented

2026-04-12 13:05:53 -05:00

@xdfnet commented on GitHub (May 15, 2024):

me too

D:\llama.cpp>ollama create eduaigc -f modelfile
transferring model data
using existing layer sha256:28ce318a0cda9dac3b5561c944c16c7e966b07890bed5bb12e122646bc8d71c4
creating new layer sha256:58353639a7c4b7529da8c5c8a63e81c426f206bab10cf82e4b9e427f15a466f8
creating new layer sha256:1da117d6723df114af0d948b614cae0aa684875e2775ca9607d23e2e0769651d
creating new layer sha256:9297f08dd6c6435240b5cddc93261e8a159aa0fecf010de4568ec2df2417bdb2
creating new layer sha256:14d7a26fe5b8e2168e038646c5fb6b0048e27c33628abda8d92ebfed0f369b9f
writing manifest
success

D:\llama.cpp>ollama run eduaigc
Error: llama runner process has terminated: exit status 0xc0000409

modelfile

FROM eduaigc-Q4_0.gguf

# set the temperature to 1 [higher is more creative, lower is more coherent]
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER repeat_penalty 1.05
PARAMETER top_k 20

TEMPLATE """{{ if and .First .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}"""

# set the system message
SYSTEM """
You are a helpful assistant.
"""

@xdfnet commented on GitHub (May 15, 2024): me too ```shell D:\llama.cpp>ollama create eduaigc -f modelfile transferring model data using existing layer sha256:28ce318a0cda9dac3b5561c944c16c7e966b07890bed5bb12e122646bc8d71c4 creating new layer sha256:58353639a7c4b7529da8c5c8a63e81c426f206bab10cf82e4b9e427f15a466f8 creating new layer sha256:1da117d6723df114af0d948b614cae0aa684875e2775ca9607d23e2e0769651d creating new layer sha256:9297f08dd6c6435240b5cddc93261e8a159aa0fecf010de4568ec2df2417bdb2 creating new layer sha256:14d7a26fe5b8e2168e038646c5fb6b0048e27c33628abda8d92ebfed0f369b9f writing manifest success D:\llama.cpp>ollama run eduaigc Error: llama runner process has terminated: exit status 0xc0000409 ``` modelfile ```shell FROM eduaigc-Q4_0.gguf # set the temperature to 1 [higher is more creative, lower is more coherent] PARAMETER temperature 0.7 PARAMETER top_p 0.8 PARAMETER repeat_penalty 1.05 PARAMETER top_k 20 TEMPLATE """{{ if and .First .System }}<|im_start|>system {{ .System }}<|im_end|> {{ end }}<|im_start|>user {{ .Prompt }}<|im_end|> <|im_start|>assistant {{ .Response }}""" # set the system message SYSTEM """ You are a helpful assistant. """ ```

GiteaMirror commented

2026-04-12 13:05:54 -05:00

@wanshichenguang commented on GitHub (May 20, 2024):

I have the same problem:
use llama_factory to have a lora. export model:

Note: DO NOT use quantized model or quantization_bit when merging lora adapters

model

model_name_or_path: /hy-tmp/model/qwen/Qwen1___5-7B-Chat
adapter_name_or_path: /hy-tmp/model/checkpoint7
template: qwen
finetuning_type: lora

export

export_dir: /hy-tmp/qwen7
export_size: 2
export_device: cpu
export_legacy_format: false

and use llama.cpp to convert :
python convert-hf-to-gguf.py /hy-tmp/qwen7

it works:
(info_extra) root@7eff8c7865f0:/project/llama.cpp# ./main -m /hy-tmp/qwen7/ggml-model-f16.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
Log start
main: build = 2887 (583fd6b0)
main: built with cc (Ubuntu 11.4.0-1ubuntu122.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1716201079
llama_model_loader: loaded meta data with 21 key-value pairs and 387 tensors from /hy-tmp/qwen7/ggml-model-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = qwen7
llama_model_loader: - kv 2: qwen2.block_count u32 = 32
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 4096
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 32
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 32
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% set system_message = 'You are a he...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type f16: 226 tensors
llm_load_vocab: special tokens definition check successful ( 293/151936 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 7.72 B
llm_load_print_meta: model size = 14.38 GiB (16.00 BPW)
llm_load_print_meta: general.name = qwen7
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.18 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors: CPU buffer size = 14728.52 MiB
......................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA_Host KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 1491.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB
llama_new_context_with_model: graph nodes = 1126
llama_new_context_with_model: graph splits = 452

== Running in interactive mode. ==

Press Ctrl+C to interject at any time.
Press Return to return control to LLaMa.
To return control without starting a new line, end your input with '/'.
If you want to submit another line, end your input with ''.

你好啊
你好！有什么问题或需要帮助吗？<|im_end|>

the same problem in ollama:

time=2024-05-20T10:41:19.127Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
time=2024-05-20T10:41:19.379Z level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/05/20 - 10:41:19 | 500 | 1.463489615s | 127.0.0.1 | POST "/api/chat"
time=2024-05-20T10:41:24.584Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.205149643
time=2024-05-20T10:41:24.863Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.484359346
time=2024-05-20T10:41:25.142Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.762816273

@wanshichenguang commented on GitHub (May 20, 2024): I have the same problem: use llama_factory to have a lora. export model: # Note: DO NOT use quantized model or quantization_bit when merging lora adapters # model model_name_or_path: /hy-tmp/model/qwen/Qwen1___5-7B-Chat adapter_name_or_path: /hy-tmp/model/checkpoint7 template: qwen finetuning_type: lora # export export_dir: /hy-tmp/qwen7 export_size: 2 export_device: cpu export_legacy_format: false and use llama.cpp to convert : python convert-hf-to-gguf.py /hy-tmp/qwen7 it works: (info_extra) root@7eff8c7865f0:~/project/llama.cpp# ./main -m /hy-tmp/qwen7/ggml-model-f16.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt Log start main: build = 2887 (583fd6b0) main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu main: seed = 1716201079 llama_model_loader: loaded meta data with 21 key-value pairs and 387 tensors from /hy-tmp/qwen7/ggml-model-f16.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.name str = qwen7 llama_model_loader: - kv 2: qwen2.block_count u32 = 32 llama_model_loader: - kv 3: qwen2.context_length u32 = 32768 llama_model_loader: - kv 4: qwen2.embedding_length u32 = 4096 llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 11008 llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 32 llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 32 llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 10: general.file_type u32 = 1 llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2 llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645 llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 19: tokenizer.chat_template str = {% set system_message = 'You are a he... llama_model_loader: - kv 20: general.quantization_version u32 = 2 llama_model_loader: - type f32: 161 tensors llama_model_loader: - type f16: 226 tensors llm_load_vocab: special tokens definition check successful ( 293/151936 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 32 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 4096 llm_load_print_meta: n_embd_v_gqa = 4096 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 11008 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = F16 llm_load_print_meta: model params = 7.72 B llm_load_print_meta: model size = 14.38 GiB (16.00 BPW) llm_load_print_meta: general.name = qwen7 llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151645 '<|im_end|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.18 MiB llm_load_tensors: offloading 0 repeating layers to GPU llm_load_tensors: offloaded 0/33 layers to GPU llm_load_tensors: CPU buffer size = 14728.52 MiB ...................................................................................... llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 1000000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA_Host KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB llama_new_context_with_model: CUDA0 compute buffer size = 1491.75 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 17.01 MiB llama_new_context_with_model: graph nodes = 1126 llama_new_context_with_model: graph splits = 452 system_info: n_threads = 8 / 24 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | main: interactive mode on. Reverse prompt: '<|im_start|>user ' sampling: repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000 top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800 mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000 sampling order: CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature generate: n_ctx = 512, n_batch = 2048, n_predict = 512, n_keep = 11 == Running in interactive mode. == - Press Ctrl+C to interject at any time. - Press Return to return control to LLaMa. - To return control without starting a new line, end your input with '/'. - If you want to submit another line, end your input with '\'. <|endoftext|><|im_start|>system You are a helpful assistant.<|im_end|> <|im_start|>user > 你好啊你好！有什么问题或需要帮助吗？<|im_end|> the same problem in ollama: time=2024-05-20T10:41:19.127Z level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model" llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2' llama_load_model_from_file: exception loading model terminate called after throwing an instance of 'std::runtime_error' what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2' time=2024-05-20T10:41:19.379Z level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) " [GIN] 2024/05/20 - 10:41:19 | 500 | 1.463489615s | 127.0.0.1 | POST "/api/chat" time=2024-05-20T10:41:24.584Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.205149643 time=2024-05-20T10:41:24.863Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.484359346 time=2024-05-20T10:41:25.142Z level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.762816273

GiteaMirror commented

2026-04-12 13:05:55 -05:00

@suadAlwajeeh commented on GitHub (Jun 10, 2024):

I have the same error
ollama run qwen2:0.5b
Error: llama runner process has terminated: exit status 0xc0000409

my ollama version was v0.1.38 ,but when I upgraded to v0.1.42 problem solved and the llm runs successfully

@suadAlwajeeh commented on GitHub (Jun 10, 2024): I have the same error ollama run qwen2:0.5b Error: llama runner process has terminated: exit status 0xc0000409 my ollama version was v0.1.38 ,but when I upgraded to v0.1.42 problem solved and the llm runs successfully

GiteaMirror commented

2026-04-12 13:05:56 -05:00

@parvuselephantus commented on GitHub (Jun 11, 2024):

I just try:
ollama run hhao/openbmb-minicpm-llama3-v-2_5
with no other configuration. Windows 11, CPU, ollama v0.1.42 - I'm gettting same error.

@parvuselephantus commented on GitHub (Jun 11, 2024): I just try: ollama run hhao/openbmb-minicpm-llama3-v-2_5 with no other configuration. Windows 11, CPU, ollama v0.1.42 - I'm gettting same error.

GiteaMirror commented

2026-04-12 13:05:56 -05:00

@suadAlwajeeh commented on GitHub (Jun 11, 2024):

I just try: ollama run hhao/openbmb-minicpm-llama3-v-2_5 with no other configuration. Windows 11, CPU, ollama v0.1.42 - I'm gettting same error.

try download and install again

@suadAlwajeeh commented on GitHub (Jun 11, 2024): > I just try: ollama run hhao/openbmb-minicpm-llama3-v-2_5 with no other configuration. Windows 11, CPU, ollama v0.1.42 - I'm gettting same error. try download and install again

GiteaMirror commented

2026-04-12 13:05:57 -05:00

@parvuselephantus commented on GitHub (Jun 11, 2024):

try download and install again

Thanks, I though about restarting PC, but didn't think of reinstalling model. Tried ollama rm and then run again, but unfortunately still same error:

ollama run hhao/openbmb-minicpm-llama3-v-2_5
pulling manifest
pulling 391d11736c3c... 100% ▕████████████████████████████████████████████████████████▏ 1.0 GB
pulling 010ec3ba94cb... 100% ▕████████████████████████████████████████████████████████▏ 4.9 GB
pulling 8ab4849b038c... 100% ▕████████████████████████████████████████████████████████▏ 254 B
pulling 2c527a8fcba5... 100% ▕████████████████████████████████████████████████████████▏ 124 B
pulling ada64ec88682... 100% ▕████████████████████████████████████████████████████████▏ 493 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Error: llama runner process has terminated: exit status 0xc0000409

@parvuselephantus commented on GitHub (Jun 11, 2024): > try download and install again Thanks, I though about restarting PC, but didn't think of reinstalling model. Tried ollama rm and then run again, but unfortunately still same error: >ollama run hhao/openbmb-minicpm-llama3-v-2_5 pulling manifest pulling 391d11736c3c... 100% ▕████████████████████████████████████████████████████████▏ 1.0 GB pulling 010ec3ba94cb... 100% ▕████████████████████████████████████████████████████████▏ 4.9 GB pulling 8ab4849b038c... 100% ▕████████████████████████████████████████████████████████▏ 254 B pulling 2c527a8fcba5... 100% ▕████████████████████████████████████████████████████████▏ 124 B pulling ada64ec88682... 100% ▕████████████████████████████████████████████████████████▏ 493 B verifying sha256 digest writing manifest removing any unused layers success Error: llama runner process has terminated: exit status 0xc0000409

GiteaMirror commented

2026-04-12 13:05:58 -05:00

@DHclly commented on GitHub (Jun 12, 2024):

ollama run qwen:1.8b
Error: llama runner process has terminated: exit status 0xc0000409 CUDA error"

time=2024-06-12T09:10:33.042+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=25 memory.available="11.0 GiB" memory.required.full="2.0 GiB" memory.required.partial="2.0 GiB" memory.required.kv="384.0 MiB" memory.weights.total="895.7 MiB" memory.weights.repeating="652.3 MiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="300.8 MiB" memory.graph.partial="544.2 MiB"
time=2024-06-12T09:10:33.042+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=25 memory.available="11.0 GiB" memory.required.full="2.0 GiB" memory.required.partial="2.0 GiB" memory.required.kv="384.0 MiB" memory.weights.total="895.7 MiB" memory.weights.repeating="652.3 MiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="300.8 MiB" memory.graph.partial="544.2 MiB"
time=2024-06-12T09:10:33.048+08:00 level=INFO source=server.go:341 msg="starting llama server" cmd="C:\Users\Administrator\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\Administrator\.ollama\models\blobs\sha256-1296b084ed6bc4c6eaee99255d73e9c715d38e0087b6467fd1c498b908180614 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 63328"
time=2024-06-12T09:10:33.052+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-12T09:10:33.052+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-12T09:10:33.053+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3051 commit="5921b8f0" tid="21596" timestamp=1718154633
INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="21596" timestamp=1718154633 total_threads=12
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="63328" tid="21596" timestamp=1718154633
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\Users\Administrator.ollama\models\blobs\sha256-1296b084ed6bc4c6eaee99255d73e9c715d38e0087b6467fd1c498b908180614 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen2-beta-1_8B-Chat
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 5504
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 9: qwen2.use_parallel_residual bool = true
llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 15: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.chat_template str = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - kv 19: general.file_type u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type q4_0: 169 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-06-12T09:10:33.317+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default'
llm_load_vocab: special tokens cache size = 293
llm_load_vocab: token to piece cache size = 1.8676 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: n_ctx_train = 32768
llm_load_print_meta: n_embd = 2048
llm_load_print_meta: n_head = 16
llm_load_print_meta: n_head_kv = 16
llm_load_print_meta: n_layer = 24
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 2048
llm_load_print_meta: n_embd_v_gqa = 2048
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 5504
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 32768
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 1B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 1.84 B
llm_load_print_meta: model size = 1.04 GiB (4.85 BPW)
llm_load_print_meta: general.name = Qwen2-beta-1_8B-Chat
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151643 '<|endoftext|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size = 0.28 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: CPU buffer size = 166.92 MiB
llm_load_tensors: CUDA0 buffer size = 895.75 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 384.00 MiB
llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 300.75 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 8.01 MiB
llama_new_context_with_model: graph nodes = 846
llama_new_context_with_model: graph splits = 2
fatal : Memory allocation failure
CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:653
cublasCreate_v2(&cublas_handles[device])
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu💯 !"CUDA error"
time=2024-06-12T09:10:39.665+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
time=2024-06-12T09:10:39.923+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error""
[GIN] 2024/06/12 - 09:10:39 | 500 | 9.0390595s | 127.0.0.1 | POST "/api/chat"

before I sleep is right , but today is become bad

@DHclly commented on GitHub (Jun 12, 2024): ``` ollama run qwen:1.8b Error: llama runner process has terminated: exit status 0xc0000409 CUDA error" ``` time=2024-06-12T09:10:33.042+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=25 memory.available="11.0 GiB" memory.required.full="2.0 GiB" memory.required.partial="2.0 GiB" memory.required.kv="384.0 MiB" memory.weights.total="895.7 MiB" memory.weights.repeating="652.3 MiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="300.8 MiB" memory.graph.partial="544.2 MiB" time=2024-06-12T09:10:33.042+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=25 memory.available="11.0 GiB" memory.required.full="2.0 GiB" memory.required.partial="2.0 GiB" memory.required.kv="384.0 MiB" memory.weights.total="895.7 MiB" memory.weights.repeating="652.3 MiB" memory.weights.nonrepeating="243.4 MiB" memory.graph.full="300.8 MiB" memory.graph.partial="544.2 MiB" time=2024-06-12T09:10:33.048+08:00 level=INFO source=server.go:341 msg="starting llama server" cmd="C:\\Users\\Administrator\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\Administrator\\.ollama\\models\\blobs\\sha256-1296b084ed6bc4c6eaee99255d73e9c715d38e0087b6467fd1c498b908180614 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 25 --parallel 1 --port 63328" time=2024-06-12T09:10:33.052+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1 time=2024-06-12T09:10:33.052+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding" time=2024-06-12T09:10:33.053+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3051 commit="5921b8f0" tid="21596" timestamp=1718154633 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="21596" timestamp=1718154633 total_threads=12 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="63328" tid="21596" timestamp=1718154633 llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from C:\Users\Administrator\.ollama\models\blobs\sha256-1296b084ed6bc4c6eaee99255d73e9c715d38e0087b6467fd1c498b908180614 (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = qwen2 llama_model_loader: - kv 1: general.name str = Qwen2-beta-1_8B-Chat llama_model_loader: - kv 2: qwen2.block_count u32 = 24 llama_model_loader: - kv 3: qwen2.context_length u32 = 32768 llama_model_loader: - kv 4: qwen2.embedding_length u32 = 2048 llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 5504 llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16 llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16 llama_model_loader: - kv 8: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 llama_model_loader: - kv 9: qwen2.use_parallel_residual bool = true llama_model_loader: - kv 10: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 11: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 12: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 13: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 151643 llama_model_loader: - kv 15: tokenizer.ggml.padding_token_id u32 = 151643 llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 151643 llama_model_loader: - kv 17: tokenizer.chat_template str = {% for message in messages %}{{'<|im_... llama_model_loader: - kv 18: general.quantization_version u32 = 2 llama_model_loader: - kv 19: general.file_type u32 = 2 llama_model_loader: - type f32: 121 tensors llama_model_loader: - type q4_0: 169 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-06-12T09:10:33.317+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: missing or unrecognized pre-tokenizer type, using: 'default' llm_load_vocab: special tokens cache size = 293 llm_load_vocab: token to piece cache size = 1.8676 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = qwen2 llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 151936 llm_load_print_meta: n_merges = 151387 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 2048 llm_load_print_meta: n_head = 16 llm_load_print_meta: n_head_kv = 16 llm_load_print_meta: n_layer = 24 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 2048 llm_load_print_meta: n_embd_v_gqa = 2048 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-06 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 5504 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 1B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 1.84 B llm_load_print_meta: model size = 1.04 GiB (4.85 BPW) llm_load_print_meta: general.name = Qwen2-beta-1_8B-Chat llm_load_print_meta: BOS token = 151643 '<|endoftext|>' llm_load_print_meta: EOS token = 151643 '<|endoftext|>' llm_load_print_meta: PAD token = 151643 '<|endoftext|>' llm_load_print_meta: LF token = 148848 'ÄĬ' llm_load_print_meta: EOT token = 151645 '<|im_end|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes llm_load_tensors: ggml ctx size = 0.28 MiB llm_load_tensors: offloading 24 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 25/25 layers to GPU llm_load_tensors: CPU buffer size = 166.92 MiB llm_load_tensors: CUDA0 buffer size = 895.75 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 384.00 MiB llama_new_context_with_model: KV self size = 384.00 MiB, K (f16): 192.00 MiB, V (f16): 192.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB llama_new_context_with_model: CUDA0 compute buffer size = 300.75 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 8.01 MiB llama_new_context_with_model: graph nodes = 846 llama_new_context_with_model: graph splits = 2 fatal : Memory allocation failure CUDA error: CUBLAS_STATUS_NOT_INITIALIZED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:653 cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error" time=2024-06-12T09:10:39.665+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error" time=2024-06-12T09:10:39.923+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error\"" [GIN] 2024/06/12 - 09:10:39 | 500 | 9.0390595s | 127.0.0.1 | POST "/api/chat" before I sleep is right , but today is become bad

GiteaMirror commented

2026-04-12 13:05:58 -05:00

@parvuselephantus commented on GitHub (Jun 12, 2024):

got update to 0.1.43 - still same error. As per DHclly - seems it's not only on CPU (Now I will be afraid to go to sleep when it works!)

@parvuselephantus commented on GitHub (Jun 12, 2024): got update to 0.1.43 - still same error. As per DHclly - seems it's not only on CPU (Now I will be afraid to go to sleep when it works!)

GiteaMirror commented

2026-04-12 13:05:59 -05:00

@DHclly commented on GitHub (Jun 12, 2024):

got update to 0.1.43 - still same error. As per DHclly - seems it's not only on CPU (Now I will be afraid to go to sleep when it works!)

It's amazing，after 1 hours , I restart it , it run very well on nvidia gpu , now , it's running success, but I don't know why.

@DHclly commented on GitHub (Jun 12, 2024): > got update to 0.1.43 - still same error. As per DHclly - seems it's not only on CPU (Now I will be afraid to go to sleep when it works!) It's amazing，after 1 hours , I restart it , it run very well on nvidia gpu , now , it's running success, but I don't know why.

GiteaMirror commented

2026-04-12 13:06:00 -05:00

@lijunfeng11 commented on GitHub (Jul 2, 2024):

me too
`C:\Users\LI>ollama run llama3
pulling manifest
pulling 6a0746a1ec1a... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 12 KB
pulling 8ab4849b038c... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 254 B
pulling 577073ffcc6c... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 110 B
pulling 3f8eb4da87fa... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 485 B
verifying sha256 digest
writing manifest
removing any unused layers
success
Error: llama runner process has terminated: exit status 0xc0000409 error:failed to create context with model 'C:\Users\LI.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa'

C:\Users\LI>
C:\Users\LI>
C:\Users\LI>ollama run llama3
Error: llama runner process has terminated: exit status 0xc0000409 error:failed to create context with model 'C:\Users\LI.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa'`

@lijunfeng11 commented on GitHub (Jul 2, 2024): me too `C:\Users\LI>ollama run llama3 pulling manifest pulling 6a0746a1ec1a... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.7 GB pulling 4fa551d4f938... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 12 KB pulling 8ab4849b038c... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 254 B pulling 577073ffcc6c... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 110 B pulling 3f8eb4da87fa... 100% ▕█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 485 B verifying sha256 digest writing manifest removing any unused layers success Error: llama runner process has terminated: exit status 0xc0000409 error:failed to create context with model 'C:\Users\LI\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa' C:\Users\LI> C:\Users\LI> C:\Users\LI>ollama run llama3 Error: llama runner process has terminated: exit status 0xc0000409 error:failed to create context with model 'C:\Users\LI\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa'`

GiteaMirror commented

2026-04-12 13:06:01 -05:00

@dhiltgen commented on GitHub (Jul 3, 2024):

Unfortunately the exit code 0xc0000409 just indicates something went wrong. It looks like there are multiple unrelated topics in this issue.

For people trying to use qwen, please make sure to upgrade to the latest version, as fixes have gone in over the past few releases which should hopefully resolve those.

For people trying to create their own models which are causing the server to crash, please share your server log which may help understand which property/parameter caused the failure.

For the Memory allocation failure, please make sure you're running the latest version, and if that doesn't clear it, please share your server log.

@dhiltgen commented on GitHub (Jul 3, 2024): Unfortunately the exit code 0xc0000409 just indicates something went wrong. It looks like there are multiple unrelated topics in this issue. For people trying to use qwen, please make sure to upgrade to the latest version, as fixes have gone in over the past few releases which should hopefully resolve those. For people trying to create their own models which are causing the server to crash, please share your server log which may help understand which property/parameter caused the failure. For the Memory allocation failure, please make sure you're running the latest version, and if that doesn't clear it, please share your server log.

GiteaMirror commented

2026-04-12 13:06:02 -05:00

@someone2018 commented on GitHub (Jul 5, 2024):

same error here. @dhiltgen

Here are my logs :

2024/07/05 14:14:20 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:C:\Users\DELL\.ollama\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-05T14:14:20.644+02:00 level=INFO source=images.go:730 msg="total blobs: 0"
time=2024-07-05T14:14:20.644+02:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0"
time=2024-07-05T14:14:20.645+02:00 level=INFO source=routes.go:1111 msg="Listening on 127.0.0.1:11434 (version 0.1.48)"
time=2024-07-05T14:14:20.645+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11.3 rocm_v5.7 cpu cpu_avx]"
time=2024-07-05T14:14:21.533+02:00 level=INFO source=types.go:98 msg="inference compute" id=GPU-15161996-1a7c-8143-bc65-810c3bf997fb library=cuda compute=7.5 driver=0.0 name="" total="6.0 GiB" available="5.0 GiB"
[GIN] 2024/07/05 - 14:14:33 | 200 | 0s | 127.0.0.1 | HEAD "/"
[GIN] 2024/07/05 - 14:14:33 | 404 | 575.7µs | 127.0.0.1 | POST "/api/show"
time=2024-07-05T14:14:35.466+02:00 level=INFO source=download.go:136 msg="downloading 6a0746a1ec1a in 47 100 MB part(s)"
time=2024-07-05T14:17:40.917+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 11 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:17:45.920+02:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 11 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection."
time=2024-07-05T14:18:16.571+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 16 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:18:48.245+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 7 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:18:56.308+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 23 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:18:59.772+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 9 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:19:00.704+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 29 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:19:15.866+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 5 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:19:21.075+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 19 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:19:31.399+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 33 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:19:37.085+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 20 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:19:42.827+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 45 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:19:49.355+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 26 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:20:04.830+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 34 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:20:21.486+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 37 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:20:31.388+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 4 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:20:34.714+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 2 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:20:43.434+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 38 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:20:50.118+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 44 attempt 0 failed: unexpected EOF, retrying in 1s"
time=2024-07-05T14:21:27.716+02:00 level=INFO source=download.go:136 msg="downloading 4fa551d4f938 in 1 12 KB part(s)"
time=2024-07-05T14:21:29.588+02:00 level=INFO source=download.go:136 msg="downloading 8ab4849b038c in 1 254 B part(s)"
time=2024-07-05T14:21:31.512+02:00 level=INFO source=download.go:136 msg="downloading 577073ffcc6c in 1 110 B part(s)"
time=2024-07-05T14:21:33.288+02:00 level=INFO source=download.go:136 msg="downloading 3f8eb4da87fa in 1 485 B part(s)"
[GIN] 2024/07/05 - 14:21:44 | 200 | 7m11s | 127.0.0.1 | POST "/api/pull"
[GIN] 2024/07/05 - 14:21:44 | 200 | 17.08ms | 127.0.0.1 | POST "/api/show"
time=2024-07-05T14:21:44.703+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[5.0 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-05T14:21:44.706+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 59978"
time=2024-07-05T14:21:44.730+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
time=2024-07-05T14:21:44.730+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
time=2024-07-05T14:21:44.730+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3171 commit="7c26775a" tid="10184" timestamp=1720182105
INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="10184" timestamp=1720182105 total_threads=12
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="59978" tid="10184" timestamp=1720182105
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
time=2024-07-05T14:21:45.247+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 281.81 MiB
llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="10184" timestamp=1720182107
time=2024-07-05T14:21:48.178+02:00 level=INFO source=server.go:599 msg="llama runner started in 3.45 seconds"
[GIN] 2024/07/05 - 14:21:48 | 200 | 3.5127842s | 127.0.0.1 | POST "/api/chat"
[GIN] 2024/07/05 - 14:22:07 | 200 | 11.0666417s | 127.0.0.1 | POST "/api/chat"
time=2024-07-05T14:26:05.227+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="487.5 MiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="283.4 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-05T14:26:05.230+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 3900 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 60034"
time=2024-07-05T14:26:05.233+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
time=2024-07-05T14:26:05.233+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
time=2024-07-05T14:26:05.234+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3171 commit="7c26775a" tid="23620" timestamp=1720182365
INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="23620" timestamp=1720182365 total_threads=12
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="60034" tid="23620" timestamp=1720182365
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-07-05T14:26:05.485+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 281.81 MiB
llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
llama_new_context_with_model: n_ctx = 3904
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 488.00 MiB
llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 283.63 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 15.63 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
CUDA error: CUBLAS_STATUS_ALLOC_FAILED
current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:826
cublasCreate_v2(&cublas_handles[device])
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu💯 !"CUDA error"
time=2024-07-05T14:26:08.104+02:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error""
[GIN] 2024/07/05 - 14:26:08 | 500 | 3.2221016s | 127.0.0.1 | POST "/api/chat"
time=2024-07-05T14:26:13.138+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0337086 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-05T14:26:13.386+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2815106 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-05T14:26:13.634+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5300006 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
[GIN] 2024/07/05 - 14:30:22 | 404 | 0s | 127.0.0.1 | GET "/api/chat"
[GIN] 2024/07/05 - 14:31:49 | 200 | 0s | 127.0.0.1 | GET "/"
[GIN] 2024/07/05 - 14:32:21 | 404 | 0s | 127.0.0.1 | GET "/api/chat"
time=2024-07-05T15:59:52.039+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="487.5 MiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="283.4 MiB" memory.graph.partial="677.5 MiB"
time=2024-07-05T15:59:52.044+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\Users\DELL\AppData\Local\Programs\Ollama\ollama_runners\cuda_v11.3\ollama_llama_server.exe --model C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 3900 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 60942"
time=2024-07-05T15:59:52.073+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1
time=2024-07-05T15:59:52.073+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding"
time=2024-07-05T15:59:52.074+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3171 commit="7c26775a" tid="11152" timestamp=1720187992
INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="11152" timestamp=1720187992 total_threads=12
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="60942" tid="11152" timestamp=1720187992
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
time=2024-07-05T15:59:52.846+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.30 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: CUDA_Host buffer size = 281.81 MiB
llm_load_tensors: CUDA0 buffer size = 4155.99 MiB
llama_new_context_with_model: n_ctx = 3904
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 488.00 MiB
llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 283.63 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 15.63 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 2
CUDA error: CUBLAS_STATUS_ALLOC_FAILED
current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:826
cublasCreate_v2(&cublas_handles[device])
GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu💯 !"CUDA error"
time=2024-07-05T15:59:57.275+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error"
time=2024-07-05T15:59:57.538+02:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error""
[GIN] 2024/07/05 - 15:59:57 | 500 | 5.5825397s | 127.0.0.1 | POST "/api/chat"
time=2024-07-05T16:00:02.560+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.021286 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-05T16:00:02.811+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2725836 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa
time=2024-07-05T16:00:03.062+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5234575 model=C:\Users\DELL.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa

@someone2018 commented on GitHub (Jul 5, 2024): same error here. @dhiltgen Here are my logs : 2024/07/05 14:14:20 routes.go:1064: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE: OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:C:\\Users\\DELL\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\DELL\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-07-05T14:14:20.644+02:00 level=INFO source=images.go:730 msg="total blobs: 0" time=2024-07-05T14:14:20.644+02:00 level=INFO source=images.go:737 msg="total unused blobs removed: 0" time=2024-07-05T14:14:20.645+02:00 level=INFO source=routes.go:1111 msg="Listening on 127.0.0.1:11434 (version 0.1.48)" time=2024-07-05T14:14:20.645+02:00 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx2 cuda_v11.3 rocm_v5.7 cpu cpu_avx]" time=2024-07-05T14:14:21.533+02:00 level=INFO source=types.go:98 msg="inference compute" id=GPU-15161996-1a7c-8143-bc65-810c3bf997fb library=cuda compute=7.5 driver=0.0 name="" total="6.0 GiB" available="5.0 GiB" [GIN] 2024/07/05 - 14:14:33 | 200 | 0s | 127.0.0.1 | HEAD "/" [GIN] 2024/07/05 - 14:14:33 | 404 | 575.7µs | 127.0.0.1 | POST "/api/show" time=2024-07-05T14:14:35.466+02:00 level=INFO source=download.go:136 msg="downloading 6a0746a1ec1a in 47 100 MB part(s)" time=2024-07-05T14:17:40.917+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 11 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:17:45.920+02:00 level=INFO source=download.go:251 msg="6a0746a1ec1a part 11 stalled; retrying. If this persists, press ctrl-c to exit, then 'ollama pull' to find a faster connection." time=2024-07-05T14:18:16.571+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 16 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:18:48.245+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 7 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:18:56.308+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 23 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:18:59.772+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 9 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:00.704+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 29 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:15.866+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 5 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:21.075+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 19 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:31.399+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 33 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:37.085+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 20 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:42.827+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 45 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:19:49.355+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 26 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:04.830+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 34 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:21.486+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 37 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:31.388+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 4 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:34.714+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 2 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:43.434+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 38 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:20:50.118+02:00 level=INFO source=download.go:178 msg="6a0746a1ec1a part 44 attempt 0 failed: unexpected EOF, retrying in 1s" time=2024-07-05T14:21:27.716+02:00 level=INFO source=download.go:136 msg="downloading 4fa551d4f938 in 1 12 KB part(s)" time=2024-07-05T14:21:29.588+02:00 level=INFO source=download.go:136 msg="downloading 8ab4849b038c in 1 254 B part(s)" time=2024-07-05T14:21:31.512+02:00 level=INFO source=download.go:136 msg="downloading 577073ffcc6c in 1 110 B part(s)" time=2024-07-05T14:21:33.288+02:00 level=INFO source=download.go:136 msg="downloading 3f8eb4da87fa in 1 485 B part(s)" [GIN] 2024/07/05 - 14:21:44 | 200 | 7m11s | 127.0.0.1 | POST "/api/pull" [GIN] 2024/07/05 - 14:21:44 | 200 | 17.08ms | 127.0.0.1 | POST "/api/show" time=2024-07-05T14:21:44.703+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.0 GiB" memory.required.partial="5.0 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[5.0 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="677.5 MiB" time=2024-07-05T14:21:44.706+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\\Users\\DELL\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\DELL\\.ollama\\models\\blobs\\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 59978" time=2024-07-05T14:21:44.730+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1 time=2024-07-05T14:21:44.730+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding" time=2024-07-05T14:21:44.730+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3171 commit="7c26775a" tid="10184" timestamp=1720182105 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="10184" timestamp=1720182105 total_threads=12 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="59978" tid="10184" timestamp=1720182105 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe time=2024-07-05T14:21:45.247+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB llama_new_context_with_model: n_ctx = 2048 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 256.00 MiB llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB llama_new_context_with_model: CUDA0 compute buffer size = 258.50 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 12.01 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 INFO [wmain] model loaded | tid="10184" timestamp=1720182107 time=2024-07-05T14:21:48.178+02:00 level=INFO source=server.go:599 msg="llama runner started in 3.45 seconds" [GIN] 2024/07/05 - 14:21:48 | 200 | 3.5127842s | 127.0.0.1 | POST "/api/chat" [GIN] 2024/07/05 - 14:22:07 | 200 | 11.0666417s | 127.0.0.1 | POST "/api/chat" time=2024-07-05T14:26:05.227+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="487.5 MiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="283.4 MiB" memory.graph.partial="677.5 MiB" time=2024-07-05T14:26:05.230+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\\Users\\DELL\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\DELL\\.ollama\\models\\blobs\\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 3900 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 60034" time=2024-07-05T14:26:05.233+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1 time=2024-07-05T14:26:05.233+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding" time=2024-07-05T14:26:05.234+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3171 commit="7c26775a" tid="23620" timestamp=1720182365 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="23620" timestamp=1720182365 total_threads=12 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="60034" tid="23620" timestamp=1720182365 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors time=2024-07-05T14:26:05.485+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model" llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB llama_new_context_with_model: n_ctx = 3904 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 488.00 MiB llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB llama_new_context_with_model: CUDA0 compute buffer size = 283.63 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 15.63 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 CUDA error: CUBLAS_STATUS_ALLOC_FAILED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:826 cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error" time=2024-07-05T14:26:08.104+02:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error\"" [GIN] 2024/07/05 - 14:26:08 | 500 | 3.2221016s | 127.0.0.1 | POST "/api/chat" time=2024-07-05T14:26:13.138+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.0337086 model=C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T14:26:13.386+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2815106 model=C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T14:26:13.634+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5300006 model=C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa [GIN] 2024/07/05 - 14:30:22 | 404 | 0s | 127.0.0.1 | GET "/api/chat" [GIN] 2024/07/05 - 14:31:49 | 200 | 0s | 127.0.0.1 | GET "/" [GIN] 2024/07/05 - 14:32:21 | 404 | 0s | 127.0.0.1 | GET "/api/chat" time=2024-07-05T15:59:52.039+02:00 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[5.8 GiB]" memory.required.full="5.4 GiB" memory.required.partial="5.4 GiB" memory.required.kv="487.5 MiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="4.1 GiB" memory.weights.repeating="3.7 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="283.4 MiB" memory.graph.partial="677.5 MiB" time=2024-07-05T15:59:52.044+02:00 level=INFO source=server.go:368 msg="starting llama server" cmd="C:\\Users\\DELL\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\DELL\\.ollama\\models\\blobs\\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 3900 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --no-mmap --parallel 1 --port 60942" time=2024-07-05T15:59:52.073+02:00 level=INFO source=sched.go:382 msg="loaded runners" count=1 time=2024-07-05T15:59:52.073+02:00 level=INFO source=server.go:556 msg="waiting for llama runner to start responding" time=2024-07-05T15:59:52.074+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" INFO [wmain] build info | build=3171 commit="7c26775a" tid="11152" timestamp=1720187992 INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="11152" timestamp=1720187992 total_threads=12 INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="60942" tid="11152" timestamp=1720187992 llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct llama_model_loader: - kv 2: llama.block_count u32 = 32 llama_model_loader: - kv 3: llama.context_length u32 = 8192 llama_model_loader: - kv 4: llama.embedding_length u32 = 4096 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.attention.head_count u32 = 32 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 128256 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2 llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe time=2024-07-05T15:59:52.846+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server loading model" llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ... llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "... llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000 llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009 llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ... llama_model_loader: - kv 21: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q4_0: 225 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 256 llm_load_vocab: token to piece cache size = 0.8000 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = BPE llm_load_print_meta: n_vocab = 128256 llm_load_print_meta: n_merges = 280147 llm_load_print_meta: n_ctx_train = 8192 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 500000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 8192 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 8B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 8.03 B llm_load_print_meta: model size = 4.33 GiB (4.64 BPW) llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>' llm_load_print_meta: EOS token = 128009 '<|eot_id|>' llm_load_print_meta: LF token = 128 'Ä' llm_load_print_meta: EOT token = 128009 '<|eot_id|>' ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes ggml_cuda_init: found 1 CUDA devices: Device 0: Quadro RTX 3000, compute capability 7.5, VMM: yes llm_load_tensors: ggml ctx size = 0.30 MiB llm_load_tensors: offloading 32 repeating layers to GPU llm_load_tensors: offloading non-repeating layers to GPU llm_load_tensors: offloaded 33/33 layers to GPU llm_load_tensors: CUDA_Host buffer size = 281.81 MiB llm_load_tensors: CUDA0 buffer size = 4155.99 MiB llama_new_context_with_model: n_ctx = 3904 llama_new_context_with_model: n_batch = 512 llama_new_context_with_model: n_ubatch = 512 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 500000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CUDA0 KV buffer size = 488.00 MiB llama_new_context_with_model: KV self size = 488.00 MiB, K (f16): 244.00 MiB, V (f16): 244.00 MiB llama_new_context_with_model: CUDA_Host output buffer size = 0.50 MiB llama_new_context_with_model: CUDA0 compute buffer size = 283.63 MiB llama_new_context_with_model: CUDA_Host compute buffer size = 15.63 MiB llama_new_context_with_model: graph nodes = 1030 llama_new_context_with_model: graph splits = 2 CUDA error: CUBLAS_STATUS_ALLOC_FAILED current device: 0, in function cublas_handle at C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda/common.cuh:826 cublasCreate_v2(&cublas_handles[device]) GGML_ASSERT: C:\a\ollama\ollama\llm\llama.cpp\ggml-cuda.cu:100: !"CUDA error" time=2024-07-05T15:59:57.275+02:00 level=INFO source=server.go:594 msg="waiting for server to become available" status="llm server error" time=2024-07-05T15:59:57.538+02:00 level=ERROR source=sched.go:388 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 CUDA error\"" [GIN] 2024/07/05 - 15:59:57 | 500 | 5.5825397s | 127.0.0.1 | POST "/api/chat" time=2024-07-05T16:00:02.560+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.021286 model=C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T16:00:02.811+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.2725836 model=C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa time=2024-07-05T16:00:03.062+02:00 level=WARN source=sched.go:575 msg="gpu VRAM usage didn't recover within timeout" seconds=5.5234575 model=C:\Users\DELL\.ollama\models\blobs\sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa

GiteaMirror commented

2026-04-12 13:06:03 -05:00

@dhiltgen commented on GitHub (Jul 22, 2024):

@someone2018 your error looks like an OOM problem. We failed to partially load with 5G available on the GPU. Please make sure to update to the latest version and if you're still hitting the OOM crash, please let us know which model you were trying to load.

@dhiltgen commented on GitHub (Jul 22, 2024): @someone2018 your error looks like an OOM problem. We failed to partially load with 5G available on the GPU. Please make sure to update to the latest version and if you're still hitting the OOM crash, please let us know which model you were trying to load.

GiteaMirror commented

2026-04-12 13:06:04 -05:00

@lijunfeng11 commented on GitHub (Jul 23, 2024):

@someone2018 your error looks like an OOM problem. We failed to partially load with 5G available on the GPU. Please make sure to update to the latest version and if you're still hitting the OOM crash, please let us know which model you were trying to load.

Okay, I'm trying it out. The model I'm using is llama3

@lijunfeng11 commented on GitHub (Jul 23, 2024): > @someone2018 your error looks like an OOM problem. We failed to partially load with 5G available on the GPU. Please make sure to update to the latest version and if you're still hitting the OOM crash, please let us know which model you were trying to load. Okay, I'm trying it out. The model I'm using is llama3

GiteaMirror commented

2026-04-12 13:06:05 -05:00

@lijunfeng11 commented on GitHub (Jul 23, 2024):

@someone2018 your error looks like an OOM problem. We failed to partially load with 5G available on the GPU. Please make sure to update to the latest version and if you're still hitting the OOM crash, please let us know which model you were trying to load.

Okay, I'm trying it out. The model I'm using is llama3

@lijunfeng11 commented on GitHub (Jul 23, 2024): > @someone2018 your error looks like an OOM problem. We failed to partially load with 5G available on the GPU. Please make sure to update to the latest version and if you're still hitting the OOM crash, please let us know which model you were trying to load. Okay, I'm trying it out. The model I'm using is llama3

GiteaMirror commented

2026-04-12 13:06:06 -05:00

@dhiltgen commented on GitHub (Aug 9, 2024):

I'm going to close this one out. We should detect most failures and report a better error message now than 0xc0000409 and folks can find other similar issues to +1, or open new ones.

@dhiltgen commented on GitHub (Aug 9, 2024): I'm going to close this one out. We should detect most failures and report a better error message now than `0xc0000409` and folks can find other similar issues to +1, or open new ones.

GiteaMirror commented

2026-04-12 13:06:07 -05:00

@metouitude commented on GitHub (Jan 31, 2025):

Guys are you all trying with powershell ?
Try raw CMD of windows.

@metouitude commented on GitHub (Jan 31, 2025): Guys are you all trying with powershell ? Try raw CMD of windows.

GiteaMirror commented

2026-04-12 13:06:09 -05:00

@moein459 commented on GitHub (Mar 6, 2025):

I know this issue is a bit old, but I wanted to share my workaround:

Install the original version of Ollama.
Replace the entire rocm folder inside ollama-windows-amd64.7z from this repo's releases: Ollama for AMD.
Replace the corresponding ROCm libraries (lib and .dll files) based on my GPU from this repo: ROCmLibs for gfx1103 (AMD 780M APU).
Restart Ollama, and everything works—especially with my GPU (RX 6700 XT).

Just to clarify for the devs, my error was related to the ROCm library and mentioned issues with GGML files. Maybe because my gpu isn't officially supported.

Hope this helps! 🚀

@moein459 commented on GitHub (Mar 6, 2025): I know this issue is a bit old, but I wanted to share my workaround: 1. Install the original version of Ollama. 2. Replace the entire rocm folder inside ollama-windows-amd64.7z from this repo's releases: [Ollama for AMD](https://github.com/likelovewant/ollama-for-amd/releases). 3. Replace the corresponding ROCm libraries (lib and .dll files) based on my GPU from this repo: [ROCmLibs for gfx1103 (AMD 780M APU)](https://github.com/likelovewant/ROCmLibs-for-gfx1103-AMD780M-APU/releases/tag/v0.6.1.2). 4. Restart Ollama, and everything works—especially with my GPU (RX 6700 XT). Just to clarify for the devs, my error was related to the ROCm library and mentioned issues with GGML files. Maybe because my gpu isn't officially supported. Hope this helps! 🚀

GiteaMirror referenced this issue

2026-04-22 04:46:46 -05:00

[GH-ISSUE #2774] What is the different between /api/generate and /api/chat? #27432

GiteaMirror referenced this issue

2026-04-28 07:04:35 -05:00

[GH-ISSUE #2774] What is the different between /api/generate and /api/chat? #48184

GiteaMirror referenced this issue

2026-05-03 14:45:29 -05:00

[GH-ISSUE #2774] What is the different between /api/generate and /api/chat? #63710

GiteaMirror referenced this issue

2026-05-09 04:55:47 -05:00

[GH-ISSUE #2774] What is the different between /api/generate and /api/chat? #79350

Sign in to join this conversation.

Branches Tags

main

hoyyeva/fix-claude-channels-env

parth-update-hermes-launch

hoyyeva/vscode-extension-docs-update

parth-gemma4-chat-template-renderer

parth-api-status-context-length

hoyyeva/wire-up-context-length

hoyyeva/claude-code-context-doc

jmorganca/investigate-issue-17046

hoyyeva/hermes-docs

jmorganca/agent-loop-style

hoyyeva/openclaw

parth-agent-loop

hoyyeva/ollama-vscode-extension

brucemacd/cache-metrics

brucemacd/hermes-desktop

hoyyeva/docs-vscode

parth-input-style-experiment

brucemacd/docs-glm52

hoyyeva/poc-docs

Parth/mlx-launch-recommendations

parth-first-time-app-cli-experience

test/darwin-xcode-pin

improve-cloud-model-recommendations

hoyyeva/goose-docs

jmorganca/context-limit-fixes

hoyyeva/qwen-doc

hoyyeva/vscode-docs

jmorganca/remove-mlx-imagegen-code

parth-copilot-token-length-defaults

hoyyeva/poolside-windows

laguna-support

jmorganca/harden-markdown-rendering

laguna-renderer-parser

laguna-llamacpp

codex/make-integration-hidden-and-lunchable

brucemacd/omp-docs

pdevine/gguf-mtp-oldstyle

hoyyeva/migrate-pi

hoyyeva/anthropic-local-image-path

parth-launch-codex-app

hoyyeva/anthropic-reference-images-path

parth-anthropic-reference-images-path

brucemacd/download-before-remove

hoyyeva/editor-config-repair

parth-mlx-decode-checkpoints

parth/hide-claude-desktop-till-release

parth-add-claude-code-autoinstall

release_v0.22.0

pdevine/manifest-list

codex/fix-codex-model-metadata-warning

pdevine/addressable-manifest

brucemacd/launch-fetch-reccomended

jmorganca/llama-compat

launch-copilot-cli

release_v0.20.7

parth-auto-save-backup

parth-test

jmorganca/gemma4-audio-replacements

fix-manifest-digest-on-pull

hoyyeva/vscode-improve

brucemacd/install-server-wait

parth/update-claude-docs

brucemac/start-ap-install

pdevine/mlx-update

pdevine/qwen35_vision

drifkin/api-show-fallback

mintlify/image-generation-1773352582

hoyyeva/server-context-length-local-config

jmorganca/faster-reptition-penalties

jmorganca/convert-nemotron

parth-pi-thinking

pdevine/sampling-penalties

jmorganca/fix-create-quantization-memory

dongchen/resumable_transfer_fix

pdevine/sampling-cache-error

jessegross/mlx-usage

hoyyeva/openclaw-config

hoyyeva/app-html

pdevine/qwen3next

brucemacd/sign-sh-install

brucemacd/tui-update

brucemacd/usage-api

jmorganca/launch-empty

fix-app-dist-embed

mxyng/mlx-compile

mxyng/mlx-quant

mxyng/mlx-glm4.7

mxyng/mlx

brucemacd/simplify-model-picker

jmorganca/qwen3-concurrent

fix-glm-4.7-flash-mla-config

drifkin/qwen3-coder-opening-tag

brucemacd/usage-cli

fix-cuda12-fattn-shmem

ollama-imagegen-docs

parth/fix-multiline-inputs

brucemacd/config-docs

mxyng/model-files

mxyng/simple-execute

fix-imagegen-ollama-models

mxyng/async-upload

jmorganca/lazy-no-dtype-changes

imagegen-auto-detect-create

parth/decrease-concurrent-download-hf

fix-mlx-quantize-init

jmorganca/x-cleanup

usage

imagegen-readme

jmorganca/glm-image

mlx-gpu-cd

jmorganca/imagegen-modelfile

parth/agent-skills

parth/agent-allowlist

parth/signed-in-offline

parth/agents

parth/fix-context-chopping

improve-cloud-flow

parth/add-models-websearch

parth/prompt-renderer-mcp

jmorganca/native-settings

jmorganca/download-stream-hash

jmorganca/client2-rebased

brucemacd/oai-chat-req-multipart

jessegross/multi_chunk_reserve

grace/additional-omit-empty

grace/mistral-3-large

mxyng/tokenizer2

mxyng/tokenizer

jessegross/flash

hoyyeva/windows-nacked-app

mxyng/cleanup-attention

grace/deepseek-parser

hoyyeva/remember-unsent-prompt

parth/add-lfs-pointer-error-conversion

parth/olmo2-test2

hoyyeva/ollama-launchagent-plist

nicole/olmo-model

parth/olmo-test

mxyng/remove-embedded

parth/render-template

jmorganca/intellect-3

parth/remove-prealloc-linter

jmorganca/cmd-eval

nicole/nomic-embed-text-fix

mxyng/lint-2

hoyyeva/add-gemini-3-pro-preview

hoyyeva/load-model-list

mxyng/expand-path

mxyng/environ-2

hoyyeva/deeplink-json-encoding

parth/improve-tool-calling-tests

hoyyeva/conversation

hoyyeva/assistant-edit-response

hoyyeva/thinking

origin/brucemacd/invalid-char-i-err

parth/improve-tool-calling

jmorganca/required-omitempty

grace/qwen3-vl-tests

mxyng/iter-client

parth/docs-readme

nicole/embed-test

pdevine/integration-benchstat

parth/remove-generate-cmd

parth/add-toolcall-id

mxyng/server-tests

jmorganca/glm-4.6

jmorganca/gin-h-compat

drifkin/stable-tool-args

pdevine/qwen3-more-thinking

parth/add-websearch-client

nicole/websearch_local

jmorganca/qwen3-coder-updates

grace/deepseek-v3-migration-tests

mxyng/fix-create

jmorganca/cloud-errors

pdevine/parser-tidy

revert-12233-parth/simplify-entrypoints-runner

parth/enable-so-gpt-oss

brucemacd/qwen3vl

jmorganca/readme-simplify

parth/gpt-oss-structured-outputs

revert-12039-jmorganca/tools-braces

mxyng/embeddings

mxyng/gguf

mxyng/benchmark

mxyng/types-null

parth/move-parsing

mxyng/gemma2

jmorganca/docs

mxyng/16-bit

mxyng/create-stdin

pdevine/authorizedkeys

mxyng/quant

parth/opt-in-error-context-window

brucemacd/cache-models

brucemacd/runner-completion

jmorganca/llama-update-6

brucemacd/benchmark-list

brucemacd/partial-read-caps

parth/deepseek-r1-tools

mxyng/omit-array

parth/tool-prefix-temp

brucemacd/runner-test

jmorganca/qwen25vl

brucemacd/model-forward-test-ext

parth/python-function-parsing

jmorganca/cuda-compression-none

drifkin/num-parallel

drifkin/chat-truncation-fix

jmorganca/sync

parth/python-tools-calling

drifkin/array-head-count

brucemacd/create-no-loop

parth/server-enable-content-stream-with-tools

qwen25omni

mxyng/v3

brucemacd/ropeconfig

jmorganca/silence-tokenizer

parth/sample-so-test

parth/sampling-structured-outputs

brucemacd/doc-go-engine

parth/constrained-sampling-json

jmorganca/mistral-wip

brucemacd/mistral-small-convert

parth/sample-unmarshal-json-for-params

brucemacd/jomorganca/mistral

pdevine/bfloat16

jmorganca/mistral

brucemacd/mistral

pdevine/logging

parth/sample-correctness-fix

parth/sample-fix-sorting

jmorgan/sample-fix-sorting-extras

jmorganca/temp-0-images

brucemacd/parallel-embed-models

brucemacd/shim-grammar

jmorganca/fix-gguf-error

bmizerany/nameswork

jmorganca/faster-releases

bmizerany/validatenames

brucemacd/err-no-vocab

brucemacd/rope-config

brucemacd/err-hint

brucemacd/qwen2_5

brucemacd/logprobs

brucemacd/new_runner_graph_bench

progress-flicker

brucemacd/forward-test

brucemacd/go_qwen2

pdevine/gemma2

jmorganca/add-missing-symlink-eval

mxyng/next-debug

parth/set-context-size-openai

brucemacd/next-bpe-bench

brucemacd/next-bpe-test

brucemacd/new_runner_e2e

brucemacd/new_runner_qwen2

pdevine/convert-cohere2

brucemacd/convert-cli

parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/ollama#2774