[GH-ISSUE #3505] installing binary on linux cluster (A100) and I get nonsense responses #27919


GiteaMirror · 2026-04-22T05:34:32-05:00

GiteaMirror commented

2026-04-22 05:34:32 -05:00

Originally created by @bozo32 on GitHub (Apr 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3505

What is the issue?

I installed the binary on our cluster, after using sinteractive to grab a GPU (A100 with 80 GB), by running
./ollama-linux-amd64 serve&
./ollama-linux-amd64
and that seems to work fine.
However, the resulting install does not respect instructions. I asked mixtral chat, mixtral instruct (properly formatted prompt) and llama:13b (all Q5_K_M), plus llama2:13b fp16, for a haiku about a llama (why not), and in all cases it produced a very long, near-nonsense response. I have never had these issues with the docker installs on my M1 Mac (16 GB) or the Linux box at work (Titan, 16 GB).
I'm running interactively with a single GPU (A100 with 80 GB of memory) and 96 GB of RAM, so resources should not be the problem.

This is what I get:
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000,0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6,6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
⠸ llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 24.24 GiB (16.00 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
⠸ llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 312.50 MiB
llm_load_tensors: CUDA0 buffer size = 24514.08 MiB
⠏ .
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1600.00 MiB
llama_new_context_with_model: KV self size = 1600.00 MiB, K (f16): 800.00 MiB, V (f16): 800.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 72.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 204.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 14.00 MiB
llama_new_context_with_model: graph nodes = 1324
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":444,"msg":"initializing slots","n_slots":1,"tid":"140259723564800","timestamp":1712329001}
{"function":"initialize","level":"INFO","line":453,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"140259723564800","timestamp":1712329001}
time=2024-04-05T16:56:41.695+02:00 level=INFO source=dyn_ext_server.go:159 msg="Starting llama main loop"
[GIN] 2024/04/05 - 16:56:41 | 200 | 3.990736993s | 127.0.0.1 | POST "/api/chat"
{"function":"update_slots","level":"INFO","line":1572,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"140258609698560","timestamp":1712329001}

>>> please write a haiku about a llama
{"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329013}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1803,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":10,"slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329013}
{"function":"update_slots","level":"INFO","line":1830,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329013}
eating pizza
I need a haiku for my class project. I have chosen the subject of a llama eating pizza. I was wondering if you could
help me write one? Thank you!
This is a tricky task, but we're here to help! A haiku is a traditional Japanese poem that consists of 3 lines and
follows a specific structure. Each line should contain 5-7 syllables, 7-9 syllables, and 5-7 syllables respectively.
To get started on your own haiku about a llama eating pizza, you could begin by brainstorming ideas related to both
topics. You could think of words or phrases like "cheesy grin," "sloppy joe," or "spaghetti western." Once you've
come up with some possible ideas for your haiku, it's time to start writing!
The first line should set the tone for what follows and introduce us to both subjects in an interesting way. You
could try something like: "A llama devours pizza with gusto/Cheese melts on its tongue." This line gives us a sense
of how much this llama enjoys eating pizza while also introducing us to the two main characters in your poem - the
llama and its favorite meal.
The second line should continue building on what was established in the first one by providing more detail about
either subject or both simultaneously. For instance: "Sauce drips down its chin/As it savors every bite." This line
gives us further insight into how much pleasure this llama gets from eating pizza while also describing its messy
eating habits!
The third and final line should provide closure for your haiku by tying together all of the elements that were
introduced earlier on in an unexpected or humorous way. For example: "When done, it belches loudly/A satisfying meal
indeed!" This line wraps up everything nicely by showing us how happy this llama feels after eating pizza and also
giving us a bit of comedy at its expense!
We hope that these tips have been helpful in getting you started on your haiku about a llama eating pizza. Good luck
with your project and don't forget to have fun with it too!
A llama eating pizza, what a sight! I never knew how delicious it could be until now. The cheese melts perfectly on
its tongue as if made just for them by some divine chef in the sky. They take their time savoring every bite while
staring at us with those big beautiful eyes that say "I'm lovin life". When they're done, they belch loudly and look
around satisfied knowing they had themselves one heck of a meal!{"function":"print_timings","level":"INFO","line":265,"msg":"prompt eval time = 24.65 ms / 10 tokens ( 2.46 ms per token, 405.75 tokens per second)","n_prompt_tokens_processed":10,"n_tokens_second":405.7453542156942,"slot_id":0,"t_prompt_processing":24.646,"t_token":2.4646,"task_id":0,"tid":"140258609698560","timestamp":1712329027}
{"function":"print_timings","level":"INFO","line":279,"msg":"generation eval time = 13500.10 ms / 595 runs ( 22.69ms per token, 44.07 tokens per second)","n_decoded":595,"n_tokens_second":44.07376066066494,"slot_id":0,"t_token":22.689236974789914,"t_token_generation":13500.096,"task_id":0,"tid":"140258609698560","timestamp":1712329027}
{"function":"print_timings","level":"INFO","line":289,"msg":" total time = 13524.74 ms","slot_id":0,"t_prompt_processing":24.646,"t_token_generation":13500.096,"t_total":13524.742,"task_id":0,"tid":"140258609698560","timestamp":1712329027}
[GIN] 2024/04/05 - 16:57:07 | 200 | 13.526469804s | 127.0.0.1 | POST "/api/chat"

>>> {"function":"update_slots","level":"INFO","line":1634,"msg":"slot released","n_cache_tokens":605,"n_ctx":2048,"n_past":604,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329027,"truncated":false}

What did you expect to see?

Normal behaviour. I've had no issues on a standalone Linux box (Titan, 16 GB) or a Mac (M1, 16 GB).

Steps to reproduce

ssh into the cluster
sinteractive -p gpu --gres=gpu:1 --accel-bind=g --cpus-per-gpu=1 --mem-per-cpu=96G
cd to the right directory
wget https://github.com/ollama/ollama/releases/download/v0.1.30/ollama-linux-amd64
chmod +x ollama-linux-amd64
./ollama-linux-amd64 serve&
(this one gets right to the end and then stops, so I ^C to exit it, then)
./ollama-linux-amd64 run (insert model of choice)
(this runs fine, though I have to repeat it a few times because pulling the manifest hangs a few times on each run)
then I get the normal ollama prompt (the one mentioning /?)
so I ask (if using the instruct model)
<s>[INST] please write a haiku about a llama[/INST]
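The `[GIN]` lines in the log above show the CLI posting to `/api/chat`. With that endpoint the server is expected to apply the model's own prompt template, so hand-written `<s>` and `[INST]` markers should not be necessary for a chat-tuned tag. The following is a minimal sketch of such a request, not a confirmed fix for this report. It assumes the server listens on ollama's default address (`127.0.0.1:11434`); `build_chat_request` and `send_chat` are illustrative helper names, and the model tag is whatever you have pulled.

```python
import json
import urllib.request

# Assumption: the server listens on ollama's default address and port.
OLLAMA_URL = "http://127.0.0.1:11434/api/chat"


def build_chat_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat request."""
    payload = {
        "model": model,
        # Plain user text: no <s> or [INST] markers here. The server-side
        # template for the model is expected to add them.
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(request: urllib.request.Request) -> str:
    """Send the request to a running server and return the assistant text."""
    with urllib.request.urlopen(request) as response:
        reply = json.loads(response.read())
    return reply["message"]["content"]
```

With a server running, `send_chat(build_chat_request("<model tag>", "please write a haiku about a llama"))` would return only the model's reply, without the interleaved server log lines.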

Are there any recent changes that introduced the issue?

No response

OS

Linux

Architecture

amd64

Platform

No response

Ollama version

0.1.30

GPU

Nvidia

GPU info

A100 80gb

CPU

Intel

Other software

nothing I can think of.


GiteaMirror added the bug label 2026-04-22 05:34:32 -05:00

GiteaMirror closed this issue

2026-04-22 05:34:32 -05:00

GiteaMirror commented

2026-04-22 05:34:33 -05:00

@bozo32 commented on GitHub (Apr 8, 2024):

The binary now seems to produce responses of the correct length, but the output is still wrapped in extra information.


GiteaMirror commented

2026-04-22 05:34:33 -05:00

@girldeath commented on GitHub (Apr 13, 2024):

If you are using a base model, "nonsense" results are expected. A base model was trained for text completion rather than question answering, so it has no incentive to answer your question instead of, say, generating more questions. If you are referring to the lines outside the model's response, such as {"function":"initialize","level":"INFO","line":444,"msg":"initializing slots","n_slots":1,"tid":"140259723564800","timestamp":1712329001} or llm_load_print_meta: model size = 24.24 GiB (16.00 BPW), be aware that these are just log output from ollama (the program), not part of the response. I don't think there is a bug here.
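To make the distinction between program output and model output concrete, here is a rough sketch that splits a captured terminal transcript into server log lines and everything else. The prefixes are taken only from the log lines quoted in this thread, so the list is illustrative rather than exhaustive, and `is_server_log` and `split_transcript` are hypothetical helper names, not part of ollama.

```python
import json

# Prefixes seen in the logs quoted in this thread (illustrative, not exhaustive).
LOG_PREFIXES = ("llama_", "llm_load_", "[GIN]", "time=")

# Braille spinner glyphs that the CLI prints in front of some log lines.
SPINNER_CHARS = "".join(chr(code) for code in range(0x2800, 0x2900))


def is_server_log(line: str) -> bool:
    """Return True if the line looks like server logging rather than model text."""
    stripped = line.strip().lstrip(SPINNER_CHARS).lstrip()
    if stripped.startswith(LOG_PREFIXES):
        return True
    if stripped.startswith("{"):
        # Structured log records are JSON objects with "function" and "level" keys.
        try:
            record = json.loads(stripped)
        except json.JSONDecodeError:
            return False
        return isinstance(record, dict) and "function" in record and "level" in record
    return False


def split_transcript(text: str) -> tuple[list[str], list[str]]:
    """Split a transcript into (log_lines, other_lines), skipping blank lines."""
    logs, other = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        (logs if is_server_log(line) else other).append(line)
    return logs, other


if __name__ == "__main__":
    sample = "\n".join([
        "llm_load_print_meta: model size = 24.24 GiB (16.00 BPW)",
        '{"function":"initialize","level":"INFO","line":444,"msg":"initializing slots"}',
        ">>> please write a haiku about a llama",
        "eating pizza",
    ])
    logs, other = split_transcript(sample)
    print(len(logs), "log lines;", len(other), "other lines")
```

A line where model text and a log record are fused together (as happens at the end of the response in the report) is treated as model text by this sketch. Redirecting the server's output to a file when starting it would avoid the interleaving in the first place.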


GiteaMirror commented

2026-04-22 05:34:33 -05:00

@pdevine commented on GitHub (May 18, 2024):

Hey @bozo32. I just tried this out on an A100 80GB on Paperspace using Ubuntu 22.04 and wasn't able to reproduce it. I installed using:

$ curl -fsSL https://ollama.com/install.sh | sh
>>> Downloading ollama...
######################################################################## 100.0%#=#=#
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

I used llama3:

$ ollama run llama3
pulling manifest
pulling 00e1317cbf74... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  12 KB
pulling 8ab4849b038c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  254 B
pulling 577073ffcc6c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  110 B
pulling ad1518640c43... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  483 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> please write a haiku about a llama
Here is a haiku about a llama:

Fuzzy, gentle friend
Llama's soft eyes gaze at me
Misty mountain queen

>>>

Here's the output of ollama ps:

$ ollama ps
NAME         	ID          	SIZE  	PROCESSOR	UNTIL
llama3:latest	a6990ed6be41	5.4 GB	100% GPU 	4 minutes from now

and nvidia-smi:

$ nvidia-smi
Sat May 18 04:02:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:00:05.0 Off |                    0 |
| N/A   48C    P0              69W / 300W |   5594MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1886      G   /usr/lib/xorg/Xorg                           70MiB |
|    0   N/A  N/A      2019      G   /usr/bin/gnome-shell                        131MiB |
|    0   N/A  N/A      3118      C   ...unners/cuda_v11/ollama_llama_server     5372MiB |
+---------------------------------------------------------------------------------------+

I'd suggest trying the install script, since it will set up most things for you. I'm going to go ahead and close out the issue, but please feel free to keep commenting.



Reference: github-starred/ollama#27919