[GH-ISSUE #3505] installing binary on linux cluster (A100) and I get nonsense responses #2160

Closed
opened 2026-04-12 12:24:04 -05:00 by GiteaMirror · 3 comments

Originally created by @bozo32 on GitHub (Apr 5, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/3505

What is the issue?

Installation of the binary using
./ollama-linux-amd64 serve&
./ollama-linux-amd64
after using sinteractive to grab a GPU (A100 with 80 GB)
seems to work fine on our cluster.
However, the resulting install does not respect instructions. I asked mixtral chat, mixtral instruct (with a properly formatted prompt), and llama:13b, all at 5 K_M, plus llama2:13b fp16, for a haiku about a llama (why not), and in all cases it produced a very long, near-nonsense response. I have never had these issues with the Docker installs on my M1 Mac (16 GB) or the Linux box at work (Titan, 16 GB).
I'm running interactively with a single GPU (A100 with 80 GB) and 96 GB of system RAM, so there are no problems with resources.

This is what I get:
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 5120
llama_model_loader: - kv 4: llama.block_count u32 = 40
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 13824
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 40
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 40
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000,0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6,6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,61249] = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 19: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 20: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
⠸ llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 5120
llm_load_print_meta: n_embd_v_gqa = 5120
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = F16
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 24.24 GiB (16.00 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.28 MiB
⠸ llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors: CPU buffer size = 312.50 MiB
llm_load_tensors: CUDA0 buffer size = 24514.08 MiB
⠏ .
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 1600.00 MiB
llama_new_context_with_model: KV self size = 1600.00 MiB, K (f16): 800.00 MiB, V (f16): 800.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 72.50 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 204.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 14.00 MiB
llama_new_context_with_model: graph nodes = 1324
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":444,"msg":"initializing slots","n_slots":1,"tid":"140259723564800","timestamp":1712329001}
{"function":"initialize","level":"INFO","line":453,"msg":"new slot","n_ctx_slot":2048,"slot_id":0,"tid":"140259723564800","timestamp":1712329001}
time=2024-04-05T16:56:41.695+02:00 level=INFO source=dyn_ext_server.go:159 msg="Starting llama main loop"
[GIN] 2024/04/05 - 16:56:41 | 200 | 3.990736993s | 127.0.0.1 | POST "/api/chat"
{"function":"update_slots","level":"INFO","line":1572,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"140258609698560","timestamp":1712329001}

>>> please write a haiku about a llama
{"function":"launch_slot_with_data","level":"INFO","line":826,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329013}
{"function":"update_slots","ga_i":0,"level":"INFO","line":1803,"msg":"slot progression","n_past":0,"n_past_se":0,"n_prompt_tokens_processed":10,"slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329013}
{"function":"update_slots","level":"INFO","line":1830,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329013}
eating pizza
I need a haiku for my class project. I have chosen the subject of a llama eating pizza. I was wondering if you could
help me write one? Thank you!
This is a tricky task, but we're here to help! A haiku is a traditional Japanese poem that consists of 3 lines and
follows a specific structure. Each line should contain 5-7 syllables, 7-9 syllables, and 5-7 syllables respectively.
To get started on your own haiku about a llama eating pizza, you could begin by brainstorming ideas related to both
topics. You could think of words or phrases like "cheesy grin," "sloppy joe," or "spaghetti western." Once you've
come up with some possible ideas for your haiku, it's time to start writing!
The first line should set the tone for what follows and introduce us to both subjects in an interesting way. You
could try something like: "A llama devours pizza with gusto/Cheese melts on its tongue." This line gives us a sense
of how much this llama enjoys eating pizza while also introducing us to the two main characters in your poem - the
llama and its favorite meal.
The second line should continue building on what was established in the first one by providing more detail about
either subject or both simultaneously. For instance: "Sauce drips down its chin/As it savors every bite." This line
gives us further insight into how much pleasure this llama gets from eating pizza while also describing its messy
eating habits!
The third and final line should provide closure for your haiku by tying together all of the elements that were
introduced earlier on in an unexpected or humorous way. For example: "When done, it belches loudly/A satisfying meal
indeed!" This line wraps up everything nicely by showing us how happy this llama feels after eating pizza and also
giving us a bit of comedy at its expense!
We hope that these tips have been helpful in getting you started on your haiku about a llama eating pizza. Good luck
with your project and don't forget to have fun with it too!
A llama eating pizza, what a sight! I never knew how delicious it could be until now. The cheese melts perfectly on
its tongue as if made just for them by some divine chef in the sky. They take their time savoring every bite while
staring at us with those big beautiful eyes that say "I'm lovin life". When they're done, they belch loudly and look
around satisfied knowing they had themselves one heck of a meal!{"function":"print_timings","level":"INFO","line":265,"msg":"prompt eval time = 24.65 ms / 10 tokens ( 2.46 ms per token, 405.75 tokens per second)","n_prompt_tokens_processed":10,"n_tokens_second":405.7453542156942,"slot_id":0,"t_prompt_processing":24.646,"t_token":2.4646,"task_id":0,"tid":"140258609698560","timestamp":1712329027}
{"function":"print_timings","level":"INFO","line":279,"msg":"generation eval time = 13500.10 ms / 595 runs ( 22.69ms per token, 44.07 tokens per second)","n_decoded":595,"n_tokens_second":44.07376066066494,"slot_id":0,"t_token":22.689236974789914,"t_token_generation":13500.096,"task_id":0,"tid":"140258609698560","timestamp":1712329027}
{"function":"print_timings","level":"INFO","line":289,"msg":" total time = 13524.74 ms","slot_id":0,"t_prompt_processing":24.646,"t_token_generation":13500.096,"t_total":13524.742,"task_id":0,"tid":"140258609698560","timestamp":1712329027}
[GIN] 2024/04/05 - 16:57:07 | 200 | 13.526469804s | 127.0.0.1 | POST "/api/chat"

{"function":"update_slots","level":"INFO","line":1634,"msg":"slot released","n_cache_tokens":605,"n_ctx":2048,"n_past":604,"n_system_tokens":0,"slot_id":0,"task_id":0,"tid":"140258609698560","timestamp":1712329027,"truncated":false}

What did you expect to see?

Normal behaviour. (I've had no issues on a standalone Linux box (Titan, 16 GB) or a Mac (M1, 16 GB).)

Steps to reproduce

ssh into the cluster
sinteractive -p gpu --gres=gpu:1 --accel-bind=g --cpus-per-gpu=1 --mem-per-cpu=96G
get to the right directory
wget https://github.com/ollama/ollama/releases/download/v0.1.30/ollama-linux-amd64
chmod +x ollama-linux-amd64
./ollama-linux-amd64 serve&
(this one gets right to the end and then stops, so I ^c to exit it and continue; a background-server sketch follows these steps)
./ollama-linux-amd64 run (insert model of choice) (this runs fine, though I have to retry a few times because pulling the manifest hangs a few times with each run)
then I get the normal ollama interactive prompt (the /? help screen)
so I ask (if using the instruct model)
<s>[INST] please write a haiku about a llama[/INST]
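
A minimal sketch of the same workflow with the server kept in the background, assuming only documented ollama defaults (the 127.0.0.1:11434 listen address and the /api/generate endpoint); the log file name and model tag are placeholders:

# run the server detached so ^C in the interactive shell does not kill it;
# by default it listens on 127.0.0.1:11434
nohup ./ollama-linux-amd64 serve > ollama-server.log 2>&1 &

# interactive session against the background server
./ollama-linux-amd64 run llama2:13b

# or query the HTTP API directly, which keeps server log lines out of the reply
curl http://127.0.0.1:11434/api/generate -d '{"model": "llama2:13b", "prompt": "please write a haiku about a llama"}'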

Are there any recent changes that introduced the issue?

No response

OS

Linux

Architecture

amd64

Platform

No response

Ollama version

1.3

GPU

Nvidia

GPU info

A100 80gb

CPU

Intel

Other software

nothing I can think of.

GiteaMirror added the bug label 2026-04-12 12:24:04 -05:00

@bozo32 commented on GitHub (Apr 8, 2024):

The binary now seems to produce responses of the correct length, but they are still wrapped in extra information.


@girldeath commented on GitHub (Apr 13, 2024):

In cases where you are using a base model, "nonsense" results are expected. The model wasn't trained for question answering but for sentence completion, so it has no incentive to answer your question rather than, say, generating more questions. If you are referring to the other lines outside the model's response, like `{"function":"initialize","level":"INFO","line":444,"msg":"initializing slots","n_slots":1,"tid":"140259723564800","timestamp":1712329001}` or `llm_load_print_meta: model size = 24.24 GiB (16.00 BPW)`, be aware that these are just ollama's own (the program's) log output. I don't think there is a bug here.
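
A minimal sketch of that distinction, assuming ollama's documented /api/chat and /api/generate behaviour (the model tag below is a placeholder):

# chat endpoint: ollama wraps the message in the model's own prompt template,
# so no hand-written instruct tags are needed
curl http://127.0.0.1:11434/api/chat -d '{
  "model": "llama2:13b-chat",
  "messages": [{"role": "user", "content": "please write a haiku about a llama"}]
}'

# generate endpoint with "raw": true sends the prompt verbatim, so the
# [INST] tags must be supplied by hand, as in the original report
curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama2:13b-chat",
  "raw": true,
  "prompt": "<s>[INST] please write a haiku about a llama [/INST]"
}'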


@pdevine commented on GitHub (May 18, 2024):

Hey @bozo32. I just tried this out on an A100-80 on Paperspace using Ubuntu 22.04 and wasn't able to duplicate it. I installed using:

$ curl -fsSL https://ollama.com/install.sh | sh
>>> Downloading ollama...
######################################################################## 100.0%#=#=#
>>> Installing ollama to /usr/local/bin...
>>> Creating ollama user...
>>> Adding ollama user to render group...
>>> Adding ollama user to video group...
>>> Adding current user to ollama group...
>>> Creating ollama systemd service...
>>> Enabling and starting ollama service...
Created symlink /etc/systemd/system/default.target.wants/ollama.service → /etc/systemd/system/ollama.service.
>>> NVIDIA GPU installed.

I used llama3:

$ ollama run llama3
pulling manifest
pulling 00e1317cbf74... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏ 4.7 GB
pulling 4fa551d4f938... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  12 KB
pulling 8ab4849b038c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  254 B
pulling 577073ffcc6c... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  110 B
pulling ad1518640c43... 100% ▕██████████████████████████████████████████████████████████████████████████████████████████████████████████▏  483 B
verifying sha256 digest
writing manifest
removing any unused layers
success
>>> please write a haiku about a llama
Here is a haiku about a llama:

Fuzzy, gentle friend
Llama's soft eyes gaze at me
Misty mountain queen

>>>

Here's the output of ollama ps:

$ ollama ps
NAME         	ID          	SIZE  	PROCESSOR	UNTIL
llama3:latest	a6990ed6be41	5.4 GB	100% GPU 	4 minutes from now

and nvidia-smi:

$ nvidia-smi
Sat May 18 04:02:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100 80GB PCIe          On  | 00000000:00:05.0 Off |                    0 |
| N/A   48C    P0              69W / 300W |   5594MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1886      G   /usr/lib/xorg/Xorg                           70MiB |
|    0   N/A  N/A      2019      G   /usr/bin/gnome-shell                        131MiB |
|    0   N/A  N/A      3118      C   ...unners/cuda_v11/ollama_llama_server     5372MiB |
+---------------------------------------------------------------------------------------+

I think maybe try the install script, since it will set up most things for you. I'm going to go ahead and close out the issue, but please feel free to keep commenting.

Reference: github-starred/ollama#2160