[GH-ISSUE #7766] ollama hangs randomly and sometimes responds with G's #30721

Closed
opened 2026-04-22 10:37:12 -05:00 by GiteaMirror · 11 comments

Originally created by @Pho3niX90 on GitHub (Nov 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7766

### What is the issue?

I am starting my journey into ollama, so my info below might not align 100% with what you need, but I can provide more as needed.

After the prompts "hang", I need to reboot the service to get it going again.

Short generations seem relatively OK.
Asking for longer responses typically hangs it mid-sentence.
Asking again has it replying with "GGGGGGGGGGGGGGGG".

On the graph below, the response stood still on the 6th prompt:
![image](https://github.com/user-attachments/assets/7672a7bc-be00-4c3d-b2f3-340e7bcc2777)

Here is an example where it gave me the "G's" straight off the bat; I restarted the service, and all was well:
![image](https://github.com/user-attachments/assets/72819cde-0c0c-4b9f-be12-95a4f0c8b7f1)

Until I asked it to write an extra-long essay, multiple times:
![image](https://github.com/user-attachments/assets/f07f3199-91ee-46cb-b3e4-07021a71d912)

Notice the last G output before it hangs.

**More Info**
**Model**: llama3.2
**Params**: --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 29 --parallel 4 --port 32807

  1. VRAM seems to never exceed 3700 MB
  2. It seems only a single CPU thread is being utilized, always at 100%

CPU only gives no issues, at around 10 tokens/s.
GPU is around 90 tokens/s.
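
(For reference, both observations can be double-checked with standard Linux tools while a prompt is running; a minimal sketch, with the runner process name taken from the server logs later in this thread:)

```
# GPU memory use, refreshed every second
watch -n 1 nvidia-smi

# per-thread CPU usage of the llama runner process
top -H -p "$(pgrep -f ollama_llama_server)"
```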

**System Specs:**
CPU: AMD EPYC 32-core
GPU: RTX 3060 12 GB (gen4 8x riser)
Mem: 256 GB DDR4
OS: Ubuntu 22.04
Disk: NVMe
CUDA: 12.6

### OS

Linux

### GPU

Nvidia

### CPU

AMD

### Ollama version

0.4.2

GiteaMirror added the bug label 2026-04-22 10:37:12 -05:00

@rick-github commented on GitHub (Nov 20, 2024):

I'm not sure what happened in the "straight off the bat" one, but repeated G's is a symptom of exceeding the context window. The context window includes both input and output tokens, so if you've had a few rounds of conversation in which longer and longer output is requested, as it looks in the screenshot, then that could be the cause of the problem you see here. There are various ways of modifying the context window, see [here](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size) and [here](https://github.com/ollama/ollama/issues/5965#issuecomment-2252354726).
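
For example, a minimal sketch based on the linked FAQ (8192 here just matches the context size already shown in this report's params):

```
# inside an interactive `ollama run llama3.2` session
/set parameter num_ctx 8192

# or per request via the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "options": { "num_ctx": 8192 }
}'
```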


@jessegross commented on GitHub (Nov 20, 2024):

@Pho3niX90 Can you double-check the version? I don't think the parameters you listed should be present in 0.4.2. It would also help to post [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).
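
A minimal sketch of gathering that on a standard Linux systemd install, per the linked troubleshooting doc (the `OLLAMA_DEBUG` override is optional):

```
# confirm the installed version
ollama -v

# view the server logs under systemd
journalctl -u ollama --no-pager

# optionally enable verbose logging, then restart the service
sudo systemctl edit ollama    # add: Environment="OLLAMA_DEBUG=1"
sudo systemctl restart ollama
```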


@Pho3niX90 commented on GitHub (Nov 21, 2024):

So I was using open-webui; I have now stopped it completely to interact natively with ollama. Doing so, the data is as follows:

**Params**: --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 8 --parallel 4 --port 41329

![image](https://github.com/user-attachments/assets/275d9d24-f0b4-4240-943d-cd5b8ae3b819)
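
(Interacting natively here means hitting the HTTP API directly, along these lines; a sketch with an illustrative prompt:)

```
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    { "role": "user", "content": "Write an extra long essay" }
  ]
}'
```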

journal logs:

```
Nov 21 05:42:46 ollama systemd[1]: Started Ollama Service.
Nov 21 05:42:46 ollama ollama[23630]: 2024/11/21 05:42:46 routes.go:1189: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
Nov 21 05:42:46 ollama ollama[23630]: time=2024-11-21T05:42:46.934Z level=INFO source=images.go:755 msg="total blobs: 12"
Nov 21 05:42:46 ollama ollama[23630]: time=2024-11-21T05:42:46.935Z level=INFO source=images.go:762 msg="total unused blobs removed: 0"
Nov 21 05:42:46 ollama ollama[23630]: time=2024-11-21T05:42:46.935Z level=INFO source=routes.go:1240 msg="Listening on [::]:11434 (version 0.4.2)"
Nov 21 05:42:46 ollama ollama[23630]: time=2024-11-21T05:42:46.935Z level=INFO source=common.go:135 msg="extracting embedded files" dir=/tmp/ollama2344306266/runners
Nov 21 05:42:47 ollama ollama[23630]: time=2024-11-21T05:42:47.028Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 cuda_v11 cuda_v12 rocm]"
Nov 21 05:42:47 ollama ollama[23630]: time=2024-11-21T05:42:47.028Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
Nov 21 05:42:47 ollama ollama[23630]: time=2024-11-21T05:42:47.209Z level=INFO source=types.go:123 msg="inference compute" id=GPU-b0fd65ee-05b6-962b-9e19-5471ea53cbf2 library=cuda variant=v12 compute=8.6 driver=12.6 name="NVIDIA GeForce RTX 3060" total="11.7 GiB" available="11.6 GiB"
Nov 21 05:42:51 ollama ollama[23630]: [GIN] 2024/11/21 - 05:42:51 | 200 |      39.851µs |       127.0.0.1 | HEAD     "/"
Nov 21 05:42:51 ollama ollama[23630]: [GIN] 2024/11/21 - 05:42:51 | 200 |   22.925724ms |       127.0.0.1 | POST     "/api/show"
Nov 21 05:42:51 ollama ollama[23630]: time=2024-11-21T05:42:51.850Z level=INFO source=sched.go:714 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff gpu=GPU-b0fd65ee-05b6-962b-9e19-5471ea53cbf2 parallel=4 available=12403474432 required="3.7 GiB"
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.002Z level=INFO source=server.go:105 msg="system memory" total="62.8 GiB" free="61.7 GiB" free_swap="8.0 GiB"
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.003Z level=INFO source=memory.go:343 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=29 layers.split="" memory.available="[11.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="3.7 GiB" memory.required.partial="3.7 GiB" memory.required.kv="896.0 MiB" memory.required.allocations="[3.7 GiB]" memory.weights.total="2.4 GiB" memory.weights.repeating="2.1 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="424.0 MiB" memory.graph.partial="570.7 MiB"
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.004Z level=INFO source=server.go:383 msg="starting llama server" cmd="/tmp/ollama2344306266/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff --ctx-size 8192 --batch-size 512 --n-gpu-layers 29 --threads 8 --parallel 4 --port 41329"
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.004Z level=INFO source=sched.go:449 msg="loaded runners" count=1
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.004Z level=INFO source=server.go:562 msg="waiting for llama runner to start responding"
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.005Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server error"
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.034Z level=INFO source=runner.go:883 msg="starting go runner"
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.034Z level=INFO source=runner.go:884 msg=system info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | cgo(gcc)" threads=8
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.034Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:41329"
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   1:                               general.type str              = model
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   5:                         general.size_label str              = 3B
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   8:                          llama.block_count u32              = 28
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  18:                          general.file_type u32              = 15
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - type  f32:   58 tensors
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - type q4_K:  168 tensors
Nov 21 05:42:52 ollama ollama[23630]: llama_model_loader: - type q6_K:   29 tensors
Nov 21 05:42:52 ollama ollama[23630]: time=2024-11-21T05:42:52.257Z level=INFO source=server.go:596 msg="waiting for server to become available" status="llm server loading model"
Nov 21 05:42:52 ollama ollama[23630]: llm_load_vocab: special tokens cache size = 256
Nov 21 05:42:52 ollama ollama[23630]: llm_load_vocab: token to piece cache size = 0.7999 MB
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: arch             = llama
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: vocab type       = BPE
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_vocab          = 128256
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_merges         = 280147
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: vocab_only       = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_ctx_train      = 131072
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_embd           = 3072
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_layer          = 28
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_head           = 24
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_head_kv        = 8
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_rot            = 128
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_swa            = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_embd_head_k    = 128
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_embd_head_v    = 128
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_gqa            = 3
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_embd_k_gqa     = 1024
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_embd_v_gqa     = 1024
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_ff             = 8192
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_expert         = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_expert_used    = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: causal attn      = 1
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: pooling type     = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: rope type        = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: rope scaling     = linear
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: freq_base_train  = 500000.0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: freq_scale_train = 1
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: n_ctx_orig_yarn  = 131072
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: rope_finetuned   = unknown
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: ssm_d_conv       = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: ssm_d_inner      = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: ssm_d_state      = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: ssm_dt_rank      = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: ssm_dt_b_c_rms   = 0
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: model type       = 3B
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: model ftype      = Q4_K - Medium
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: model params     = 3.21 B
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: LF token         = 128 'Ä'
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
Nov 21 05:42:52 ollama ollama[23630]: llm_load_print_meta: max token length = 256
Nov 21 05:42:52 ollama ollama[23630]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Nov 21 05:42:52 ollama ollama[23630]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Nov 21 05:42:52 ollama ollama[23630]: ggml_cuda_init: found 1 CUDA devices:
Nov 21 05:42:52 ollama ollama[23630]:   Device 0: NVIDIA GeForce RTX 3060, compute capability 8.6, VMM: yes
Nov 21 05:42:53 ollama ollama[23630]: llm_load_tensors: ggml ctx size =    0.24 MiB
Nov 21 05:42:53 ollama ollama[23630]: llm_load_tensors: offloading 28 repeating layers to GPU
Nov 21 05:42:53 ollama ollama[23630]: llm_load_tensors: offloading non-repeating layers to GPU
Nov 21 05:42:53 ollama ollama[23630]: llm_load_tensors: offloaded 29/29 layers to GPU
Nov 21 05:42:53 ollama ollama[23630]: llm_load_tensors:        CPU buffer size =   308.23 MiB
Nov 21 05:42:53 ollama ollama[23630]: llm_load_tensors:      CUDA0 buffer size =  1918.36 MiB
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: n_ctx      = 8192
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: n_batch    = 2048
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: n_ubatch   = 512
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: flash_attn = 0
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: freq_base  = 500000.0
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: freq_scale = 1
Nov 21 05:42:53 ollama ollama[23630]: llama_kv_cache_init:      CUDA0 KV buffer size =   896.00 MiB
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: KV self size  =  896.00 MiB, K (f16):  448.00 MiB, V (f16):  448.00 MiB
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model:  CUDA_Host  output buffer size =     2.00 MiB
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model:      CUDA0 compute buffer size =   424.00 MiB
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model:  CUDA_Host compute buffer size =    22.01 MiB
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: graph nodes  = 902
Nov 21 05:42:53 ollama ollama[23630]: llama_new_context_with_model: graph splits = 2
Nov 21 05:42:53 ollama ollama[23630]: time=2024-11-21T05:42:53.512Z level=INFO source=server.go:601 msg="llama runner started in 1.51 seconds"
Nov 21 05:42:53 ollama ollama[23630]: [GIN] 2024/11/21 - 05:42:53 | 200 |  2.169034657s |       127.0.0.1 | POST     "/api/generate"
Nov 21 05:42:56 ollama ollama[23630]: [GIN] 2024/11/21 - 05:42:56 | 200 |  346.974237ms |       127.0.0.1 | POST     "/api/chat"
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   1:                               general.type str              = model
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   5:                         general.size_label str              = 3B
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   8:                          llama.block_count u32              = 28
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  18:                          general.file_type u32              = 15
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - type  f32:   58 tensors
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - type q4_K:  168 tensors
Nov 21 05:44:21 ollama ollama[23630]: llama_model_loader: - type q6_K:   29 tensors
Nov 21 05:44:22 ollama ollama[23630]: llm_load_vocab: special tokens cache size = 256
Nov 21 05:44:22 ollama ollama[23630]: llm_load_vocab: token to piece cache size = 0.7999 MB
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: arch             = llama
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: vocab type       = BPE
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: n_vocab          = 128256
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: n_merges         = 280147
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: vocab_only       = 1
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: model type       = ?B
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: model ftype      = all F32
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: model params     = 3.21 B
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: LF token         = 128 'Ä'
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
Nov 21 05:44:22 ollama ollama[23630]: llm_load_print_meta: max token length = 256
Nov 21 05:44:22 ollama ollama[23630]: llama_model_load: vocab only - skipping tensors
```

@Pho3niX90 commented on GitHub (Nov 21, 2024):

> I'm not sure what happened in the "straight off the bat" one, but repeated G's is a symptom of exceeding the context window. The context window includes both input and output tokens, so if you've had a few rounds of conversation in which longer and longer output is requested, as it looks in the screenshot, then that could be the cause of the problem you see here. There are various ways of modifying the context window, see [here](https://github.com/ollama/ollama/blob/main/docs/faq.md#how-can-i-specify-the-context-window-size) and [here](https://github.com/ollama/ollama/issues/5965#issuecomment-2252354726).

Wouldn't I see the VRAM go up then? My VRAM usage is constantly stable, never going up or down with a prompt. See the last comment, where I gave an example of asking for a super long essay.
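
(One way to check for context overflow directly, independent of VRAM: with `"stream": false`, the final API response reports token counts, and their input + output sum should stay under the configured context size. A sketch, assuming `jq` is installed:)

```
curl -s http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [{ "role": "user", "content": "Write an extra long essay" }],
  "stream": false
}' | jq '{prompt_eval_count, eval_count}'
```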


@Pho3niX90 commented on GitHub (Nov 21, 2024):

I enabled debug logs and tried again. This was another garbage response.

Nov 21 06:14:48 ollama ollama[25689]: time=2024-11-21T06:14:48.644Z level=INFO source=server.go:601 msg="llama runner started in 1.51 seconds"
Nov 21 06:14:48 ollama ollama[25689]: time=2024-11-21T06:14:48.644Z level=DEBUG source=sched.go:462 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
Nov 21 06:14:48 ollama ollama[25689]: [GIN] 2024/11/21 - 06:14:48 | 200 |  7.316823498s |       127.0.0.1 | POST     "/api/generate"
Nov 21 06:14:48 ollama ollama[25689]: time=2024-11-21T06:14:48.644Z level=DEBUG source=sched.go:466 msg="context for request finished"
Nov 21 06:14:48 ollama ollama[25689]: time=2024-11-21T06:14:48.644Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff duration=5m0s
Nov 21 06:14:48 ollama ollama[25689]: time=2024-11-21T06:14:48.644Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff refCount=0
Nov 21 06:14:57 ollama ollama[25689]: time=2024-11-21T06:14:57.588Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
Nov 21 06:14:57 ollama ollama[25689]: time=2024-11-21T06:14:57.589Z level=DEBUG source=routes.go:1457 msg="chat request" images=0 prompt="<|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
Nov 21 06:14:57 ollama ollama[25689]: time=2024-11-21T06:14:57.590Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=26 used=0 remaining=26
Nov 21 06:15:37 ollama ollama[25689]: time=2024-11-21T06:15:37.432Z level=DEBUG source=sched.go:407 msg="context for request finished"
Nov 21 06:15:37 ollama ollama[25689]: time=2024-11-21T06:15:37.432Z level=DEBUG source=sched.go:339 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff duration=5m0s
Nov 21 06:15:37 ollama ollama[25689]: time=2024-11-21T06:15:37.432Z level=DEBUG source=sched.go:357 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff refCount=0
Nov 21 06:15:37 ollama ollama[25689]: [GIN] 2024/11/21 - 06:15:37 | 200 | 39.865711755s |       127.0.0.1 | POST     "/api/chat"
Nov 21 06:15:40 ollama ollama[25689]: time=2024-11-21T06:15:40.680Z level=DEBUG source=sched.go:575 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff
Nov 21 06:15:40 ollama ollama[25689]: time=2024-11-21T06:15:40.681Z level=DEBUG source=server.go:955 msg="new runner detected, loading model for cgo tokenization"
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: loaded meta data with 30 key-value pairs and 255 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-dde5aa3fc5ffc17176b5e8bdc82f587b24b2678c6c66101bf7da77af9f7ccdff (version GGUF V3 (latest))
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   1:                               general.type str              = model
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 3B Instruct
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   3:                           general.finetune str              = Instruct
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   4:                           general.basename str              = Llama-3.2
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   5:                         general.size_label str              = 3B
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   6:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   7:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   8:                          llama.block_count u32              = 28
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv   9:                       llama.context_length u32              = 131072
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  10:                     llama.embedding_length u32              = 3072
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  11:                  llama.feed_forward_length u32              = 8192
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  12:                 llama.attention.head_count u32              = 24
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  13:              llama.attention.head_count_kv u32              = 8
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  14:                       llama.rope.freq_base f32              = 500000.000000
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  15:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  16:                 llama.attention.key_length u32              = 128
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  17:               llama.attention.value_length u32              = 128
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  18:                          general.file_type u32              = 15
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  19:                           llama.vocab_size u32              = 128256
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  20:                 llama.rope.dimension_count u32              = 128
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - kv  29:               general.quantization_version u32              = 2
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - type  f32:   58 tensors
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - type q4_K:  168 tensors
Nov 21 06:15:40 ollama ollama[25689]: llama_model_loader: - type q6_K:   29 tensors
Nov 21 06:15:41 ollama ollama[25689]: llm_load_vocab: special tokens cache size = 256
Nov 21 06:15:41 ollama ollama[25689]: llm_load_vocab: token to piece cache size = 0.7999 MB
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: format           = GGUF V3 (latest)
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: arch             = llama
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: vocab type       = BPE
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: n_vocab          = 128256
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: n_merges         = 280147
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: vocab_only       = 1
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: model type       = ?B
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: model ftype      = all F32
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: model params     = 3.21 B
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: model size       = 1.87 GiB (5.01 BPW)
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: general.name     = Llama 3.2 3B Instruct
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: LF token         = 128 'Ä'
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: EOM token        = 128008 '<|eom_id|>'
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: EOG token        = 128008 '<|eom_id|>'
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
Nov 21 06:15:41 ollama ollama[25689]: llm_load_print_meta: max token length = 256
Nov 21 06:15:41 ollama ollama[25689]: llama_model_load: vocab only - skipping tensors
Nov 21 06:15:41 ollama ollama[25689]: time=2024-11-21T06:15:41.458Z level=DEBUG source=routes.go:1457 msg="chat request" images=0 prompt="<|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n_mode_mode_mode_mode_mode_mode opin multijo opin_mode_mode_mode_mode_mode_mode_mode opin_mode_mode_mode opin opin opin_mode opin opin_mode opin_mode opin_mode opin opin opin opin opin opin opin opin_mode vehicle_mode opin opin opin opin vehicle opin opin opin_mode vehicle_mode vehicle_mode vehicle opin opin vehicle opin vehicle opin opin opin opin vehicle opin vehicle opin vehicle opin vehicle opin vehicle opin vehicle opin vehicle opin vehicle opin vehicle vehicle opin multi_mode vehicle_mode multi_mode vehicle opin opin vehicle opin vehicle vehicle opin opin opin opin vehicle_mode multi_mode opin opin vehicle opin vehicle_mode multi opin opin vehicle_mode vehicle_mode multi vehicle_mode multi vehicle opin opin multi_mode multi vehicle opin multi vehicle opin multi vehicle opin multi vehicle opin multi_mode multi vehicle opin multi vehicle opin multi vehicle opin multi vehicle vehicle opin multi vehicle opin multi vehicle vehicle vehicle opin multi_mode multi vehicle_mode multi vehicle_mode multi vehicle_mode multi_mode multi vehicle_mode multi vehicle_mode multi vehicle_mode multi vehicle_mode multi vehicle_mode multi vehicle_mode multi multi opin opin opin opin vehicle opin multi vehicle opin multi multi opin multi vehicle opin multi multi vehicle opin multi multi vehicle opin multi multi vehicle opin multi multi multi vehicle opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi opin multi vehicle opin multi multi opin multi multi multi opin multi vehicle opin multi multi vehicle opin multi multi multi opin multi multi multi vehicle vehicle opin multi multi vehicle opin multi multi multi vehicle multi vehicle multi multi opin multi multi multi opin multi multi multi opin multi vehicle opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi vehicle opin multi multi multi vehicle opin multi multi multi multi opin multi vehicle opin multi multi vehicle opin multi multi multi opin multi multi vehicle opin multi multi multi opin multi multi multi opin multi multi multi vehicle opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin opin opin opin opin vehicle opin multi multi multi opin multi multi multi opin multi vehicle opin multi multi multi opin multi multi vehicle vehicle opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi vehicle opin multi multi multi opin multi multi multi opin multi multi vehicle multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi opin multi multi vehicle opin multi multi multi opin multi multi multi opin multi multi multi opin multi vehicle vehicle opin multi multi multi opin opin opin vehicle opin multi multi vehicle multi opin multi multi multi vehicle opin multi multi multi opin multi multi multi opin multi multi multi opin multi opin multi multi multi opin multi multi multi opin multi opin multi multi multi opin multi opin multi multi multi opin opin vehicle opin multi opin multi multi multi opin multi vehicle multi vehicle opin multi multi opin 
multi multi multi opin multi multi multi opin multi opin multi multi multi opin multi opin multi multi multi vehicle opin multi multi multi opin multi opin multi multi multi opin multi vehicle multi opin multi multi multi opin multi vehicle multi opin multi multi vehicle opin multi multi multi vehicle opin multi multi multi opin opin multi opin multi multi multi opin multi multi multi vehicle opin multi multi multi opin multi vehicle multi vehicle vehicle vehicle opin opin opin opin vehicle opin multi opin multi multi multi vehicle multi vehicle multi vehicle multi vehicle multi multi opin opin multi opin multi multi opin multi multi opin multi opin multi opin multi opin multi opin multi opin multi opin multi opin multi opin opin opin multi vehicle multi multi vehicle multi multi vehicle multi multi opin opin multi multi multi opin multi multi vehicle multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi vehicle opin multi multi multi opin multi multi vehicle multi vehicle multi vehicle multi multi vehicle multi multi opin opin multi vehicle multi multi multi opin multi multi multi opin multi multi multi multi opin multi multi multi opin multi multi multi opin multi multi vehicle multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi vehicle multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi vehicle opin opin multi opin multi multi multi multi vehicle multi multi multi opin opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin opin multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi vehicle multi multi multi opin multi multi multi opin opin multi vehicle multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi opin multi multi multi opin multi vehicle multi multi multi opin multi multi multi vehicle vehicle vehicle opin opin vehicle opin multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi vehicle multi multi multi opin multi multi multi opin opin multi vehicle multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi multi multi opin multi opin multi multi multi opin multi vehicle multi multi multi opin multi multi multi opin multi multi multi<|eot_id|><|start_header_id|>user<|end_header_id|>\n\ntest<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
```

@Pho3niX90 commented on GitHub (Nov 21, 2024):

It seems this might be related to the drivers, and perhaps the riser cable as well.

  • The installed driver was 565, which turned out to be a development-branch release; this is being changed to 550.
  • The riser is also negotiating at Gen4 x8 instead of x16. A new one will be delivered today, although I would expect that to hurt only performance, not correctness.

I will report back after both are changed.
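
(A quick way to confirm which driver branch is actually active, plus a sketch of moving to the 550 branch on Ubuntu — this assumes the nvidia-driver-550 package is available for your release:)

```
# Driver branch the kernel is currently running
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Switch to the 550 branch on Ubuntu (reboot afterwards so the
# matching kernel modules get loaded)
sudo apt install nvidia-driver-550
```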

@rick-github commented on GitHub (Nov 21, 2024):

VRAM is allocated when the model is loaded.

Your last prompt had a lot of context that looks like nonsense: "_mode_mode_mode_mode_mode_mode opin multijo opin_mod ...". It looks like the model's response to an earlier prompt of "hello", after which you sent another prompt of "test". Perhaps the model itself is damaged; have you tried re-pulling it?
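
(For reference, a re-pull would look like this, using the llama3.2 tag from the report above:)

```
ollama rm llama3.2     # remove the local copy
ollama pull llama3.2   # fetch a fresh copy (layers are sha256-addressed)
```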

@Pho3niX90 commented on GitHub (Nov 22, 2024):

So I am utterly confused.

I have tried:

  • Purging all drivers and reinstalling the most commonly used/mentioned driver (535)
  • Reinstalling ollama
  • Deleting the models and re-pulling them a couple of times
  • Switching between CUDA 11 and 12

All without success.

I cannot seem to find the server logs mentioned here https://github.com/ollama/ollama/issues/7766#issuecomment-2489465946
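
(On a default Linux install the server runs under systemd, so the logs live in the journal — a minimal sketch, assuming the standard service name:)

```
# View recent server logs
journalctl -e -u ollama

# Enable debug logging: add Environment="OLLAMA_DEBUG=1" under [Service],
# then restart and follow the logs
sudo systemctl edit ollama
sudo systemctl restart ollama
journalctl -f -u ollama
```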

One thing I noticed: there is a CUDA version, a driver version, and a compute version. I am assuming all of these need to match 100%?
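
(They don't need to match exactly: nvidia-smi reports the driver version plus the highest CUDA version that driver supports, which only needs to be at least the CUDA version the application was built against, and compute capability is a fixed hardware property — 8.6 for a 3060. A sketch for checking all three; the compute_cap query field assumes a reasonably recent driver:)

```
nvidia-smi          # driver version and max supported CUDA version
nvcc --version      # CUDA toolkit version, if the toolkit is installed
nvidia-smi --query-gpu=compute_cap --format=csv,noheader   # hardware compute capability
```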

@Pho3niX90 commented on GitHub (Nov 22, 2024):

Small update,

The issue is possibly resolved. The new riser cable never arrived yesterday; that being said, I decided to hard-set the current one's link specs in the BIOS instead of having the BIOS auto-negotiate.

  • Max speed: set to Gen3.
  • Lanes: left to auto-negotiate; the cable supports x16 and negotiated to that.

All seems to be fine at the moment. I am doing further tests just to make sure this isn't just a "lucky" session.
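
(A quick way to confirm what the link actually negotiated after the BIOS change, without going back into the BIOS — a sketch:)

```
# Negotiated vs. maximum PCIe generation and lane width, per the driver
nvidia-smi --query-gpu=pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv

# Cross-check from the kernel side (LnkSta = negotiated, LnkCap = capability);
# "10de:" filters for NVIDIA devices
sudo lspci -vv -d 10de: | grep -E 'LnkCap|LnkSta'
```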

@rick-github commented on GitHub (Nov 22, 2024):

I was going to suggest it might be GPU related. Another way to test would be to use a different LLM app. LM Studio has an AppImage which is easy to deploy. It uses the same underlying technology (llama.cpp) with different defaults, and its own library with equivalent models, so it would be a simple test. vLLM doesn't use llama.cpp and has a Docker image, so if you have Docker installed it's simple to give it a spin.
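
(A minimal sketch of the vLLM route — the model ID is an assumption, and the Llama weights on Hugging Face are gated, so a token with access is required:)

```
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN=<your token> \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.2-3B-Instruct
```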

@Pho3niX90 commented on GitHub (Nov 23, 2024):

> I was going to suggest it might be GPU related. Another way to test would be to use a different LLM app. LM Studio has an AppImage which is easy to deploy. It uses the same underlying technology (llama.cpp) with different defaults, and its own library with equivalent models, so it would be a simple test. vLLM doesn't use llama.cpp and has a Docker image, so if you have Docker installed it's simple to give it a spin.

Yeah, I suppose this was niche, since it wasn't technically the GPU but rather the riser cable not being able to handle the negotiated speeds.

But this seems to be solved. In hindsight, I should have run a stability test first.
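
(For anyone hitting the same symptom, a crude stability test along those lines — the prompt, the run count, and the "ten or more consecutive G's" heuristic are all arbitrary choices:)

```
# Hammer the server with long generations and flag any garbage output
for i in $(seq 1 20); do
  curl -s http://localhost:11434/api/generate \
    -d '{"model": "llama3.2", "prompt": "Write a very long essay on the history of computing.", "stream": false}' \
    | grep -qE 'G{10,}' && echo "garbage detected on run $i"
done
```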
