[GH-ISSUE #5668] Glm4 in ollama v0.2.3 still returns gibberish G's #50045

Closed
opened 2026-04-28 13:56:05 -05:00 by GiteaMirror · 46 comments
Owner

Originally created by @loveyume520 on GitHub (Jul 13, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5668

What is the issue?

After running for a while, the model still returns gibberish:

[12:59:39] [INFO] [Part of Speech Determination] [Fixed] JSON string: Since you did not provide specific content text, I cannot perform actual word frequency analysis, context analysis, etc. Therefore, I will provide a hypothetical example to demonstrate how to make judgments according to the steps.
{
  "person": "Yes",
  "explanation": [
    {
      "step": 1,
      "detail": "The word appears frequently in the text"
    },
    {
      "step": 2,
      "detail": "The word often appears in sentence structures as a subject or object, such as 'Jack is playing games' where 'Jack' is the subject"
    },
    {
      "step": 3,
      "detail": "The word is not written in Katakana, which does not match the characteristics of a proper noun"
    },
    {
      "step": 4,
      "detail": "Through dependency syntax analysis, it is determined that the word is used as a noun, serving as a subject or object"
    },
    {
      "step": 5,
      "detail": "There is a clear role behavior description in the text, such as 'Jack jumps high' where 'Jack' is the subject"
    },
    {
      "step": 6,
      "detail": "The word appears in dialogue, such as 'A: Hello, I am Jack. B: Hi, Jack!'"
    },
    {
      "step": 7,
      "detail": "The usage of the word remains consistent across different paragraphs and scenes"
    },
    {
      "step": 8,
     !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!}
[12:59:39] [WARNING] [Part of Speech Determination] Subtask execution failed, will retry later ... Expecting value: line 1 column 1 (char 0)
[12:59:39] [INFO] About to start executing [Semantic Analysis] ...
[12:59:44] [INFO] [Semantic Analysis] [Raw] LLM response: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[12:59:44] [INFO] [Semantic Analysis] [Fixed] JSON string: !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!}
 . . .
Press any key to continue . . .
(base) PS C:\Users\account\Desktop> ollama --version
ollama version is 0.2.3

Then I try posting a request and it responds:

{
    "id": "chatcmpl-991",
    "object": "chat.completion",
    "created": 1720861217,
    "model": "glm-4-9b-chat",
    "system_fingerprint": "fp_ollama",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"
            },
            "finish_reason": null
        }
    ],
    "usage": {
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "total_tokens": 0
    }
}
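
For reference, a single request like the ones in the server log below can be reproduced against Ollama's OpenAI-compatible endpoint. A minimal sketch (host, port, and model name are taken from this report; the prompt is only an example):

$ curl http://127.0.0.1:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "glm-4-9b-chat", "messages": [{"role": "user", "content": "Why is the sky blue?"}]}'

When the model is healthy, the reply text appears in choices[0].message.content; once the bug triggers, the same call returns the string of G's shown above.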

Here's the ollama serve log:

2024/07/13 12:58:16 routes.go:940: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://127.0.0.1:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:C:\\Users\\Ototsuyume\\.ollama\\models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR:C:\\Users\\Ototsuyume\\AppData\\Local\\Programs\\Ollama\\ollama_runners OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-07-13T12:58:16.286 level=INFO source=images.go:760 msg="total blobs: 17"
time=2024-07-13T12:58:16.287 level=INFO source=images.go:767 msg="total unused blobs removed: 0"
time=2024-07-13T12:58:16.288 level=INFO source=routes.go:987 msg="Listening on 127.0.0.1:11434 (version 0.2.3)"
time=2024-07-13T12:58:16.289 level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cpu_avx cpu_avx2 cuda_v11.3 rocm_v6.1]"
time=2024-07-13T12:58:16.289 level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-07-13T12:58:16.631 level=INFO source=types.go:105 msg="inference compute" id=0 library=rocm compute=gfx1030 driver=5.7 name="AMD Radeon RX 6800 XT" total="16.0 GiB" available="15.9 GiB"
time=2024-07-13T12:58:30.651 level=INFO source=sched.go:179 msg="one or more GPUs detected that are unable to accurately report free memory - disabling default concurrency"
time=2024-07-13T12:58:30.663 level=INFO source=sched.go:701 msg="new model will fit in available VRAM in single GPU, loading" model=C:\Users\Ototsuyume\.ollama\models\blobs\sha256-eb30fa5273749385c6a42b8df12a692ea3ab552fbf8883ce87af9938f69e9f4c gpu=0 parallel=4 available=17028874240 required="6.9 GiB"
time=2024-07-13T12:58:30.664 level=INFO source=memory.go:309 msg="offload to rocm" layers.requested=-1 layers.model=41 layers.offload=41 layers.split="" memory.available="[15.9 GiB]" memory.required.full="6.9 GiB" memory.required.partial="6.9 GiB" memory.required.kv="320.0 MiB" memory.required.allocations="[6.9 GiB]" memory.weights.total="5.3 GiB" memory.weights.repeating="4.9 GiB" memory.weights.nonrepeating="485.6 MiB" memory.graph.full="561.0 MiB" memory.graph.partial="789.6 MiB"
time=2024-07-13T12:58:30.670 level=INFO source=server.go:383 msg="starting llama server" cmd="C:\\Users\\Ototsuyume\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\rocm_v6.1\\ollama_llama_server.exe --model C:\\Users\\Ototsuyume\\.ollama\\models\\blobs\\sha256-eb30fa5273749385c6a42b8df12a692ea3ab552fbf8883ce87af9938f69e9f4c --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 4 --port 10596"
time=2024-07-13T12:58:30.694 level=INFO source=sched.go:437 msg="loaded runners" count=1
time=2024-07-13T12:58:30.694 level=INFO source=server.go:571 msg="waiting for llama runner to start responding"
time=2024-07-13T12:58:30.695 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3337 commit="a8db2a9c" tid="3896" timestamp=1720861110
INFO [wmain] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 0 | " tid="3896" timestamp=1720861110 total_threads=12
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="10596" tid="3896" timestamp=1720861110
llama_model_loader: loaded meta data with 24 key-value pairs and 283 tensors from C:\Users\Ototsuyume\.ollama\models\blobs\sha256-eb30fa5273749385c6a42b8df12a692ea3ab552fbf8883ce87af9938f69e9f4c (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = chatglm
llama_model_loader: - kv   1:                               general.name str              = glm-4-9b-chat
llama_model_loader: - kv   2:                     chatglm.context_length u32              = 131072
llama_model_loader: - kv   3:                   chatglm.embedding_length u32              = 4096
llama_model_loader: - kv   4:                chatglm.feed_forward_length u32              = 13696
llama_model_loader: - kv   5:                        chatglm.block_count u32              = 40
llama_model_loader: - kv   6:               chatglm.attention.head_count u32              = 32
llama_model_loader: - kv   7:            chatglm.attention.head_count_kv u32              = 2
llama_model_loader: - kv   8:   chatglm.attention.layer_norm_rms_epsilon f32              = 0.000000
llama_model_loader: - kv   9:                          general.file_type u32              = 15
llama_model_loader: - kv  10:               chatglm.rope.dimension_count u32              = 64
llama_model_loader: - kv  11:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  12:                     chatglm.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  14:                         tokenizer.ggml.pre str              = chatglm-bpe
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,151552]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,151552]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,151073]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 151329
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 151329
llama_model_loader: - kv  20:                tokenizer.ggml.eot_token_id u32              = 151336
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 151329
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = [gMASK]<sop>{% for item in messages %...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_0:   20 tensors
llama_model_loader: - type q8_0:   20 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   40 tensors
llama_model_loader: - type q6_K:    1 tensors
time=2024-07-13T12:58:30.961 level=INFO source=server.go:612 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 223
llm_load_vocab: token to piece cache size = 0.9732 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = chatglm
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 151552
llm_load_print_meta: n_merges         = 151073
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 2
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 16
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.6e-07
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13696
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 9B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 9.40 B
llm_load_print_meta: model size       = 5.82 GiB (5.31 BPW)
llm_load_print_meta: general.name     = glm-4-9b-chat
llm_load_print_meta: EOS token        = 151329 '<|endoftext|>'
llm_load_print_meta: UNK token        = 151329 '<|endoftext|>'
llm_load_print_meta: PAD token        = 151329 '<|endoftext|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 151336 '<|user|>'
llm_load_print_meta: max token length = 1024
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800 XT, compute capability 10.3, VMM: no
llm_load_tensors: ggml ctx size =    0.28 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 41/41 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  5622.60 MiB
llm_load_tensors:        CPU buffer size =   333.00 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =   320.00 MiB
llama_new_context_with_model: KV self size  =  320.00 MiB, K (f16):  160.00 MiB, V (f16):  160.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     2.38 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   561.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1606
llama_new_context_with_model: graph splits = 2
INFO [wmain] model loaded | tid="3896" timestamp=1720861115
time=2024-07-13T12:58:35.360 level=INFO source=server.go:617 msg="llama runner started in 4.67 seconds"
[GIN] 2024/07/13 - 12:58:35 | 200 |    5.6746358s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:04 | 200 |    6.5357628s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:06 | 200 |    8.0197978s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:07 | 200 |    9.3109135s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:08 | 200 |   10.3975869s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:09 | 200 |    5.0742796s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:11 | 200 |    5.4391138s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:11 | 200 |    3.1233214s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:12 | 200 |    5.0868503s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:13 | 200 |    3.4118485s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:20 | 200 |    8.9596744s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:27 | 200 |    433.1813ms |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:31 | 200 |   10.5772594s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:39 | 200 |    8.1706446s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:44 | 200 |    4.5071605s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:44 | 200 |    4.4032214s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:44 | 200 |    4.7316797s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:44 | 200 |    4.7339705s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:45 | 200 |    1.5329916s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:47 | 200 |    2.3864411s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:48 | 200 |    2.4289955s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:48 | 200 |    2.4288198s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:48 | 200 |    2.4720815s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:49 | 200 |    1.5614626s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:51 | 200 |    2.3866176s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:52 | 200 |    2.4702948s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:52 | 200 |    2.4698489s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:52 | 200 |    2.4714931s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:53 | 200 |    1.5666652s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:55 | 200 |    1.5533587s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:55 | 200 |    1.5873278s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:55 | 200 |    1.5888237s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:55 | 200 |    1.6187491s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:55 | 200 |    874.0734ms |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:57 | 200 |    1.1203497s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:57 | 200 |    1.1544978s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:57 | 200 |    1.1558197s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:57 | 200 |    1.1856924s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:57 | 200 |    871.3637ms |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:59 | 200 |     1.122985s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:59 | 200 |    1.1230159s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:59 | 200 |    1.1563781s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:59 | 200 |    1.1845468s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 12:59:59 | 200 |    867.6324ms |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 13:00:04 | 200 |    4.1391989s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 13:00:04 | 200 |    4.1805937s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 13:00:04 | 200 |    4.1799448s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 13:00:04 | 200 |     4.216249s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 13:00:05 | 200 |    1.0996839s |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 13:00:17 | 200 |    980.3081ms |       127.0.0.1 | POST     "/v1/chat/completions"
[GIN] 2024/07/13 - 13:00:51 | 200 |            0s |       127.0.0.1 | GET      "/api/version"

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

0.2.3

GiteaMirror added the bug label 2026-04-28 13:56:05 -05:00

@somnifex commented on GitHub (Jul 13, 2024):

Same problem

@lalahaohaizi commented on GitHub (Jul 13, 2024):

Same problem; after processing a few sentences, it keeps outputting gibberish.
image: https://github.com/user-attachments/assets/fc16e594-f843-4c78-9607-1de8f8b199d8

@ototsu commented on GitHub (Jul 14, 2024):

The same GLM4 issue occurs in version 0.2.5, but the model runs normally in llama.cpp.

@arkerwu commented on GitHub (Jul 15, 2024):

Same issue

@HuChundong commented on GitHub (Jul 15, 2024):

+1

@DanielusG commented on GitHub (Jul 17, 2024):

The q8_0 version doesn't have this problem, so I assume it is a quantization problem.

@ototsu commented on GitHub (Jul 17, 2024):

@DanielusG I've tried running q8 and q4 quantization on GLM4 on Ollama, and both resulted in this issue, but it didn't occur on llama.cpp. It seems like the problem isn't related to quantization.

@DanielusG commented on GitHub (Jul 17, 2024):

@DanielusG I've tried running q8 and q4 quantization on GLM4 on Ollama, and both resulted in this issue, but it didn't occur on llama.cpp. It seems like the problem isn't related to quantization.

@ototsu That is really strange; with Ollama 0.2.5 the q8_0 works well on my PC, and I have used it extensively.

@DanielusG commented on GitHub (Jul 17, 2024):

With the latest version of llama-server, when I load the model I get "The chat template that comes with this model is not yet supported, falling back to chatml." This could be the cause of the error.

@Speedway1 commented on GitHub (Jul 20, 2024):

The GGGG issue is caused by a fault in copying between more than one AMD GPU. Some quantised versions run, probably because the model fits into a single GPU's memory, which is why they work. You can fiddle with the context window and get a smaller model to run within a single GPU's VRAM; however, if you extend the context window so that it needs more than one GPU, it fails.

There was recently a fix that implemented the essential llama.cpp flag for AMD builds (GGML_CUDA_NO_PEER_COPY), but it seems that there are other AMD issues with memory copying between GPUs as well.

@somnifex commented on GitHub (Jul 21, 2024):

The GGGG issue is because of a fault with the copying between more than 1 AMD GPU. The fact that some quantised versions run, this is probably because the model is fitting into a single GPU memory so that's why it works. You can fiddle with the context window and get a smaller model to run in a single GPU VRAM. However if you extend the context window so that it needs more than one GPU then it fails.

There was recently a fix that implemented the essential llama.cpp flag for AMD builds (GGML_CUDA_NO_PEER_COPY) but it seems that there are other AMD issues with memory copying between GPUs as well.

This may not be solely due to AMD. I've reproduced this issue using two NVIDIA GPUs (a 3090 and a Titan X). However, I agree that it's a problem caused by cross-GPU copying, and the behavior is very similar.

@wxfvf commented on GitHub (Jul 23, 2024):

I encountered the same issue on CodeGeex4 with dual 4090 GPUs, using Ollama version 0.2.5. It runs normally at first after loading, but problems occur after a while.

@leizhu1989 commented on GitHub (Jul 24, 2024):

I pulled the official image and ran it. At first, I could use the Python asynchronous client interface to request inference normally. After a period of time, when I tried to request again, I would see "GGGGGGGGG". I ran "ollama run glm4:9b-chat-q8_0" in the image and requests were answered normally again.
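
For anyone trying to reproduce this with the official Docker image, a rough sketch of re-running the model inside the container (the container name "ollama" is an assumption; substitute your own):

$ docker exec -it ollama ollama run glm4:9b-chat-q8_0
>>> Why is the sky blue?
...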

@wszgrcy commented on GitHub (Aug 1, 2024):

Same issue: some content is returned normally, some content is GGGG.

@AeneasZhu commented on GitHub (Aug 13, 2024):

Any improvements in the future? The GGGG problem seems to be more serious in 0.3.5. @rick-github

@rick-github commented on GitHub (Aug 13, 2024):

Recent server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) would help in debugging.
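
For reporters on a Linux systemd install, a rough sketch of how those logs can be collected (commands per the troubleshooting guide linked above; on Windows, the server log is written under the local AppData Ollama folder instead):

$ journalctl -u ollama --no-pager > ollama-server.log    # recent service logs
$ OLLAMA_DEBUG=1 ollama serve                            # or run in the foreground with debug logging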

@pdevine commented on GitHub (Sep 12, 2024):

I'm going to go ahead and close this out. Make sure you have the most recent version of ollama and the model. We can reopen if people are still having issues, but I couldn't repro at all.

@MDev-eng commented on GitHub (Dec 8, 2024):

The problem of Ollama outputting a string of "G"s is still present with the latest Ollama version, 0.5.1.

My setup is: a Proxmox hypervisor environment; Ollama 0.5.1 (latest as of today) running as a system service in a VM; Open-WebUI 0.4.8 (latest as of today) running in a container in another VM; two RTX 3060 12GB GPUs (with Proxmox passthrough to access them from the Ollama VM).

Model used is llama3.2-vision:latest

Use case is: start a new chat in open-webui, select llama3.2-vision:latest, leave all OpenWebUI default settings (including the context length which is 2048 by default), then load an image in the chat and ask "Describe the image".

SETUP 1: Dual GPUs, both used
What happens (as seen with nvtop and nvidia-smi) is that, as soon as the first question is entered, Ollama detects that there are two 12 GB GPUs, and spontaneously loads approx 8.5 GB of the LLM model on the first GPU, and 5 GB on the second. Then, what happens may slightly vary.

Sometimes (very infrequently), the first question is answered properly, but then, if a second question is placed, the answer is "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG".

Other times, already the first answer is "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG".

In both scenarios, anyway, once Ollama starts answering "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG", ALL subsequent questions are answered with "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG". It never goes back to normal, and you never get a normal answer again.

Even if a new chat is started in Open-WebUI, all answers (even the first one) are always "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG". That is, it is not an Open-WebUI problem that can be reset by acting on Open-WebUI.

This suggests that Ollama server has entered some corrupt internal state.

In some cases, it even happens that the "ollama" process goes to 100% CPU and never comes back down.

The only way to restore normal behavior (both for the GGGGG problem and for the Ollama process going crazy) is to stop and restart the Ollama service.

SETUP 2: Dual GPUs, but Ollama restricted to using only one
If instead I force Ollama to use only one GPU, by specifying "CUDA_VISIBLE_DEVICES=0" in /etc/systemd/system/ollama.service before starting the Ollama service, then Ollama loads the entire model (about 11.3 GB) into the 12 GB of VRAM of the first GPU, and the "GGGGGGGG...." problem no longer occurs. But I lose the possibility of using both GPUs to load larger models or to expand the context size.
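
For anyone applying the same single-GPU workaround, a sketch of setting the variable through a systemd override rather than editing the unit file directly (assuming the standard ollama.service install):

$ sudo systemctl edit ollama.service
# in the override that opens, add:
[Service]
Environment="CUDA_VISIBLE_DEVICES=0"
$ sudo systemctl restart ollama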

I think there is a problem with how Ollama allocates LLMs on multiple GPUs.

This is an important use case, especially for educational and non-corporate use, because not every LLM can fit on a single GPU of affordable cost, but in some cases it can fit on a combination of two or three cheap GPUs, which significantly expands the possibilities of self-hosted experimentation with LLMs.

@MDev-eng commented on GitHub (Dec 8, 2024):

Another point is that with another LLM, qwen2.5-coder:32b, which definitely does benefit from the multiple GPUs (Ollama loads 9.9 GB on the first GPU and 10.1 GB on the second), the GGGGGGG... issue does not occur.

The problem only appears when the model is llama3.2-vision:latest, although this model is smaller than qwen2.5-coder:32b.

I think that Ollama should be able to run whatever LLM is chosen, correctly and automatically, on whatever underlying set of available GPUs, provided that their VRAM is sufficient, of course. No "GGGGG..."s should ever be answered, and the Ollama server process should never go crazy at 100% CPU.

Even if the above occurs with a certain LLM and not with another, the LLM should not be blamed for these bugs. Ollama should behave properly, or issue an explicit error diagnostic, but not go randomly crazy.

@rick-github commented on GitHub (Dec 8, 2024):

Recent server logs would help in debugging.

As Patrick pointed out, this issue is difficult to replicate. There has been speculation up-thread about causes, but without logs and detailed information about the environment and workload, there will be no progress on this issue.

@leizhu1989 commented on GitHub (Dec 9, 2024):

With the 450 driver version on an Nvidia T4, after asking the large model several questions I encountered GGGGG issues. However, with the 470 driver version, I haven't encountered any issues over hundreds of requests.
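
For anyone comparing setups, the driver version in use can be read directly from nvidia-smi, e.g.:

$ nvidia-smi --query-gpu=name,driver_version --format=csv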

@MDev-eng commented on GitHub (Dec 10, 2024):

Hello, in my setup it is the contrary: it is reproducible 99% of the time. The times it works as intended are the exception rather than the rule.

Ollama log attached.

The usage pattern is:

start with both ollama server down and open-webui down, and the two GPUs VRAM empty.

start ollama server (installed as a service in a dedicated VM). Startup is complete at 05.56.41.

start open-webui (Docker container in separate VM)

open the open-webui page,

authenticate

load an image in open-webui

place the question "How many question marks are there in the image?"

ollama processes the request, and spontaneously decides to load the model split across the two GPUs: 8.4 GB on the first and 3.75 on the second.

a few seconds elapse while the GPUs are processing

then, on the Open-WebUI side, the answer given is "GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG" (the string does not appear in a single shot; rather, the "G"s appear gradually, like normal token generation)

the string also appears in the ollama log, but I think it appears not as output from the ollama server but rather as part of the body of a POST request issued by Open-WebUI in which the chat history is re-included as context (thus the GGGGGG... string is present)

the last ollama log message during processing, before the "GGG.."s begin to appear, seems to be normal and is at 06.01.55:

level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=22 used=0 remaining=22

and then at 06.02.01 the "GGG..."s start to appear, and the ollama log message while that happens is

level=DEBUG source=server.go:836 msg="prediction aborted, token repeat limit reached"

NOTE: In the RARE cases in which the "GGGG..." thing does not happen, the above line does NOT appear, and that is basically the only difference in the ollama log between a successful and an unsuccessful test execution.
So I think the condition to be investigated is "prediction aborted, token repeat limit reached", or whatever can result in that message.

Note that after that message there are about 30 more log lines (full detail in attached log) but they are triggered by further requests done by open-webui perhaps attempting to resume the chat by automatically repeating a GET /api/tags and POST /api/chat to give conversation context to ollama

Anyway, if the "GGG.." thing happened, then at this point ollama server process has almost always gone to 100% CPUs and stays there forever unless explicitly killed. (In very rare cases when the "GGG" thing happens, the ollama server does not go to 100% CPU). On the contrary, when the "GGG" thing does not happen, ollama server always goes down to 0% waiting for new requests as expected.

4-chunk.txt (https://github.com/user-attachments/files/18073090/4-chunk.txt)

@rick-github commented on GitHub (Dec 10, 2024):

What's the output of `nvidia-smi` after the model is loaded?

@rick-github commented on GitHub (Dec 10, 2024):

Does the model respond properly if you ask it a question (Why is the sky blue?) before you upload an image?

@MDev-eng commented on GitHub (Dec 10, 2024):

![image](https://github.com/user-attachments/assets/bfc5a2bc-3286-47cc-9f30-e5c08a20ab06)

@MDev-eng commented on GitHub (Dec 10, 2024):

![image](https://github.com/user-attachments/assets/a38663b9-3bc7-4168-abca-eeb85f066f9e)

@MDev-eng commented on GitHub (Dec 10, 2024):

Normally this model answers correctly when used with text questions.

Problems occur when images are uploaded and involved in the chat.

In the example, a first text-only question is answered properly. Then an image is loaded and, as occasionally happens, an answer is given (albeit an incorrect one). Then another image is loaded, and the system enters the corrupt state. From then on, all questions (including text-only ones) are either not answered at all (timeout) or answered the same way: "GGGGGGGGGGGGGGGGGGG...."

And usually, once this state is reached, the "ollama" process stays at 100% CPU until killed.

@rick-github commented on GitHub (Dec 10, 2024):

Logs from this session?

@MDev-eng commented on GitHub (Dec 11, 2024):

Here it is.
As usual, there is a "prediction aborted, token repeat limit reached" message when things go crazy.

[bigsession-log.txt](https://github.com/user-attachments/files/18097043/bigsession-log.txt)

@rick-github commented on GitHub (Dec 11, 2024):

Does the behaviour change if you do the same actions through the CLI? e.g.:

```console
$ ollama run llama3.2-vision
>>> Why is the sky blue?
....
>>> How many question marks are in this image? ./question_marks.png
...
>>> Describe this scene.  ./country_road.png
...
```

@MDev-eng commented on GitHub (Dec 22, 2024):

The behavior does not change. Whether it is answering the question about the first picture or the second, it outputs GGGG's, the ollama process goes to 100% CPU, stays there forever, and must be killed manually.
So the problem is not caused by open-webui.
Below is the console-based chat session; the log of the session is attached. In the log there is the usual message "prediction aborted, token repeat limit reached" shortly before the GGGG's begin appearing.

![image](https://github.com/user-attachments/assets/c5bd2380-680f-40ab-acb5-3536569e25a0)
[console-based-session-ollama-log.txt](https://github.com/user-attachments/files/18222877/console-based-session-ollama-log.txt)

@MDev-eng commented on GitHub (Dec 22, 2024):

The above log is from ollama 0.5.1.
I upgraded to the latest ollama 0.5.4 and the behavior is unchanged.

So based on what we have seen so far, it seems the problem is:

1- inherent to ollama, not caused by open-webui

2- present even with the latest ollama version (0.5.4 at the time of writing), with the exact same message in the ollama log shortly before the G's begin appearing

3- reproducible when all of the following hold: the model is llama3.2-vision, the question is about a picture, and the hardware has two GPUs across which ollama spreads the LLM to take advantage of the combined memory.

@rick-github commented on GitHub (Dec 22, 2024):

Would it be possible for you to try older versions of ollama to see if this is the result of a version change? 0.4.0 is the oldest that supports llama3.2-vision. There's another [open issue](https://github.com/ollama/ollama/issues/8188) where the user also has a multi-GPU setup and sees token generation issues starting with 0.5.0.

@MDev-eng commented on GitHub (Dec 26, 2024):

Installed Ollama 0.4.0, repeated the test, same behavior (it outputs GGGG's as soon as I ask a question about the image).

As usual, in the log there is the message
msg="prediction aborted, token repeat limit reached"
just before it starts outputting gibberish G's.

@rick-github commented on GitHub (Dec 26, 2024):

Thanks. Just to clarify, "prediction aborted" is a symptom, not an indication of the cause. What's happening is that the runner is going off the rails and generating a run of the same token (G, in this case). The server detects this sequence, determines that the runner has got into a bad state, and stops listening to it. It doesn't kill the runner, because the act of no longer listening is supposed to reset it; that doesn't appear to be working here, perhaps in part due to the switch to Go runners in 0.4.0.

What we have to determine is why the runner gets into this state. Previous occurrences of this were related to the prompt + output tokens exceeding the context window: since generation is a feedback loop, anything that disrupts the feedback (such as running out of token space) can cause the response to lose coherence and start generating rubbish.

I've tried testing this locally using images clipped from your screenshots and it works fine for me, but it's worth testing in your environment. Do the same CLI test as before, but set a large context window at the start:

```console
$ ollama run llama3.2-vision
>>> /set parameter num_ctx 16384
>>> Why is the sky blue?
....
>>> How many question marks are in this image? ./smileys.jpg
...
>>> Describe this scene.  ./country_road.png
```
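
To make the "prediction aborted, token repeat limit reached" message above more concrete, here is a minimal sketch of the kind of check a server can run over the generated stream. This is not ollama's actual code; the names `tokenRepeatLimit` and `detectRepeat` and the threshold value are made up for illustration. The only point is that once the same token has been emitted too many times in a row, the server gives up on the prediction.

```go
package main

import "fmt"

// tokenRepeatLimit is a hypothetical threshold; the real limit used by the
// server may differ.
const tokenRepeatLimit = 30

// detectRepeat reports whether the most recent token has been repeated
// tokenRepeatLimit times in a row. It is a toy stand-in for the server-side
// check that aborts a runaway prediction.
func detectRepeat(tokens []string) bool {
	if len(tokens) < tokenRepeatLimit {
		return false
	}
	last := tokens[len(tokens)-1]
	for _, t := range tokens[len(tokens)-tokenRepeatLimit:] {
		if t != last {
			return false
		}
	}
	return true
}

func main() {
	var generated []string
	for i := 0; i < 100; i++ {
		// A broken runner keeps emitting the same token ("G") forever.
		generated = append(generated, "G")
		if detectRepeat(generated) {
			fmt.Println("prediction aborted, token repeat limit reached")
			return
		}
	}
}
```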

@MDev-eng commented on GitHub (Dec 26, 2024):

![image](https://github.com/user-attachments/assets/4af16310-21bc-4b5a-bbc8-d7c5d0c53121)

With a context window of 16K the behavior is even worse: the server goes crazy not at the second question (the one about a picture) but already at the first, non-picture-related question.

@MDev-eng commented on GitHub (Dec 26, 2024):

There is one difference, though: the "ollama" process did not go to 100% CPU. With the previous setup (default context window), the output of gibberish G's was invariably accompanied by the ollama process sitting at 100% CPU. With the 16K context window that no longer seems to be the case.

@MDev-eng commented on GitHub (Dec 26, 2024):

The same happens with ctx size = 32768.

With ctx size = 65536, still no joy, but something interesting happens. First, there is a delay of about 20 seconds before any output begins (normally the delay is only 3-4 seconds). During this time, only 2 MB of the model were loaded on the second GPU and nothing on the first (normally, roughly half the model gets loaded on each GPU). Then gibberish G's begin to appear, but very slowly; while they appear, the ollama process sits at 175% CPU, and when they stop appearing, I noticed that about 11 GB were loaded on EACH GPU. Normally the 11B model gets split almost evenly between the two GPUs (for example 5 GB on the first and 7 GB on the second), so I don't know where these 11 + 11 = 22 GB come from, and I wonder why this happens just because I set the context size to 64K.
Finally, with the 64K context, the ollama process is left at 100% CPU after the gibberish G's have stopped appearing (whereas with 32K it went back to the idle state at the end of the test).

@MDev-eng commented on GitHub (Dec 26, 2024):

With 1K and 4K context size, behavior is like with 16K or 32K.

@rick-github commented on GitHub (Dec 27, 2024):

Thanks for trying other context sizes.

Just another clarification: the CPU going to 100% during GPU inference is expected. The synchronization mechanism between the CPU and the GPU(s) is a busy wait. The processors use a bit of shared memory to communicate, so when it comes time to perform an inference, the CPU sends the commands to the GPU and then spins on the shared memory, waiting for the GPU to say 'yes, finished that command, what next?'; the CPU then sends another command and goes back to spinning, and so on, until the inference is complete.
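
As a purely conceptual illustration (this loop lives inside the GPU driver's synchronization path, not in ollama's own code, as clarified later in the thread), a busy wait on a shared completion flag looks roughly like the sketch below, which is why one CPU core reads as 100% busy while the GPU is doing the work:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

func main() {
	// done stands in for the shared memory word the GPU flips when it
	// finishes a command.
	var done atomic.Bool

	// Simulate the GPU finishing its work after 50 ms.
	go func() {
		time.Sleep(50 * time.Millisecond)
		done.Store(true)
	}()

	// Busy wait: the CPU re-reads the flag as fast as it can, so this core
	// sits at 100% until the GPU signals completion.
	spins := 0
	for !done.Load() {
		spins++
	}
	fmt.Printf("GPU done after %d spins\n", spins)
}
```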

I'm surprised that varying the context size causes immediate breakdown. Would it be possible for you to add the logs from the 16K and 64K experiments?

@MDev-eng commented on GitHub (Jan 6, 2025):

> Thanks for trying other context sizes.
>
> Just another clarification: the CPU going to 100% during GPU inference is expected. The synchronization mechanism between the CPU and the GPU(s) is a busy wait. The processors use a bit of shared memory to communicate, so when it comes time to perform an inference, the CPU sends the commands to the GPU and then spins on the shared memory, waiting for the GPU to say 'yes, finished that command, what next?'; the CPU then sends another command and goes back to spinning, and so on, until the inference is complete.

100% CPU usage for a polling loop seems a bit of a waste of computing power. Since results come back from the GPU only after a small (but not tiny) amount of time, it does not seem necessary to poll billions of times per second; perhaps 100 times a second would be more than enough. Maybe put a millisecond sleep somewhere in that loop?
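
The throttled poll being suggested would look something like the sketch below (again only an illustration against the same hypothetical shared flag, not a patch; as the following comment notes, the actual spin happens inside the Nvidia/HIP driver rather than in ollama's code). It trades up to a millisecond of extra latency per GPU command for a nearly idle CPU:

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// waitThrottled is a throttled variant of the busy-wait loop sketched above:
// it sleeps ~1 ms between checks, so the CPU polls roughly a thousand times
// per second instead of spinning flat out.
func waitThrottled(done *atomic.Bool) {
	for !done.Load() {
		time.Sleep(time.Millisecond)
	}
}

func main() {
	var done atomic.Bool
	// Simulate the GPU finishing its work after 50 ms.
	go func() {
		time.Sleep(50 * time.Millisecond)
		done.Store(true)
	}()
	waitThrottled(&done)
	fmt.Println("GPU done; the CPU stayed mostly idle while waiting")
}
```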

@rick-github commented on GitHub (Jan 7, 2025):

To clarify further, this is a [feature](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__STREAM.html#group__CUDART__STREAM_1g82b5784f674c17c6df64affe618bf45e) of the Nvidia driver and the HIP (ROCm) driver. There are discussions about this on llama.cpp (e.g. https://github.com/ggerganov/llama.cpp/issues/8684), but it hasn't changed, so I assume there's a reason to keep this approach.

@gionkunz commented on GitHub (Feb 18, 2025):

I had the same issue: it printed "GGGGGG" after a few chats, and after that I only got "GGGGG" in new sessions as well. I had overclocked my GPU (undervolting and +1200 MHz memory clock) and suspected that this could be the problem. While stable in other cases, the AI workload seemed to destabilize the GPU in that overclocked state. As soon as I tuned it back to factory settings, the issue was gone.

@MDev-eng commented on GitHub (Feb 18, 2025):

> I had the same issue: it printed "GGGGGG" after a few chats, and after that I only got "GGGGG" in new sessions as well. I had overclocked my GPU (undervolting and +1200 MHz memory clock) and suspected that this could be the problem. While stable in other cases, the AI workload seemed to destabilize the GPU in that overclocked state. As soon as I tuned it back to factory settings, the issue was gone.

Unfortunately, in my case there is no overclocking to undo.
My setup uses two 12 GB RTX 3060s in absolutely stock condition.
So the quest for the root cause continues...

@arazdow commented on GitHub (Sep 25, 2025):

I'm getting the GGGGGGGG on a single GPU - NVIDIA Jetson Orin Nano board (ARM based). Not always, but often. Running llama3.1:8b.

@rick-github commented on GitHub (Sep 25, 2025):

https://github.com/ollama/ollama/issues/12209

Reference: github-starred/ollama#50045