[GH-ISSUE #7113] llama runner process has terminated: error loading model: error loading model vocabulary: invalid string position #66576

Closed
opened 2026-05-04 07:28:34 -05:00 by GiteaMirror · 7 comments

Originally created by @ImValll on GitHub (Oct 7, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7113

What is the issue?

I fine-tuned the Gemma 2 model and converted it to GGUF. I'm trying to run it with the code below, but it isn't working. Do you have any idea?

```python
import ollama

import asyncio
from ollama import AsyncClient

async def chat(human_message):
    message = {'role': 'human', 'content': human_message}
    async for part in await AsyncClient().chat(model='Gemma_chat_bot', messages=[message], stream=True):
        print(part['message']['content'], end='', flush=True)

modelfile = '''
FROM ./llama.cpp/quantized_model/FP16.gguf
'''
ollama.create(model='Gemma_chat_bot', modelfile=modelfile)

# "Salut, comment puis-je faire pour envoyer un mail ?" = "Hi, how can I send an email?"
asyncio.run(chat('Salut, comment puis-je faire pour envoyer un mail ?'))
```
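
Since the failure happens while loading the vocabulary, it may help to read the converted file back and check that the tokenizer metadata survived the conversion. A minimal sketch, assuming the `gguf` package that ships with llama.cpp (`pip install gguf`) and the same `FP16.gguf` path as in the Modelfile above:

```python
from gguf import GGUFReader

# Read the GGUF metadata without loading any tensors.
reader = GGUFReader('./llama.cpp/quantized_model/FP16.gguf')

# Print every tokenizer-related key and its declared value type(s);
# a missing or oddly typed key here would point at a bad conversion.
for name, field in reader.fields.items():
    if name.startswith('tokenizer.'):
        print(name, [t.name for t in field.types])
```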

OS

Windows

GPU

AMD

CPU

AMD

Ollama version

ollama version is 0.3.12

GiteaMirror added the bug label 2026-05-04 07:28:34 -05:00

@rick-github commented on GitHub (Oct 7, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may be helpful.

What's the result of `ollama run Gemma_chat_bot hello`?


@ImValll commented on GitHub (Oct 8, 2024):

Here are my server logs:

```
[GIN] 2024/10/08 - 09:31:35 | 200 |     23.0138ms |       127.0.0.1 | GET      "/api/tags"
time=2024-10-08T09:31:36.610+02:00 level=INFO source=server.go:103 msg="system memory" total="7.4 GiB" free="5.0 GiB" free_swap="39.8 GiB"
time=2024-10-08T09:31:36.614+02:00 level=INFO source=memory.go:326 msg="offload to cpu" layers.requested=-1 layers.model=19 layers.offload=0 layers.split="" memory.available="[5.0 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.9 GiB" memory.required.partial="0 B" memory.required.kv="144.0 MiB" memory.required.allocations="[4.9 GiB]" memory.weights.total="3.8 GiB" memory.weights.repeating="2.9 GiB" memory.weights.nonrepeating="1000.0 MiB" memory.graph.full="504.0 MiB" memory.graph.partial="914.2 MiB"
time=2024-10-08T09:31:36.631+02:00 level=INFO source=server.go:388 msg="starting llama server" cmd="C:\\Users\\vhenry\\AppData\\Local\\Programs\\Ollama\\lib\\ollama\\runners\\cpu_avx2\\ollama_llama_server.exe --model C:\\Users\\vhenry\\.ollama\\models\\blobs\\sha256-79d06c9d0600b3c7dff05ecaad95e35d27a3a7747a86d337d8eb24e31d26b989 --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --parallel 4 --port 52402"
time=2024-10-08T09:31:36.637+02:00 level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-08T09:31:36.638+02:00 level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-10-08T09:31:36.641+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=3670 commit="4dc7409e" tid="4264" timestamp=1728372696
INFO [wmain] system info | n_threads=6 n_threads_batch=6 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="4264" timestamp=1728372696 total_threads=12
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="52402" tid="4264" timestamp=1728372696
llama_model_loader: loaded meta data with 35 key-value pairs and 164 tensors from C:\Users\vhenry\.ollama\models\blobs\sha256-79d06c9d0600b3c7dff05ecaad95e35d27a3a7747a86d337d8eb24e31d26b989 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma 2b It
llama_model_loader: - kv   3:                       general.organization str              = Google
llama_model_loader: - kv   4:                           general.finetune str              = it
llama_model_loader: - kv   5:                           general.basename str              = gemma
llama_model_loader: - kv   6:                         general.size_label str              = 2B
llama_model_loader: - kv   7:                       gemma.context_length u32              = 8192
llama_model_loader: - kv   8:                     gemma.embedding_length u32              = 2048
llama_model_loader: - kv   9:                          gemma.block_count u32              = 18
llama_model_loader: - kv  10:                  gemma.feed_forward_length u32              = 16384
llama_model_loader: - kv  11:                 gemma.attention.head_count u32              = 8
llama_model_loader: - kv  12:              gemma.attention.head_count_kv u32              = 1
llama_model_loader: - kv  13:     gemma.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                 gemma.attention.key_length u32              = 256
llama_model_loader: - kv  15:               gemma.attention.value_length u32              = 256
llama_model_loader: - kv  16:                          general.file_type u32              = 1
llama_model_loader: - kv  17:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  18:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  19:                      tokenizer.ggml.tokens arr[str,256000]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
time=2024-10-08T09:31:36.898+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: - kv  20:                      tokenizer.ggml.scores arr[f32,256000]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  21:                  tokenizer.ggml.token_type arr[i32,256000]  = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  23:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  24:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 1
llama_model_loader: - kv  26:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  27:               tokenizer.ggml.add_eos_token bool             = true
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {{ bos_token }}{% if messages[0]['rol...
llama_model_loader: - kv  29:             tokenizer.ggml.prefix_token_id u32              = 67
llama_model_loader: - kv  30:             tokenizer.ggml.suffix_token_id u32              = 69
llama_model_loader: - kv  31:             tokenizer.ggml.middle_token_id u32              = 68
llama_model_loader: - kv  32:                tokenizer.ggml.eot_token_id u32              = 107
llama_model_loader: - kv  33:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   37 tensors
llama_model_loader: - type  f16:  127 tensors
llm_load_vocab: special tokens cache size = 217
llama_model_load: error loading model: error loading model vocabulary: invalid string position
llama_load_model_from_file: exception loading model
time=2024-10-08T09:31:37.364+02:00 level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server not responding"
time=2024-10-08T09:31:38.548+02:00 level=ERROR source=sched.go:455 msg="error loading llama server" error="llama runner process has terminated: error loading model: error loading model vocabulary: invalid string position"
[GIN] 2024/10/08 - 09:31:38 | 500 |    2.6279301s |       127.0.0.1 | POST     "/api/chat"
```

`ollama run Gemma_chat_bot hello` gives me exactly the same error.

However, if I try to run a pretrained model like llama3.1, it works, so I don't think the bug comes from my config.
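
The "invalid string position" text looks like the message of the `std::out_of_range` that MSVC's `std::string` throws, raised here while llama.cpp parses the vocabulary, which again points at the converted file rather than the Ollama install. For scripting, the 500 from `/api/chat` surfaces in the Python client as `ollama.ResponseError`; a minimal sketch (assuming the same `Gemma_chat_bot` tag) that prints the server's error text instead of a traceback:

```python
import ollama

try:
    ollama.chat(model='Gemma_chat_bot',
                messages=[{'role': 'user', 'content': 'hello'}])
except ollama.ResponseError as e:
    # e.error carries the server-side message, e.g.
    # "llama runner process has terminated: error loading model ..."
    print(f'HTTP {e.status_code}: {e.error}')
```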


@rick-github commented on GitHub (Oct 8, 2024):

What did you use to finetune the model?


@ImValll commented on GitHub (Oct 8, 2024):

I followed [this tutorial](https://medium.com/the-ai-forum/instruction-fine-tuning-gemma-2b-on-medical-reasoning-and-convert-the-finetuned-model-into-gguf-844191f8d329), but I ran it on my CPU instead of my GPU because I can't use my GPU.
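
If the tutorial's LoRA setup was followed, one common pitfall is converting the adapter checkpoint directly instead of a merged model, which can leave the tokenizer metadata incomplete. A minimal sketch of merging and saving before conversion, assuming PEFT/Transformers and hypothetical local paths:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained('google/gemma-2b-it')
model = PeftModel.from_pretrained(base, './gemma-finetuned-adapter')  # hypothetical adapter dir

# Fold the LoRA weights into the base model so the converter sees plain weights.
merged = model.merge_and_unload()

# Save the weights *and* the tokenizer so the GGUF converter
# can write a complete vocabulary.
merged.save_pretrained('./gemma-merged')
AutoTokenizer.from_pretrained('google/gemma-2b-it').save_pretrained('./gemma-merged')
```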


@ImValll commented on GitHub (Oct 8, 2024):

[link of the tutorial](https://medium.com/the-ai-forum/instruction-fine-tuning-gemma-2b-on-medical-reasoning-and-convert-the-finetuned-model-into-gguf-844191f8d329)


@FAyyn commented on GitHub (Oct 21, 2024):

I'm also having the same issue. Have you solved it yet?


@ImValll commented on GitHub (Feb 17, 2025):

No, sorry, I didn't solve it, so I just gave up.

Reference: github-starred/ollama#66576