[GH-ISSUE #4529] error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2' #28600

Closed
opened 2026-04-22 07:00:28 -05:00 by GiteaMirror · 22 comments
Owner

Originally created by @Anorid on GitHub (May 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4529

What is the issue?

I carefully followed the README documentation, but something still went wrong when loading the model:

time=2024-05-20T10:06:02.688+08:00 level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama2132883000/runners/cuda_v11/ollama_llama_server --model /root/autodl-tmp/models/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 33525"
time=2024-05-20T10:06:02.690+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T10:06:02.690+08:00 level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T10:06:02.691+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="140401842012160" timestamp=1716170762
INFO [main] system info | n_threads=64 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140401842012160" timestamp=1716170762 total_threads=128
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="127" port="33525" tid="140401842012160" timestamp=1716170762
llama_model_loader: loaded meta data with 21 key-value pairs and 483 tensors from /root/autodl-tmp/models/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = merge5-1
llama_model_loader: - kv 2: qwen2.block_count u32 = 40
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 13696
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 40
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-05-20T10:06:02.944+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
time=2024-05-20T10:06:03.285+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
time=2024-05-20T10:06:03.535+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "
[GIN] 2024/05/20 - 10:06:03 | 500 | 2.178464527s | 127.0.0.1 | POST "/api/chat"
time=2024-05-20T10:06:07.831+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=41 memory.available="47.3 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2024-05-20T10:06:07.832+08:00 level=INFO source=memory.go:133 msg="offload to gpu" layers.requested=-1 layers.real=41 memory.available="47.3 GiB" memory.required.full="9.7 GiB" memory.required.partial="9.7 GiB" memory.required.kv="1.6 GiB" memory.weights.total="7.2 GiB" memory.weights.repeating="6.6 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="307.0 MiB" memory.graph.partial="916.1 MiB"
time=2024-05-20T10:06:07.832+08:00 level=INFO source=server.go:320 msg="starting llama server" cmd="/tmp/ollama2132883000/runners/cuda_v11/ollama_llama_server --model /root/autodl-tmp/models/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 41 --parallel 1 --port 43339"
time=2024-05-20T10:06:07.833+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-05-20T10:06:07.833+08:00 level=INFO source=server.go:504 msg="waiting for llama runner to start responding"
time=2024-05-20T10:06:07.833+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="952d03d" tid="140283378036736" timestamp=1716170767
INFO [main] system info | n_threads=64 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="140283378036736" timestamp=1716170767 total_threads=128
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="127" port="43339" tid="140283378036736" timestamp=1716170767
llama_model_loader: loaded meta data with 21 key-value pairs and 483 tensors from /root/autodl-tmp/models/blobs/sha256-1c751709783923dab2b876d5c5c2ca36d4e205cfef7d88988df45752cb91f245 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = merge5-1
llama_model_loader: - kv 2: qwen2.block_count u32 = 40
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 13696
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 40
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,152064] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 201 tensors
llama_model_loader: - type q4_0: 281 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-05-20T10:06:08.085+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
terminate called after throwing an instance of 'std::runtime_error'
what(): error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
time=2024-05-20T10:06:08.437+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server error"
time=2024-05-20T10:06:08.656+08:00 level=WARN source=sched.go:512 msg="gpu VRAM usage didn't recover within timeout" seconds=5.120574757
time=2024-05-20T10:06:08.688+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: signal: aborted (core dumped) "

Ollama already provides qwen1.5 from 4B to 72B, so support for this tokenizer should be provided as well.

OS

Linux

GPU

Nvidia

CPU

Other

Ollama version

client version is 0.1.38

GiteaMirror added the bug label 2026-04-22 07:00:28 -05:00
Author
Owner

@Anorid commented on GitHub (May 20, 2024):

[screenshot] This is the GGUF file and the metadata of the imported model.

Author
Owner

@liduang commented on GitHub (May 20, 2024):

I have also encountered this problem, and I believe this line is the cause:
May 20 17:54:48 localhost.localdomain ollama[11885]: llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
I suspect it conflicts with a recent llama.cpp change:

llama.cpp PR #7114
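To check which pre-tokenizer a blob declares without starting the server, here is a minimal sketch of a GGUF v3 header parser, written against the public GGUF layout (magic, version, tensor count, then length-prefixed key/value pairs). Note that llama.cpp's `gguf` Python package offers the same via its `GGUFReader` class, which would be the supported route; this is just a self-contained illustration.

```python
import struct
import sys

# Scalar GGUF value types: type code -> (struct format, size in bytes)
_SCALAR = {0: ("<B", 1), 1: ("<b", 1), 2: ("<H", 2), 3: ("<h", 2),
           4: ("<I", 4), 5: ("<i", 4), 6: ("<f", 4), 7: ("<?", 1),
           10: ("<Q", 8), 11: ("<q", 8), 12: ("<d", 8)}

def _read_str(data: bytes, off: int):
    # GGUF string: u64 byte length followed by UTF-8 bytes
    (n,) = struct.unpack_from("<Q", data, off)
    off += 8
    return data[off:off + n].decode("utf-8"), off + n

def _read_value(data: bytes, off: int, vtype: int):
    if vtype == 8:                 # string
        return _read_str(data, off)
    if vtype == 9:                 # array: u32 element type + u64 count + elements
        etype, count = struct.unpack_from("<IQ", data, off)
        off += 12
        vals = []
        for _ in range(count):
            v, off = _read_value(data, off, etype)
            vals.append(v)
        return vals, off
    fmt, size = _SCALAR[vtype]
    return struct.unpack_from(fmt, data, off)[0], off + size

def read_gguf_kv(data: bytes) -> dict:
    """Parse the key/value metadata of a GGUF v3 header into a dict."""
    if data[:4] != b"GGUF":
        raise ValueError("not a GGUF file")
    _version, _n_tensors, n_kv = struct.unpack_from("<IQQ", data, 4)
    off = 4 + struct.calcsize("<IQQ")
    kv = {}
    for _ in range(n_kv):
        key, off = _read_str(data, off)
        (vtype,) = struct.unpack_from("<I", data, off)
        kv[key], off = _read_value(data, off + 4, vtype)
    return kv

if __name__ == "__main__" and len(sys.argv) > 1:
    meta = read_gguf_kv(open(sys.argv[1], "rb").read())
    print("tokenizer.ggml.pre =", meta.get("tokenizer.ggml.pre"))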

Author
Owner

@GitTurboy commented on GitHub (May 21, 2024):

I got the same error on Windows:
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from D:\lamaModels\blobs\sha256-6b22d907af67d494c1194b1bd688423945b4d3009bded2e5ecbc88d426b0c5a3 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen1___5-1___8B-Chat
llama_model_loader: - kv 2: qwen2.block_count u32 = 24
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 5504
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 16
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 1
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 121 tensors
llama_model_loader: - type f16: 170 tensors
time=2024-05-20T16:44:58.427+08:00 level=INFO source=server.go:540 msg="waiting for server to become available" status="llm server loading model"
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
time=2024-05-20T16:44:58.698+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "

Author
Owner

@binganao commented on GitHub (May 21, 2024):

As a temporary workaround, you can do what I did: when converting the model with convert-hf-to-gguf.py, comment out this line.

[screenshot]

Author
Owner

@Treedy2020 commented on GitHub (May 21, 2024):

The cause is probably that llama.cpp/convert-hf-to-gguf.py broke at some point during its rapid iteration. I hit the same problem when exporting and quantizing qwen2 with the latest version of llama.cpp, but GGUF models exported and quantized with an older version of llama.cpp work fine. You can try modifying that file as @binganao did, or simply roll back llama.cpp and try again:

cd llama.cpp
git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72

See this release for details. Then import and re-quantize the modelscope/hf folder of qwen2 according to the official ollama documentation. Hopefully this solves your problem.

Author
Owner

@xianyuxm commented on GitHub (May 22, 2024):

> The specific reason may be that llama.cpp/convert-hf-to-gguf.py encountered issues during the rapid iteration process. I experienced the same problem when exporting and quantizing qwen2 in the latest version of llama.cpp, but the exported and quantized gguf models using an older version of llama.cpp for qwen2 are usable. You can try modifying this file like @binganao did, or simply roll back the version of llama.cpp and try again:
>
> cd llama.cpp
> git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72
>
> check this release for detail. Then import and re-quantize the modelscope / hf folder of qwen2 according to the official ollama documentation. Hopefully this can solve your problem.

I tried binganao's method, but it didn't work. However, following your suggestion to roll back to a previous version successfully resolved the issue. Thank you!

Author
Owner

@bartowski1182 commented on GitHub (May 26, 2024):

I just tried a Qwen2 model I made recently with llama.cpp ./main and it loaded and generated with no issues. Are we sure this isn't ollama needing an update?

Author
Owner

@tk19911120 commented on GitHub (May 26, 2024):

I had the same issue when exporting and quantizing qwen1.5-7b-chat (Error: llama runner process has terminated: signal: aborted (core dumped)). Treedy2020's method (sudo git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72) solved it for me.
ollama version is 0.1.37

Author
Owner

@pdevine commented on GitHub (May 30, 2024):

The problem was that llama.cpp changed how the tokenizer works because of changes for llama3 tokenization. This should be fixed in 0.1.39 though, so I'll go ahead and close the issue. @Anorid LMK if it's still persisting and I can reopen.
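For anyone verifying whether their install is past the fix, a trivial sketch of the version comparison (assuming, per the comment above, that the fix landed in 0.1.39; the installed version string comes from `ollama -v`):

```python
def parse_version(v: str) -> tuple:
    """Turn a dotted version string like '0.1.38' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def has_qwen2_pretokenizer_fix(version: str, fixed_in: str = "0.1.39") -> bool:
    """True if this ollama version is at or past the release with the fix."""
    return parse_version(version) >= parse_version(fixed_in)

print(has_qwen2_pretokenizer_fix("0.1.38"))  # reporter's version -> False
```

Tuple comparison handles multi-digit components correctly (so "0.1.40" compares greater than "0.1.9"), which a plain string comparison would get wrong.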

Author
Owner

@markg85 commented on GitHub (Jun 7, 2024):

Could this be re-opened?
I have the very same issue too.

Jun 07 02:14:13 newphobos ollama[4528]: {"function":"server_params_parse","level":"INFO","line":2604,"msg":"logging to file is disabled.","tid":"129450009160768","timestamp":1717719253}
Jun 07 02:14:13 newphobos ollama[4528]: {"build":1,"commit":"952d03d","function":"main","level":"INFO","line":2821,"msg":"build info","tid":"129450009160768","timestamp":1717719253}
Jun 07 02:14:13 newphobos ollama[4528]: {"function":"main","level":"INFO","line":2828,"msg":"system info","n_threads":16,"n_threads_batch":-1,"system_info":"AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"129450009160768","timestamp":1717719253,"total_threads":32}
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: loaded meta data with 21 key-value pairs and 339 tensors from /var/lib/ollama/.ollama/models/blobs/sha256-43f7a214e5329f672bb05404cfba1913cbb70fdaa1a17497224e1925046b0ed5 (version GGUF V3 (latest))
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   0:                       general.architecture str              = qwen2
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   1:                               general.name str              = Qwen2-7B-Instruct
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - kv  20:               general.quantization_version u32              = 2
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - type  f32:  141 tensors
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - type q4_0:  197 tensors
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_loader: - type q6_K:    1 tensors
Jun 07 02:14:13 newphobos ollama[4379]: llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
Jun 07 02:14:13 newphobos ollama[4379]: llama_load_model_from_file: exception loading model
Jun 07 02:14:13 newphobos ollama[4379]: terminate called after throwing an instance of 'std::runtime_error'
Jun 07 02:14:13 newphobos ollama[4379]:   what():  error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'

Now there's something strange going on too.

❯ ollama --version
ollama version is 0.1.34

While I have 0.1.41 installed (Arch Linux):

❯ pacman -Qi ollama
Name            : ollama-rocm
Version         : 0.1.41-1
Description     : Create, run and share large language models (LLMs) with ROCm
Architecture    : x86_64
URL             : https://github.com/ollama/ollama
Licenses        : MIT
Groups          : None
Provides        : ollama
Depends On      : hipblas
Optional Deps   : None
Required By     : None
Optional For    : None
Conflicts With  : ollama
Replaces        : None
Installed Size  : 66.50 MiB
Packager        : Lukas Fleischer <lfleischer@archlinux.org>
Build Date      : Sun 02 Jun 2024 17:51:45 CEST
Install Date    : Fri 07 Jun 2024 02:22:08 CEST
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : Signature

So upon further inspection, this is how it's built:
https://gitlab.archlinux.org/archlinux/packaging/packages/ollama/-/blob/main/PKGBUILD?ref_type=heads

Which builds tag 476fb8e892, which is the 0.1.41 tag: https://github.com/ollama/ollama/releases/tag/v0.1.41

The llama.cpp version is tag 5921b8f089, which is just a week old.

Am I missing something here to get qwen2 working?
The version mismatch is weird for sure, but that might be its own bug?


@cyp0633 commented on GitHub (Jun 7, 2024):

> Now there's something strange going on too.
>
> ❯ ollama --version
> ollama version is 0.1.34

Did you reboot your machine or do sudo systemctl restart ollama after upgrading? The running ollama service is not automatically upgraded.


@markg85 commented on GitHub (Jun 7, 2024):

@cyp0633 yes! :)

I did both (and a couple of times); it didn't help.
Let's not spend too much time on the version thing, but let's check one thing.

Could someone else run ollama --version on a 0.1.41 release and post your result here? If anyone else has this bug too (wrong version number for the release you're using), I'll make a new issue for that. If this can't be reproduced and the command matches your install, then there's something seriously wrong with my setup and I'll have to dig deep to figure it out.


@I321065 commented on GitHub (Jun 7, 2024):

same issue happened to me


@markg85 commented on GitHub (Jun 7, 2024):

Issue can be closed again.
I had installed ollama using the script on the ollama site.
And I had it installed through my package manager.

Removing the one installed through the script made things work. Version is as expected now.
100% user error, sorry for the noise!
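The dual-install mixup above (a script install shadowing the package manager's copy) can be caught by listing every `ollama` executable on PATH, e.g. with `type -a ollama` in a shell. The same check as a minimal Python sketch (assumption: `executables_on_path` is a hypothetical helper written for illustration, not a real tool):

```python
import os

def executables_on_path(name: str) -> list:
    """Return every executable file named `name` found on PATH, in
    lookup order. More than one hit means two installs may be
    shadowing each other (e.g. /usr/local/bin vs /usr/bin)."""
    hits = []
    for directory in os.environ.get("PATH", "").split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits
```

Two or more hits is the symptom markg85 ran into: the shell picks the first match, which need not be the copy the package manager updated.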


@rallg0535 commented on GitHub (Jun 10, 2024):

Updating ollama to version 0.1.42 fixed it.


@Fau57 commented on GitHub (Jun 10, 2024):

I was using LM Studio and just had to update, btw.


@ligson commented on GitHub (Jun 12, 2024):

time=2024-06-12T17:45:14.644+08:00 level=INFO source=sched.go:338 msg="loaded runners" count=1
time=2024-06-12T17:45:14.644+08:00 level=INFO source=server.go:529 msg="waiting for llama runner to start responding"
time=2024-06-12T17:45:14.644+08:00 level=INFO source=server.go:567 msg="waiting for server to become available" status="llm server error"
INFO [wmain] build info | build=2770 commit="952d03d" tid="32236" timestamp=1718185514
INFO [wmain] system info | n_threads=10 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="32236" timestamp=1718185514 total_threads=20
INFO [wmain] HTTP server listening | hostname="127.0.0.1" n_threads_http="19" port="57166" tid="32236" timestamp=1718185514
llama_model_loader: loaded meta data with 21 key-value pairs and 338 tensors from E:\chatglm\ollama\models\blobs\sha256-405b56374e02b21122ae1469db646be0617c02928fd78e246723ebbb98dbca3e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.name str = Qwen2-1.5B-Instruct
llama_model_loader: - kv 2: qwen2.block_count u32 = 28
llama_model_loader: - kv 3: qwen2.context_length u32 = 32768
llama_model_loader: - kv 4: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 5: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 6: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 7: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 8: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 9: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q4_0: 196 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'qwen2'
llama_load_model_from_file: exception loading model
time=2024-06-12T17:45:15.283+08:00 level=ERROR source=sched.go:344 msg="error loading llama server" error="llama runner process has terminated: exit status 0xc0000409 "

ollama version:
ollama version is 0.1.43

Windows 11


@QiuZiXian commented on GitHub (Jun 18, 2024):

Upgrade ollama.


@jmorganca commented on GitHub (Jun 24, 2024):

Hi folks, sorry about the errors. Qwen 2 requires a newer version of Ollama: https://ollama.com/download. Make sure to update and let me know if this issue persists.


@sorenchiron commented on GitHub (Jul 6, 2024):

> git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72

This solution works for me. A similar error breaks LM Studio's pipeline too.
For those who cloned only one commit:

git fetch https://github.com/ggerganov/llama.cpp.git 46e12c4692a37bdd31a0432fc5153d7d22bc7f72
git reset --hard 46e12c4692a37bdd31a0432fc5153d7d22bc7f72

@Rodert commented on GitHub (Jul 8, 2024):

How much hardware is needed to run llama3 8b? Is 4 GB of memory enough?

I get an error: `ollama Error: llama runner process has terminated: signal: aborted (core dum`.


@hemangjoshi37a commented on GitHub (Jul 15, 2024):

I am running it in a Docker container, so how am I supposed to run the `git reset --hard` command? I don't have access to that, since I am running in a cloud Docker container. How do I resolve this in my situation?

Reference: github-starred/ollama#28600