[GH-ISSUE #5956] Phi3-mini-4k-instruct will need to be updated for latest llama.cpp #65759

Closed
opened 2026-05-03 22:33:39 -05:00 by GiteaMirror · 7 comments

Originally created by @kaetemi on GitHub (Jul 25, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5956

See https://github.com/ggerganov/llama.cpp/pull/8627

The blob from the ollama repository fails to load on the latest llama.cpp.

```
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32064]   = [-1000.000000, -1000.000000, -1000.00...
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32064]   = [3, 3, 4, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 32000
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 32000
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  24:               tokenizer.ggml.add_eos_token bool             = false
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  25:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - kv  26:               general.quantization_version u32              = 2
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - type  f32:   67 tensors
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - type q4_0:  129 tensors
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_loader: - type q6_K:    1 tensors
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_model_load: error loading model: error loading model hyperparameters: key not found in model: phi3.attention.sliding_window
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_load_model_from_file: failed to load model
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: llama_init_from_gpt_params: error: failed to load model '/root/.ollama/models/blobs/sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a'
0|pv_scheduler  | llama-server [phi3-3.8b:1280:1]: free(): invalid pointer
```
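
The loader names the missing key explicitly: `phi3.attention.sliding_window`. As a quick diagnostic, the `gguf-dump` helper installed by llama.cpp's `gguf` pip package can list a blob's metadata; a minimal sketch using the blob path from the log above (output formatting varies by gguf-py version):

```
pip install gguf  # llama.cpp's gguf-py, which provides the gguf-dump helper
# An updated blob should list phi3.attention.sliding_window among its KV
# pairs; the blob above will print nothing here.
gguf-dump /root/.ollama/models/blobs/sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a | grep sliding_window
```
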
GiteaMirror added the model label 2026-05-03 22:33:39 -05:00

@mxyng commented on GitHub (Jul 30, 2024):

Thanks for the issue. phi3 models are now updated: https://ollama.com/library/phi3
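
Since the library entry has been refreshed, re-pulling should replace the stale local blob; a sketch assuming the default `phi3` tag:

```
ollama pull phi3
```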


@Arlodotexe commented on GitHub (Jul 31, 2024):

This seems more like an upstream error in llama.cpp than an issue with Ollama. Other applications that use llama.cpp, such as LM Studio, are also affected.


@kaetemi commented on GitHub (Jul 31, 2024):

@Arlodotexe No, there's additional data required in the GGUF to make Phi3 work fully correctly. The latest llama.cpp adds it and requires it. It's just an update of the model file. :)

Thanks for the update! :)


@Arlodotexe commented on GitHub (Jul 31, 2024):

Thanks @kaetemi, I think I understand now what happened and why it might not make sense to have the model load even if it worked before. I've updated the ticket over here (https://github.com/ggerganov/llama.cpp/pull/8627); it seems to be a matter of moving the ecosystem along now.


@thant-eightvectors commented on GitHub (Jul 31, 2024):

```
./llama-server -m ./models/Phi-3-mini-4k-instruct-q4.gguf -c 2048
INFO [                    main] build info | tid="0x1f7694c00" timestamp=1722417880 build=3494 commit="268c5660"
INFO [                    main] system info | tid="0x1f7694c00" timestamp=1722417880 n_threads=4 n_threads_batch=-1 total_threads=8 system_info="AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 24 key-value pairs and 195 tensors from ./models/Phi-3-mini-4k-instruct-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_loader: - kv   7:               phi3.attention.head_count_kv u32              = 32
llama_model_loader: - kv   8:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv   9:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32064]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32064]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32064]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 32000
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 32000
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_K:   81 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   17 tensors
llama_model_load: error loading model: error loading model hyperparameters: key not found in model: phi3.attention.sliding_window
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model './models/Phi-3-mini-4k-instruct-q4.gguf'
ERR [              load_model] unable to load model | tid="0x1f7694c00" timestamp=1722417880 model="./models/Phi-3-mini-4k-instruct-q4.gguf"
```

When I run the above model in llama.cpp, I get the "key not found in model" error shown above. How can I fix it?


@themanyone commented on GitHub (Aug 1, 2024):

> When I run the above model in llama.cpp, I get the "key not found in model" error shown above. How can I fix it?

You may want to build or download a new .gguf with the sliding-window key included. The llama.cpp developers are not advising rolling back to 916248af1f3c16abd7408de848e025da095c621c. Even though phi3 might have worked before, the implementation was less than ideal.
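
For the "build a new .gguf" route, a sketch of re-converting from the original Hugging Face checkpoint with llama.cpp's converter so the new metadata gets emitted (the paths and the quantization type here are placeholders):

```
# Convert the HF checkpoint to an f16 GGUF, then quantize; recent llama.cpp
# trees ship convert_hf_to_gguf.py and the llama-quantize binary.
python convert_hf_to_gguf.py ./Phi-3-mini-4k-instruct --outfile phi3-f16.gguf --outtype f16
./llama-quantize phi3-f16.gguf phi3-q4_k_m.gguf Q4_K_M
```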


@zoldaten commented on GitHub (Aug 2, 2024):

Quick fix to roll back:

```
cd llama.cpp
# pin the pre-change revision referenced above, then rebuild
git checkout 916248af1f3c16abd7408de848e025da095c621c
make
```
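
As an alternative to pinning an old build, llama.cpp's generic KV-override flag can supply the missing key at load time; a sketch, where the 2047 value is an assumption taken from Phi-3-mini-4k's published config and should be verified against the model card:

```
# Supply the missing hyperparameter without re-downloading the model.
# 2047 is an assumed sliding-window size; verify before relying on it.
./llama-server -m ./models/Phi-3-mini-4k-instruct-q4.gguf -c 2048 \
  --override-kv phi3.attention.sliding_window=int:2047
```
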
Reference: github-starred/ollama#65759