[GH-ISSUE #5245] Allow importing multi-file GGUF models #29042

Open
opened 2026-04-22 07:39:33 -05:00 by GiteaMirror · 98 comments

Originally created by @jmorganca on GitHub (Jun 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5245

What is the issue?

Currently Ollama can import GGUF files (see https://github.com/ollama/ollama/blob/main/docs/import.md). However, larger models are sometimes split into separate files. Ollama should support loading multiple GGUF files, similar to how it loads safetensors files.
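
For reference, a rough sketch of what a single-file GGUF import looks like today (the model name and file path below are placeholders):

```
# Hypothetical example: importing one local GGUF file into Ollama.
# Write a minimal Modelfile that points at the file...
cat > Modelfile <<'EOF'
FROM ./my-model.Q4_K_M.gguf
EOF

# ...then register and run it.
ollama create my-model -f Modelfile
ollama run my-model
```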

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 07:39:33 -05:00

@gsoul commented on GitHub (Aug 22, 2024):

Just in case someone finds this issue, like I did a few weeks ago, without knowing any workaround: currently, probably one of the easiest ways to import a multi-file GGUF into Ollama is to:

  1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)
  2. run command:
    ./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
    for example
    ./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf

Hope this will help somebody.


@nauen commented on GitHub (Aug 23, 2024):

> Just in case someone finds this issue, like I did a few weeks ago, without knowing any workaround: currently, probably one of the easiest ways to import a multi-file GGUF into Ollama is to:
>
>   1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)
>   2. run command:
>     ./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
>     for example
>     ./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf
>
> Hope this will help somebody.

yes it does <3


@werruww commented on GitHub (Oct 17, 2024):

Does Ollama support fragmented models? Important: they must be merged before running in Ollama via a Modelfile.


@werruww commented on GitHub (Oct 17, 2024):

llama.cpp and llama-cpp-python can run multi-part models.

Can Ollama run them?


@werruww commented on GitHub (Oct 17, 2024):

/content# ollama run hf.co/goodasdgood/dracarys2-72b-instruct
pulling manifest
Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245
/content#


@mitchross commented on GitHub (Oct 24, 2024):

https://x.com/reach_vb/status/1846545312548360319


@ahmetkca commented on GitHub (Nov 11, 2024):

Having this issue with Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q8_0


@rar0n commented on GitHub (Nov 12, 2024):

> Having this issue with Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q8_0

Try what gsoul suggests above!
I just did and it worked for qwen2.5-coder-14b-instruct-q5_k_m-00001-of-00002.gguf. Didn't have to specify any other input files.
(Thanks gsoul!)

Still, it'd be nice if ollama could do this natively ofc.


@Kamayuq commented on GitHub (Nov 14, 2024):

Even though the workaround works, it is very cumbersome, especially if you have to update the model. And you also have to create a manifest, AFAIK. Ollama should really do this locally, completely transparently to the user.


@rotvaldi commented on GitHub (Nov 17, 2024):

ugh "the falling"


@DrewGalbraith commented on GitHub (Nov 29, 2024):

> Just in case someone finds this issue, like I did a few weeks ago, without knowing any workaround: currently, probably one of the easiest ways to import a multi-file GGUF into Ollama is to:
>
>   1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)
>   2. run command:
>     ./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
>     for example
>     ./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf
>
> Hope this will help somebody.

For anyone else wondering how this merges all the other parts of the model, or whether you have to run the command once per split, this Reddit post (https://www.reddit.com/r/LocalLLaMA/comments/1cf6n18/comment/l1o0opp/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) laid it out. The gguf-split utility simply figures out the rest of the shards to merge given the name of the first. I assume it requires them to be named basically the same, with only the numbers deviating.
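
As a rough illustration (the filenames here are hypothetical), the shards only need to share a common prefix with the standard -NNNNN-of-NNNNN.gguf suffix, and only the first shard is named on the command line:

```
# Hypothetical shard set in the current directory:
#   mymodel-00001-of-00003.gguf
#   mymodel-00002-of-00003.gguf
#   mymodel-00003-of-00003.gguf

# Only the first shard is passed; llama-gguf-split discovers the rest
# from the -00001-of-00003 naming pattern and writes one merged file.
./llama-gguf-split --merge mymodel-00001-of-00003.gguf mymodel-merged.gguf
```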


@BornSupercharged commented on GitHub (Dec 17, 2024):

In case you want a way to easily handle this, add this to your .bashrc or .zshrc file:

add_gguf() {
    set -x
    args="$@"
    cd /Volumes/MP44/Ollama
    s="${args/ollama run /}"
    IFS=: read -r ggufUrl quant <<< "$s"
    IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl"
    echo "[$ggufUrl] [$gguf] [$quant]"
    ggufN="${gguf/-GGUF/}"
    wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00001-of-00002.gguf"
    wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00002-of-00002.gguf"
    bash -c "llama-gguf-split --merge ${ggufN}-${quant}-00001-of-00002.gguf ${ggufN}-${quant}.guff"
    echo "FROM /Volumes/MP44/Ollama/${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model"
    bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model"
    bash -c "rm -f ${ggufN}*"
}

Usage example:
add_gguf hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K

It also handles the edge case where you've copied the full command out of hugging face (strips "ollama run" out):
add_gguf ollama run hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K

What it does:

  1. Changes directory to /Volumes/MP44/Ollama (you can customize this for your own external drive, or wherever you want to temporarily store the downloaded .guff files)
  2. Downloads the two .gguf files for the model using wget
  3. Executes llama-gguf-split to merge the .gguf files into a single .guff
  4. Creates the .model file
  5. Executes ollama create using the .model file
  6. Cleans up the downloaded files

After executing the command, you can run your model like this:
ollama run EVA-LLaMA-3.33-70B-v0.1-Q6_K

Verify the model information by entering:
/show info

  Model
    architecture        llama
    parameters          70.6B
    context length      131072
    embedding length    8192
    quantization        Q6_K

To stop, enter:
/bye


@AlgorithmicKing737 commented on GitHub (Dec 27, 2024):

> In case you want a way to easily handle this, add this to your .bashrc or .zshrc file:
>
> […add_gguf function, usage examples, and sample output quoted in full above…]

where are the .bashrc or .zshrc files located?


@BornSupercharged commented on GitHub (Dec 30, 2024):

@AlgorithmicKing in your user directory, i.e. ~/.bashrc


@ngxson commented on GitHub (Jan 16, 2025):

Upstream llama.cpp added a new API called llama_model_load_from_splits that may help implement this feature in Ollama. Let's hope they will work on this in the next version!


@mattapperson commented on GitHub (Jan 23, 2025):

Here is an updated version of @BornSupercharged's script that:

  • Accounts for splits that have an unknown number of parts.
  • Works on Mac
  • Does some pre-flight checks to ensure everything works as expected
  • Downloads files in parallel for faster pulls
  • Returns the user to their previous directory when done
add_gguf() {
    set -x
    args="$@"
 
    # Check if ollama is installed and get version
    version_output=$(ollama --version 2>&1)
    
    # Check if the output contains "Warning"
    if echo "$version_output" | grep -q "Warning"; then
        echo "Error: Ollama version check failed with warning: $version_output"
        echo "Please be sure that ollama client and server are installed, running, and on the same version."
        return 1
    fi

    # Check if llama-gguf-split is available in PATH
    if ! command -v llama-gguf-split &> /dev/null; then
        echo "Error: llama-gguf-split not found in PATH"
        echo "Please install the latest version of llama.cpp first"
        return 1
    fi

    # Save current directory to restore it later
    current_dir=$(pwd)
    
    # Use a more standard Mac location
    OLLAMA_DIR="$HOME/Library/Ollama"
    mkdir -p "$OLLAMA_DIR"
    cd "$OLLAMA_DIR" || return
    s="${args/ollama run /}"
    IFS=: read -r ggufUrl quant <<< "$s"
    IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl"
    echo "[$ggufUrl] [$gguf] [$quant]"
    ggufN="${gguf/-GGUF/}"

    git clone --no-checkout --depth 1 "https://${ggufUrl}"
    cd "${gguf}/.git"
    files=$(git ls-tree --full-name --name-only -r HEAD)
    cd ".."

    
    # Create array to store wget commands
    wget_commands=()
    
    while IFS= read -r file; do
      if [[ "$file" == *"$quant"* ]]; then
        if [[ -z "${firstFileName}" ]] && [[ "${file}" == *"01-of"* ]]; then
            firstFileName="${file}"
        fi
        if [[ -f "${file}" ]]; then
            echo "File ${file} already exists, skipping..."
            continue
        fi
        echo "Downloading https://${ggufUrl}/resolve/main/${file}"
        wget_commands+=("wget https://${ggufUrl}/resolve/main/${file}")
      fi
    done <<< "$files"
    
    # Execute all wget commands in parallel
    printf "%s\n" "${wget_commands[@]}" | xargs -P 8 -I {} bash -c "{}"


    bash -c "llama-gguf-split --merge ${firstFileName} ${ggufN}-${quant}.guff"
    echo "FROM ./${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model"
    bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model"
    bash -c "rm -f ${ggufN}*"

    cd "$current_dir" || return
}

@sakthi-geek commented on GitHub (Jan 23, 2025):

Why is this not natively supported yet? Is it being worked on for future updates?


@renato-umeton commented on GitHub (Jan 26, 2025):

Please work on this guys <3 we love ollama and want to continue using it!

ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M

pulling manifest
Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245 (this page)


@AlgorithmicKing737 commented on GitHub (Jan 27, 2025):

> Please work on this guys <3 we love ollama and want to continue using it!
>
> ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
>
> pulling manifest Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: #5245 (this page)

I 100% agree with you, but if you want to pull the full version of DeepSeek R1, it is already available in the Ollama library (https://ollama.com/library/deepseek-r1:671b): you can run ollama run deepseek-r1:671b


@LeiHao0 commented on GitHub (Jan 28, 2025):

> > Please work on this guys <3 we love ollama and want to continue using it!
> >
> > ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
> >
> > pulling manifest Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: #5245 (this page)
>
> I 100% agree with you, but if you want to pull the full version of DeepSeek R1, it is already available in the Ollama library (https://ollama.com/library/deepseek-r1:671b): you can run ollama run deepseek-r1:671b

Will Ollama consider supporting unsloth/DeepSeek-R1-GGUF, the 1.58-bit + 2-bit Dynamic Quants?


@corticalstack commented on GitHub (Jan 28, 2025):

Watching this space for solutions to run lower quantized versions of DeepSeek-R1 that mere mortals can self host, e.g. 1.58-bit + 2-bit Dynamic Quants?


@Tanote650 commented on GitHub (Jan 29, 2025):

Implementing this in Ollama and giving a large number of less experienced users access to the models would be great!


@LeiHao0 commented on GitHub (Jan 29, 2025):

Good News:

I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on my M2 Ultra, using the instructions from https://unsloth.ai/blog/deepseekr1-dynamic.

Bad News:

Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to run. More importantly, even the successful 1.58-bit model generates nonsensical output despite multiple attempts.

Image

@corticalstack commented on GitHub (Jan 29, 2025):

Hoping some smaller DeepSeek quants / multi-file GGUF support provided soon, given the HUGE interest in this model right now.


@fserb commented on GitHub (Jan 29, 2025):

@LeiHao0 can you confirm you were able to run it with ollama? If so, can you share your Modelfile?


@LeiHao0 commented on GitHub (Jan 30, 2025):

> @LeiHao0 can you confirm you were able to run it with ollama? If so, can you share your Modelfile?

You need to merge these split model files into a single file, then you can load it with ollama.
details in here https://unsloth.ai/blog/deepseekr1-dynamic

./llama.cpp/llama-gguf-split --merge DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf merged_file.gguf


@verygreen commented on GitHub (Jan 30, 2025):

@LeiHao0 don't forget to also apply this context fix if you want the output to actually run to completion: https://github.com/ollama/ollama/issues/5975#issuecomment-2295330804 (obviously don't use 24k context unless you have gobs of video RAM).

But overall I am having bad results with this particular quant in Ollama: even the thinking tags don't appear, and the model seems to ramble on endlessly until eventually cutting off.


@dmatora commented on GitHub (Feb 1, 2025):

What about TEMPLATE section for DeepSeek Modelfile?
Doesn't it need one?


@dmatora commented on GitHub (Feb 1, 2025):

Ok, found a tip at issue #8571

FROM merged_file.gguf
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>{{- end }}
{{- end }}"""
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
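
If it helps, the usual follow-up (assuming the Modelfile above is saved as Modelfile next to merged_file.gguf; the model name below is arbitrary) would be along these lines:

```
# Register the merged GGUF plus the template with Ollama, then run it.
ollama create deepseek-r1-merged -f Modelfile
ollama run deepseek-r1-merged
```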

@mistrjirka commented on GitHub (Feb 6, 2025):

> > @LeiHao0 can you confirm you were able to run it with ollama? If so, can you share your Modelfile?
>
> You need to merge these split model files into a single file, then you can load it with ollama. details in here https://unsloth.ai/blog/deepseekr1-dynamic
>
> ./llama.cpp/llama-gguf-split --merge DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf merged_file.gguf

It seems that Ollama does not support IQ1 quantization, so some update to Ollama may be needed for 1-bit quantization to work. That is odd, because llama.cpp does support 1-bit quantization; Ollama errors out on a wrong magic number.
Is there a technical explanation for why implementing multi-file GGUF is difficult, or why Ollama does not support 1-bit quantization? Is it a lack of developer time, or something more fundamental?


@peng3502 commented on GitHub (Feb 18, 2025):

There are 10 files for DeepSeek:

DeepSeek-R1-Q5_K_M-00001-of-00010.gguf
DeepSeek-R1-Q5_K_M-00002-of-00010.gguf
DeepSeek-R1-Q5_K_M-00003-of-00010.gguf
DeepSeek-R1-Q5_K_M-00004-of-00010.gguf
DeepSeek-R1-Q5_K_M-00005-of-00010.gguf
DeepSeek-R1-Q5_K_M-00006-of-00010.gguf
DeepSeek-R1-Q5_K_M-00007-of-00010.gguf
DeepSeek-R1-Q5_K_M-00008-of-00010.gguf
DeepSeek-R1-Q5_K_M-00009-of-00010.gguf
DeepSeek-R1-Q5_K_M-00010-of-00010.gguf

An error was thrown after the llama merge:

llama-gguf-split --merge DeepSeek-R1-Q5_K_M-00001-of-00010.gguf DeepSeek-R1-Q5.gguf
gguf_merge: DeepSeek-R1-Q5_K_M-00001-of-00010.gguf -> DeepSeek-R1-Q5.gguf
terminate called after throwing an instance of 'std::__ios_failure'
  what(): basic_ios::clear: iostream error

Does anyone else have a solution?


@MaoJianwei commented on GitHub (Feb 21, 2025):

> Just in case someone finds this issue, like I did a few weeks ago, without knowing any workaround: currently, probably one of the easiest ways to import a multi-file GGUF into Ollama is to:
>
>   1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)
>   2. run command:
>     ./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
>     for example
>     ./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf
>
> Hope this will help somebody.

This works for me! Thanks!

./llama-gguf-split --merge DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf DeepSeek-R1-Distill-Qwen-32B-F16.gguf

gguf_merge: DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf -> DeepSeek-R1-Distill-Qwen-32B-F16.gguf
gguf_merge: reading metadata DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf done
gguf_merge: reading metadata DeepSeek-R1-Distill-Qwen-32B-F16-00002-of-00002.gguf done
gguf_merge: writing tensors DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf done
gguf_merge: writing tensors DeepSeek-R1-Distill-Qwen-32B-F16-00002-of-00002.gguf done
gguf_merge: DeepSeek-R1-Distill-Qwen-32B-F16.gguf merged from 2 split with 771 tensors.


@thalesluoyx commented on GitHub (Mar 4, 2025):

> Good News:
>
> I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on my M2 Ultra, using the instructions from https://unsloth.ai/blog/deepseekr1-dynamic.
>
> Bad News:
>
> Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to run. More importantly, even the successful 1.58-bit model generates nonsensical output despite multiple attempts.
>
> Image

I face a similar issue: the model generates nonsensical output, even though there is only one .gguf file downloaded. May I know if your problem has been solved? Thanks

Image


@renato-umeton commented on GitHub (Mar 4, 2025):

For prototyping, I ended up switching from Ollama to LM Studio
https://lmstudio.ai/

For prod, I'm still on Ollama

🤷


@PaulGilmartin commented on GitHub (Mar 12, 2025):

Hi all,

I am attempting to merge the gguf files from a hugging face DeepSeek V3 download. Using llama-gguf-split as follows:

llama.cpp/llama-gguf-split --merge DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/temp-merged.gguf

I encounter the following error:

gguf_merge: reading metadata DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf ...gguf_init_from_file_impl: invalid magic characters: 'vers', expected 'GGUF'

gguf_merge:  failed to load input GGUF from DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf

The files were downloaded by cloning https://huggingface.co/unsloth/DeepSeek-V3-GGUF. This is the content of the DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L folder:

~/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L # ls
DeepSeek-V3-Q2_K_L-00001-of-00005.gguf
DeepSeek-V3-Q2_K_L-00002-of-00005.gguf
DeepSeek-V3-Q2_K_L-00003-of-00005.gguf
DeepSeek-V3-Q2_K_L-00004-of-00005.gguf
DeepSeek-V3-Q2_K_L-00005-of-00005.gguf

Does anyone know what's going wrong here/ how to fix this? Thanks in advance!


@MrEdigital commented on GitHub (Mar 27, 2025):

More and more models are going the sharded route, exclusively. This needs to be addressed soon.


@mcDandy commented on GitHub (Apr 5, 2025):

This is highly needed. I want to use a custom Gemma 3, but the vision tower is a separate file, and llama.cpp does not help with merging the vision tower into the model.


@1472583610 commented on GitHub (Apr 17, 2025):

I plus-one this. Ollama needs to support this natively. Aside from it being an unreasonably large amount of work to constantly download and manually merge models, there seem to be issues when running the merged models on multiple GPUs.

We have a 2x A6000 AI server and it doesn't load merged models larger than what fits on a single card (48 GB).

Support for sharded models is becoming a must.


@mNandhu commented on GitHub (May 21, 2025):

+1 - There's no mention of the incompatibility in the Modelfile docs either.


@lknight commented on GitHub (May 28, 2025):

+1. It's almost impossible to do anything with DeepSeek (a multi-file model) in Ollama on multiple A6000s.


@LastMinuteStudio commented on GitHub (Jun 8, 2025):

Relatively new to LLMs so pardon my ignorance. I've tried merging and running the split GGUFs especially on the latest R1 from unsloth https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-IQ1_S.

The way I run it is through their recommended command
./llama-server --port 10000 --ctx-size 1024 --n-gpu-layers 40 --model a:/Deepseek-R1-0528-UD-IQ1_S-Merged.gguf
./llama-server --port 10000 --ctx-size 1024 --n-gpu-layers 40 --model a:/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf

They work both merged and split in llama.cpp using llama-server.exe.
I've then tried creating a Modelfile for the merged gguf in ollama

FROM a:\Deepseek-R1-0528-UD-IQ1_S-Merged.gguf
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<|User|>{{ .Content }}
{{- else if eq .Role "assistant" }}<|Assistant|>
  {{- if and $.IsThinkSet (and $last .Thinking) -}}
<think>
{{ .Thinking }}
</think>
{{- end }}{{ .Content }}{{- if not $last }}<|end▁of▁sentence|>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<|Assistant|>
{{- if and $.IsThinkSet (not $.Think) -}}
<think>

</think>

{{ end }}
{{- end -}}
{{- end }}"""
PARAMETER min_p 0.01
PARAMETER repeat_penalty 1
PARAMETER top_p 0.95
PARAMETER num_predict 16384
PARAMETER num_ctx 1024
PARAMETER num_gpu 40
PARAMETER stop <|begin▁of▁sentence|>
PARAMETER stop <|end▁of▁sentence|>
PARAMETER stop <|User|>
PARAMETER stop <|Assistant|>
PARAMETER temperature 0.6

I create the model using ollama create DeepseekR1 -f Modelfile
However when I run it through ollama, I keep getting an error telling me that the memory required is insufficient even though llama.cpp has no issues with loading it slowly.
Error: model requires more system memory (164.9 GiB) than is available (94.4 GiB)
Is this due to the gguf being split or some other reason? The merged gguf works fine in llama.cpp though

edit: I've increased my swap size just to get past the memory error but now I get the following
llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer


@MrEdigital commented on GitHub (Jul 22, 2025):

This doesn't appear to be given the weight it deserves. It's now been more than a year since this was raised.

<!-- gh-comment-id:3100810166 --> @MrEdigital commented on GitHub (Jul 22, 2025): This doesn't appear to be given the weight it deserves. It's now been more than a year since this was raised.
Author
Owner

@giorgostheo commented on GitHub (Jul 22, 2025):

This needs to be prioritized IMO. Sharded GGUFs are the norm now. No support for them in probably the most used platform for local LLMs is bonkers...

<!-- gh-comment-id:3101924734 --> @giorgostheo commented on GitHub (Jul 22, 2025): This needs to be prioritized IMO. Shared GGUFs are the norm now. No support for them in probably the most used platform for local LLMs is bonkers...
Author
Owner

@kappa8219 commented on GitHub (Jul 23, 2025):

Qwen Coder is coming :) Also sharded...

<!-- gh-comment-id:3108829561 --> @kappa8219 commented on GitHub (Jul 23, 2025): Qwen Coder is coming :) Also sharded...
Author
Owner

@giorgostheo commented on GitHub (Jul 24, 2025):

I think it's fair to say that this should become priority No. 1 for the dev team at this point. Not having sharded GGUF support will soon make Ollama unusable...

<!-- gh-comment-id:3112305874 --> @giorgostheo commented on GitHub (Jul 24, 2025): I think its fair to say that this should become priority No.1 for the dev team at this point. Not having shared GGUF support will soon make ollama unusable...
Author
Owner

@tolysz commented on GitHub (Jul 27, 2025):

If llama.cpp has llama_model_load_from_splits, we are almost halfway there... I wonder what the desired solution is... currently all the models are stored under hashed names; a naive implementation could have some virtual top-level folder with all the files symlinked to the hashed files, with filenames following the splits pattern... otherwise, the model could accept the list of splits and skip the symlinking.
edit:
We just need to provide all the hashed files as a list... the filename itself is irrelevant...

    // Load the model from a file
    // If the file is split into multiple parts, the file name must follow this pattern: <name>-%05d-of-%05d.gguf
    // If the split file name does not follow this pattern, use llama_model_load_from_splits
    LLAMA_API struct llama_model * llama_model_load_from_file(
                             const char * path_model,
              struct llama_model_params   params);

    // Load the model from multiple splits (support custom naming scheme)
    // The paths must be in the correct order
    LLAMA_API struct llama_model * llama_model_load_from_splits(
                             const char ** paths,
                                 size_t    n_paths,
              struct llama_model_params    params);
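
For illustration only, here is a minimal cgo sketch of such a wrapper, modelled on the existing LoadModelFromFile in llama/llama.go. The name LoadModelFromSplits and the omission of the device/progress handling are assumptions made for brevity; Model and ModelParams are the types already used in that file, and C.free assumes stdlib.h is included in the cgo preamble:

    // Hypothetical wrapper around llama_model_load_from_splits (sketch, not Ollama code).
    // paths must already be in shard order (...-00001-of-N, ...-00002-of-N, ...).
    func LoadModelFromSplits(paths []string, params ModelParams) (*Model, error) {
        if len(paths) == 0 {
            return nil, fmt.Errorf("no split paths provided")
        }

        cparams := C.llama_model_default_params()
        cparams.n_gpu_layers = C.int(params.NumGpuLayers)
        cparams.use_mmap = C.bool(params.UseMmap)

        // Build the C string array expected by the const char ** parameter.
        // The Go slice only holds C pointers, so passing its address to C is allowed.
        cpaths := make([]*C.char, len(paths))
        for i, p := range paths {
            cpaths[i] = C.CString(p)
            defer C.free(unsafe.Pointer(cpaths[i]))
        }

        m := Model{c: C.llama_model_load_from_splits(&cpaths[0], C.size_t(len(paths)), cparams)}
        if m.c == nil {
            return nil, fmt.Errorf("unable to load model from %d splits", len(paths))
        }
        return &m, nil
    }

The loading call itself looks straightforward; the harder part is producing that ordered list of paths from Ollama's hashed blob store, which is what the rest of this thread discusses.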
<!-- gh-comment-id:3124427434 --> @tolysz commented on GitHub (Jul 27, 2025): If `llama.cpp` has `llama_model_load_from_splits` ~we are almost half way... I wonder what is the desired solution... currently all the models are stored with hashed names, naive implementation could have some `virtual` to the top level folder and inside all files symlinked to the hashed files, filenames should follow the splits pattern... otherwise, the model could accept the list of splits and skip the symlinking~ edit: We just need to provide all the hashed files as list... the filename itself is irrelevant... ``` // Load the model from a file // If the file is split into multiple parts, the file name must follow this pattern: <name>-%05d-of-%05d.gguf // If the split file name does not follow this pattern, use llama_model_load_from_splits LLAMA_API struct llama_model * llama_model_load_from_file( const char * path_model, struct llama_model_params params); // Load the model from multiple splits (support custom naming scheme) // The paths must be in the correct order LLAMA_API struct llama_model * llama_model_load_from_splits( const char ** paths, size_t n_paths, struct llama_model_params params); ```
Author
Owner

@xNefas commented on GitHub (Aug 10, 2025):

Just gonna add to the "noise" and say this should be a priority, it's really rough being unable to load sharded GGUFs with Ollama.

<!-- gh-comment-id:3172281301 --> @xNefas commented on GitHub (Aug 10, 2025): Just gonna add to the "noise" and say this should be a priority, it's really rough being unable to load sharded GGUFs with Ollama.
Author
Owner

@Likkkez commented on GitHub (Aug 16, 2025):

Is it really this hard to just add two files together automatically? Is this one of those delusional ideological stances or what?

<!-- gh-comment-id:3193883350 --> @Likkkez commented on GitHub (Aug 16, 2025): Is it really this hard to just add two files together automatically? Is this one of those delusional ideological stances or what?
Author
Owner

@giorgostheo commented on GitHub (Aug 16, 2025):

Honestly, at this point I'm not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for Ollama Turbo, it seems this will be another one of those open-source projects that remain open source mostly for show... I really hope I'm wrong, but tbh I'm already transitioning to llama.cpp because of this. I suggest that others do the same.

<!-- gh-comment-id:3193947834 --> @giorgostheo commented on GitHub (Aug 16, 2025): Honestly at this point Im not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for ollama turbo, it seems that this will be another one of those os projects that remains os mostly for show... I really hope Im wrong but tbh Im already transitioning to llama-cpp cause of this. I suggest that others do the same.
Author
Owner

@OdinVex commented on GitHub (Aug 21, 2025):

Honestly at this point Im not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for ollama turbo, it seems that this will be another one of those os projects that remains os mostly for show... I really hope Im wrong but tbh Im already transitioning to llama-cpp cause of this. I suggest that others do the same.

? Turbo? What is that, some upsell or something? Is it time to fork Ollama?

<!-- gh-comment-id:3208625092 --> @OdinVex commented on GitHub (Aug 21, 2025): > Honestly at this point Im not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for ollama turbo, it seems that this will be another one of those os projects that remains os mostly for show... I really hope Im wrong but tbh Im already transitioning to llama-cpp cause of this. I suggest that others do the same. ? Turbo? What is that, some upsell or something? Is it time to fork Ollama?
Author
Owner

@kappa8219 commented on GitHub (Sep 1, 2025):

"The Little Engine That Could" (c)

<!-- gh-comment-id:3241299150 --> @kappa8219 commented on GitHub (Sep 1, 2025): "The Little Engine That Could" (c)
Author
Owner

@bokkob556644-coder commented on GitHub (Oct 22, 2025):

ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M

https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF

Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

<!-- gh-comment-id:3430143017 --> @bokkob556644-coder commented on GitHub (Oct 22, 2025): ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}
Author
Owner

@kappa8219 commented on GitHub (Oct 27, 2025):

ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M

https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF

Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

Welcome to the club

<!-- gh-comment-id:3450157982 --> @kappa8219 commented on GitHub (Oct 27, 2025): > ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M > > https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: [https://github.com/ollama/ollama/issues/5245"}](https://github.com/ollama/ollama/issues/5245%22%7D) Welcome to the club
Author
Owner

@SvenMeyer commented on GitHub (Nov 8, 2025):

$ ollama pull hf.co/unsloth/MiniMax-M2-GGUF:Q3_K_XL
pulling manifest 
Error: pull model manifest: 400: {"error":"The specified tag is a sharded GGUF. Ollama does not support this yet. Please use another tag or \"latest\". Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

Nobody looking into this?

<!-- gh-comment-id:3506517656 --> @SvenMeyer commented on GitHub (Nov 8, 2025): ```bash $ ollama pull hf.co/unsloth/MiniMax-M2-GGUF:Q3_K_XL pulling manifest Error: pull model manifest: 400: {"error":"The specified tag is a sharded GGUF. Ollama does not support this yet. Please use another tag or \"latest\". Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} ``` nobody looking into this ?
Author
Owner

@slenderq commented on GitHub (Nov 9, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

It would be great to run this model in ollama!

<!-- gh-comment-id:3507790717 --> @slenderq commented on GitHub (Nov 9, 2025): ``` ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S pulling manifest Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} ``` It would be great to run this model in ollama!
Author
Owner

@cvrunmin commented on GitHub (Nov 13, 2025):

Not a Golang user, but from my glance at the code there is only one place that actually calls the model load function via llama.cpp:
8a75d8b015/llama/llama.go (L259-L309)
At L303, C.llama_model_load_from_file is called to load the model; its implementation is here:
8a75d8b015/llama/llama.cpp/src/llama.cpp (L304-L325)
Yes, there is a function named llama_model_load_from_splits that should be able to load split files!
This function takes a list of split file paths as its parameter, so we have to know the correct order of the model's split files (assuming the code that actually loads them doesn't guess how they are split). We might need to add fields to the Modelfile to provide this information. This information would also have to be provided by HF once Ollama is ready for multi-file GGUF (afaik they only provide a list of hash-named blobs with a blob type).

<!-- gh-comment-id:3525234232 --> @cvrunmin commented on GitHub (Nov 13, 2025): not a golang user tho, from my glance on the code, there is only one place which actually call model load function using llama.cpp here: https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.go#L259-L309 at L303, `C.llama_model_load_from_file` is called to load model, which the implementation is here: https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.cpp/src/llama.cpp#L304-L325 Yes, there is a function named `llama_model_load_from_splits` that should be able to load split files! This function requires a list of split file paths as the parameter. Thus we have to know the correct order of the model split files (assume the function that actually load them don't guess how they are split). We might need to add fields in Modelfile to provide this information. This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)
Author
Owner

@tolysz commented on GitHub (Nov 13, 2025):

The challenge is in the config files: since the files are renamed to some hash, the config file needs to support storing a list of filenames... i.e. the file reference is not a single string but a list of them.

<!-- gh-comment-id:3526797862 --> @tolysz commented on GitHub (Nov 13, 2025): The challenge is in the config files, as the files are renamed to some-hash... the config file needs to support storing a list of the filenames... like the type of file is not a string but a list of them.
Author
Owner

@FearL0rd commented on GitHub (Nov 14, 2025):

Looks like Ollama has more focus on their cloud instead of working on this

<!-- gh-comment-id:3533908787 --> @FearL0rd commented on GitHub (Nov 14, 2025): Looks like Ollama has more focus on their cloud instead of working on this
Author
Owner

@OdinVex commented on GitHub (Nov 14, 2025):

So it seems the only thing holding it back is a config-representation of split-files and a simple 'if split call load-splits' instead?
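
If that reading is right, the runtime side could indeed be little more than a branch on the number of shard paths. A purely illustrative sketch, reusing the hypothetical LoadModelFromSplits wrapper sketched earlier in this thread alongside the existing LoadModelFromFile from llama/llama.go:

    // Illustrative only: choose between the existing single-file loader and the
    // hypothetical splits wrapper based on how many shard paths the config lists.
    func loadModel(shardPaths []string, params ModelParams) (*Model, error) {
        if len(shardPaths) == 0 {
            return nil, fmt.Errorf("no model files listed")
        }
        if len(shardPaths) > 1 {
            return LoadModelFromSplits(shardPaths, params) // hypothetical wrapper
        }
        return LoadModelFromFile(shardPaths[0], params) // existing function in llama/llama.go
    }

The remaining work is in storing and retrieving that ordered shard list, which the comments below go into.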

<!-- gh-comment-id:3533939058 --> @OdinVex commented on GitHub (Nov 14, 2025): So it seems the only thing holding it back is a config-representation of split-files and a simple 'if split call load-splits' instead?
Author
Owner

@rdeforest commented on GitHub (Nov 14, 2025):

Looks like Ollama has more focus on their cloud instead of working on this

One of the many great things about open-source projects is that everyone gets to work on whatever they want to work on. I bet if you put together a quality PR to address this issue, the team would consider merging it. Or if you don't want to wait you could just maintain your own fork.

If you don't want to help, that's fine too. Just don't complain about the priorities of volunteers please?

<!-- gh-comment-id:3533942550 --> @rdeforest commented on GitHub (Nov 14, 2025): > Looks like Ollama has more focus on their cloud instead of working on this One of the many great things about open-source projects is that everyone gets to work on whatever they want to work on. I bet if you put together a quality PR to address this issue, the team would consider merging it. Or if you don't want to wait you could just maintain your own fork. If you don't want to help, that's fine too. Just don't complain about the priorities of volunteers please?
Author
Owner

@OdinVex commented on GitHub (Nov 14, 2025):

I think an extension of Ollama's manifests to describe split GGUFs (and their order) is necessary first. Perhaps an Ollama developer could chime in on that?

<!-- gh-comment-id:3534241125 --> @OdinVex commented on GitHub (Nov 14, 2025): I think extension of Ollama manifests to describe split GGUFs (and their order) is necessary, first. Perhaps any Ollama developer could chime in for that?
Author
Owner

@Mikec78660 commented on GitHub (Nov 14, 2025):

EDIT: Leaving this in case anyone else has this problem. It seems from the bash script that was posted that you can then run:
ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model
And voilà, the model shows up in Ollama; no need to use the import in Open WebUI, which doesn't seem to work.
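
(For anyone else landing here: the file passed to -f is just an ordinary Modelfile, so, presumably (judging by that bash script), a minimal one for a merged single-file GGUF is nothing more than a FROM line pointing at the merged file, e.g.:

    FROM /mnt/AI/GLM-4.5-Air-Q4_0.gguf

after which ollama create glm-4.5-air -f GLM-4.5-Air-Q4_0.gguf.model registers it under a normal model name. Paths and names here are illustrative.)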


I use the llama.cpp method to combine my model:

GLM-4.5-Air-Q4_0-00001-of-00002.gguf
GLM-4.5-Air-Q4_0-00002-of-00002.gguf

And it seemed to work:

./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors.

But when trying to import "GLM-4.5-Air-Q4_0.gguf" into Ollama, after a minute or so I get an error saying "error parsing the body". Any idea what I am doing wrong?

<!-- gh-comment-id:3534618679 --> @Mikec78660 commented on GitHub (Nov 14, 2025): EDIT: Leaving this in case anyone else has this problem. Seems from the bash script that was posted you can then run: `ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model` And voilà, the model is showing up in ollama, no need to use the import in openweui which doesn't seem to work. ******* I use the llama.cpp method to combine my model: ``` GLM-4.5-Air-Q4_0-00001-of-00002.gguf GLM-4.5-Air-Q4_0-00002-of-00002.gguf ``` And it seemed to work: ``` ./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors. ``` But trying to import "GLM-4.5-Air-Q4_0.gguf" into ollama after a minute or so I get an error saying error parsing the body. Any idea what I am doing wrong?
Author
Owner

@OdinVex commented on GitHub (Nov 14, 2025):

EDIT: Leaving this in case anyone else has this problem. Seems from the bash script that was posted you can then run: ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model And voilà, the model is showing up in ollama, no need to use the import in openweui which doesn't seem to work.

I use the llama.cpp method to combine my model:

GLM-4.5-Air-Q4_0-00001-of-00002.gguf
GLM-4.5-Air-Q4_0-00002-of-00002.gguf

And it seemed to work:

./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors.

But trying to import "GLM-4.5-Air-Q4_0.gguf" into ollama after a minute or so I get an error saying error parsing the body. Any idea what I am doing wrong?

This issue is about importing sharded GGUF files, not about llama.cpp. But on a side note, I've never had llama.cpp produce a file that didn't end up as gibberish or corrupt output.

<!-- gh-comment-id:3534746251 --> @OdinVex commented on GitHub (Nov 14, 2025): > EDIT: Leaving this in case anyone else has this problem. Seems from the bash script that was posted you can then run: `ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model` And voilà, the model is showing up in ollama, no need to use the import in openweui which doesn't seem to work. > > I use the llama.cpp method to combine my model: > > ``` > GLM-4.5-Air-Q4_0-00001-of-00002.gguf > GLM-4.5-Air-Q4_0-00002-of-00002.gguf > ``` > > And it seemed to work: > > ``` > ./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf > gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf > gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done > gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done > gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done > gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done > gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors. > ``` > > But trying to import "GLM-4.5-Air-Q4_0.gguf" into ollama after a minute or so I get an error saying error parsing the body. Any idea what I am doing wrong? This Issue is about importing shard GGUF files, not about llama.cpp. But on a side note, I've never had llama.cpp produce a file that didn't end up in gibberish or corrupt output.
Author
Owner

@shimmyshimmer commented on GitHub (Nov 14, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

It would be great to run this model in ollama!

We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf

Usually we do these non-sharded files for any model under 300B parameters or so. But it is very small and 1.77-bit ish

<!-- gh-comment-id:3534801391 --> @shimmyshimmer commented on GitHub (Nov 14, 2025): > ``` > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > pulling manifest > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > ``` > > It would be great to run this model in ollama! We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf Usually we do these non-sharded files for any model under 300B parameters or so. But it is very small and 1.77-bit ish
Author
Owner

@cvrunmin commented on GitHub (Nov 17, 2025):

not a golang user tho, from my glance on the code, there is only one place which actually call model load function using llama.cpp here:

ollama/llama/llama.go

Lines 259 to 309 in 8a75d8b

    func LoadModelFromFile(modelPath string, params ModelParams) (*Model, error) {
        cparams := C.llama_model_default_params()
        cparams.n_gpu_layers = C.int(params.NumGpuLayers)
        cparams.main_gpu = C.int32_t(params.MainGpu)
        cparams.use_mmap = C.bool(params.UseMmap)
        cparams.vocab_only = C.bool(params.VocabOnly)

        var devices []C.ggml_backend_dev_t
        for _, llamaID := range params.Devices {
            devices = append(devices, C.ggml_backend_dev_get(C.size_t(llamaID)))
        }
        if len(devices) > 0 {
            devices = append(devices, C.ggml_backend_dev_t(C.NULL))
            devicesData := &devices[0]

            var devicesPin runtime.Pinner
            devicesPin.Pin(devicesData)
            defer devicesPin.Unpin()

            cparams.devices = devicesData
        }

        if len(params.TensorSplit) > 0 {
            tensorSplitData := &params.TensorSplit[0]

            var tensorSplitPin runtime.Pinner
            tensorSplitPin.Pin(tensorSplitData)
            defer tensorSplitPin.Unpin()

            cparams.tensor_split = (*C.float)(unsafe.Pointer(tensorSplitData))
        }

        if params.Progress != nil {
            handle := cgo.NewHandle(params.Progress)
            defer handle.Delete()

            var handlePin runtime.Pinner
            handlePin.Pin(&handle)
            defer handlePin.Unpin()

            cparams.progress_callback = C.llama_progress_callback(C.llamaProgressCallback)
            cparams.progress_callback_user_data = unsafe.Pointer(&handle)
        }

        m := Model{c: C.llama_model_load_from_file(C.CString(modelPath), cparams)}
        if m.c == nil {
            return nil, fmt.Errorf("unable to load model: %s", modelPath)
        }

        return &m, nil
    }

at L303, C.llama_model_load_from_file is called to load model, which the implementation is here:
ollama/llama/llama.cpp/src/llama.cpp

Lines 304 to 325 in 8a75d8b

    struct llama_model * llama_model_load_from_file(
            const char * path_model,
            struct llama_model_params params) {
        std::vector<std::string> splits = {};
        return llama_model_load_from_file_impl(path_model, splits, params);
    }

    struct llama_model * llama_model_load_from_splits(
            const char ** paths,
            size_t n_paths,
            struct llama_model_params params) {
        std::vector<std::string> splits;
        if (n_paths == 0) {
            LLAMA_LOG_ERROR("%s: list of splits is empty\n", __func__);
            return nullptr;
        }
        splits.reserve(n_paths);
        for (size_t i = 0; i < n_paths; ++i) {
            splits.push_back(paths[i]);
        }
        return llama_model_load_from_file_impl(splits.front(), splits, params);
    }

Yes, there is a function named llama_model_load_from_splits that should be able to load split files!
This function requires a list of split file paths as the parameter. Thus we have to know the correct order of the model split files (assume the function that actually load them don't guess how they are split). We might need to add fields in Modelfile to provide this information. This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

I only focused on llamarunner and didn't realize that Ollama has its own runner (ollamarunner), which is a different story. That one loads models in ml/backend/ggml/ggml.go, which really does not support multi-file GGUF.
Anyway, modifying the config format to store the file paths of a sharded model in the correct order is still necessary.

<!-- gh-comment-id:3540810485 --> @cvrunmin commented on GitHub (Nov 17, 2025): > not a golang user tho, from my glance on the code, there is only one place which actually call model load function using llama.cpp here: > > [ollama/llama/llama.go](https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.go#L259-L309) > > Lines 259 to 309 in [8a75d8b](/ollama/ollama/commit/8a75d8b0154511d2bafe16f230e9268ee7a511da) > > func LoadModelFromFile(modelPath string, params ModelParams) (*Model, error) { > cparams := C.llama_model_default_params() > cparams.n_gpu_layers = C.int(params.NumGpuLayers) > cparams.main_gpu = C.int32_t(params.MainGpu) > cparams.use_mmap = C.bool(params.UseMmap) > cparams.vocab_only = C.bool(params.VocabOnly) > > var devices []C.ggml_backend_dev_t > for _, llamaID := range params.Devices { > devices = append(devices, C.ggml_backend_dev_get(C.size_t(llamaID))) > } > if len(devices) > 0 { > devices = append(devices, C.ggml_backend_dev_t(C.NULL)) > devicesData := &devices[0] > > var devicesPin runtime.Pinner > devicesPin.Pin(devicesData) > defer devicesPin.Unpin() > > cparams.devices = devicesData > } > > if len(params.TensorSplit) > 0 { > tensorSplitData := &params.TensorSplit[0] > > var tensorSplitPin runtime.Pinner > tensorSplitPin.Pin(tensorSplitData) > defer tensorSplitPin.Unpin() > > cparams.tensor_split = (*C.float)(unsafe.Pointer(tensorSplitData)) > } > > if params.Progress != nil { > handle := cgo.NewHandle(params.Progress) > defer handle.Delete() > > var handlePin runtime.Pinner > handlePin.Pin(&handle) > defer handlePin.Unpin() > > cparams.progress_callback = C.llama_progress_callback(C.llamaProgressCallback) > cparams.progress_callback_user_data = unsafe.Pointer(&handle) > } > > m := Model{c: C.llama_model_load_from_file(C.CString(modelPath), cparams)} > if m.c == nil { > return nil, fmt.Errorf("unable to load model: %s", modelPath) > } > > return &m, nil > } > > at L303, `C.llama_model_load_from_file` is called to load model, which the implementation is here: > [ollama/llama/llama.cpp/src/llama.cpp](https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.cpp/src/llama.cpp#L304-L325) > > Lines 304 to 325 in [8a75d8b](/ollama/ollama/commit/8a75d8b0154511d2bafe16f230e9268ee7a511da) > > struct llama_model * llama_model_load_from_file( > const char * path_model, > struct llama_model_params params) { > std::vector<std::string> splits = {}; > return llama_model_load_from_file_impl(path_model, splits, params); > } > > struct llama_model * llama_model_load_from_splits( > const char ** paths, > size_t n_paths, > struct llama_model_params params) { > std::vector<std::string> splits; > if (n_paths == 0) { > LLAMA_LOG_ERROR("%s: list of splits is empty\n", __func__); > return nullptr; > } > splits.reserve(n_paths); > for (size_t i = 0; i < n_paths; ++i) { > splits.push_back(paths[i]); > } > return llama_model_load_from_file_impl(splits.front(), splits, params); > } > > Yes, there is a function named `llama_model_load_from_splits` that should be able to load split files! > This function requires a list of split file paths as the parameter. Thus we have to know the correct order of the model split files (assume the function that actually load them don't guess how they are split). We might need to add fields in Modelfile to provide this information. 
This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik) I only focused on llamarunner and didn't recognize that ollama has its own runner (ollamarunner) which is a different story. This thing will load models at `ml/backend/ggml/ggml.go` which really not supported multi-file gguf. Anyway the modification of config format to support file path of sharded model in correct order is still necessary.
Author
Owner

@giorgostheo commented on GitHub (Nov 17, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

It would be great to run this model in ollama!

We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf

Usually we do these non-sharded files for any model under 300B parameters or so. But it is very small and 1.77-bit ish

Hi Michael. Since the Ollama team seems to really not care about the sharded GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM 4.6, for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us Ollama users it's the only way to go for now.

Thanks for all your work.

<!-- gh-comment-id:3541048826 --> @giorgostheo commented on GitHub (Nov 17, 2025): > > ``` > > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > > pulling manifest > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > > ``` > > > > > > > > > > > > > > > > > > > > > > > > It would be great to run this model in ollama! > > We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf > > Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now. Thanks for all your work.
Author
Owner

@OdinVex commented on GitHub (Nov 17, 2025):

... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of splitNumber-totalSplits. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine.
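
As a rough illustration of leaning on that naming convention, shard order can be recovered from the file names alone with nothing but the Go standard library (sketch only; as the next comment points out, this only helps while the files still carry their original names, which Ollama's hashed blobs do not):

    // Sort GGUF shard paths by the -NNNNN-of-NNNNN suffix in their file names.
    var shardSuffix = regexp.MustCompile(`-(\d{5})-of-(\d{5})\.gguf$`)

    func orderShards(paths []string) ([]string, error) {
        type shard struct {
            path string
            no   int
        }
        shards := make([]shard, 0, len(paths))
        for _, p := range paths {
            m := shardSuffix.FindStringSubmatch(filepath.Base(p))
            if m == nil {
                return nil, fmt.Errorf("%s does not follow the <name>-%%05d-of-%%05d.gguf pattern", p)
            }
            no, err := strconv.Atoi(m[1])
            if err != nil {
                return nil, err
            }
            shards = append(shards, shard{path: p, no: no})
        }
        sort.Slice(shards, func(i, j int) bool { return shards[i].no < shards[j].no })
        ordered := make([]string, len(shards))
        for i, s := range shards {
            ordered[i] = s.path
        }
        return ordered, nil
    }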

<!-- gh-comment-id:3542224235 --> @OdinVex commented on GitHub (Nov 17, 2025): > ... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik) To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of `splitNumber-totalSplits`. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine.
Author
Owner

@cvrunmin commented on GitHub (Nov 18, 2025):

... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of splitNumber-totalSplits. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine.

If the multi-file GGUF model is created by the user with ollama create -f Modelfile and the split GGUFs are nicely named like xxxxxx-00001-of-00003.gguf, then that matches your case. However, when the model is pulled from the Internet, we only have the hash of each file.
For example, this is the manifest of gpt-oss hosted on ollama registry (https://registry.ollama.ai/v2/library/gpt-oss/manifests/latest):

{
  "schemaVersion":2,
  "mediaType":"application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType":"application/vnd.docker.container.image.v1+json",
    "digest":"sha256:776beb3adb235076157cfea408b8ea2a2d25eae99d7f5da997f607f6b69fa0fa",
    "size":489
  },
  "layers":[
    {
      "mediaType":"application/vnd.ollama.image.model",
      "digest":"sha256:e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb",
      "size":13793422144
    },
    {
      "mediaType":"application/vnd.ollama.image.template",
      "digest":"sha256:fa6710a93d78da62641e192361344be7a8c0a1c3737f139cf89f20ce1626b99c",
      "size":7240
    },
    {
      "mediaType":"application/vnd.ollama.image.license",
      "digest":"sha256:f60356777647e927149cbd4c0ec1314a90caba9400ad205ddc4ce47ed001c2d6",
      "size":11353
    },
    {
      "mediaType":"application/vnd.ollama.image.params",
      "digest":"sha256:d8ba2f9a17b3bbdeb5690efaa409b3fcb0b56296a777c7a69c78aa33bbddf182",
      "size":18
    }
  ]
}

In the GGUF metadata of a split GGUF we have the split file information split.no, split.tensors.count and split.count. This is where llama.cpp checks whether the split files are provided in order:
584e2d646f/llama/llama.cpp/src/llama-model-loader.cpp (L526-L573)
In the worst-case scenario, we can cache the ordering from the metadata when the model is first created or pulled; then the changes to the config spec could be minimal.
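
A minimal sketch of what that cached ordering could look like, assuming a hypothetical extension of Ollama's config blob (all field and type names below are invented for illustration; the split.no / split.count values would be read from each shard's GGUF metadata once, at create or pull time):

    // Hypothetical config extension: remember each model-layer blob's split.no
    // so the shards can be handed to llama_model_load_from_splits in order.
    type shardRef struct {
        Digest  string `json:"digest"`   // e.g. "sha256:<hex>" of the blob holding this shard
        SplitNo int    `json:"split_no"` // GGUF key split.no, cached at create/pull time
    }

    type modelConfig struct {
        // ...existing config fields...
        SplitCount int        `json:"split_count,omitempty"` // GGUF key split.count
        Shards     []shardRef `json:"shards,omitempty"`
    }

    // shardPaths returns the blob paths in load order, assuming the usual
    // "sha256-<hex>" naming of files in the blobs directory.
    func (c modelConfig) shardPaths(blobDir string) []string {
        shards := append([]shardRef(nil), c.Shards...)
        sort.Slice(shards, func(i, j int) bool { return shards[i].SplitNo < shards[j].SplitNo })
        paths := make([]string, len(shards))
        for i, s := range shards {
            paths[i] = filepath.Join(blobDir, strings.ReplaceAll(s.Digest, ":", "-"))
        }
        return paths
    }

Whether something like this lives in the config blob or as extra metadata on the manifest layers is a design decision for the maintainers; the point is only that the ordering information is tiny compared to the shards themselves.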

<!-- gh-comment-id:3544861409 --> @cvrunmin commented on GitHub (Nov 18, 2025): > > ... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik) > > To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of `splitNumber-totalSplits`. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine. If the multi-file GGUF model is created by user using `ollama create -f Modelfile` and the filenames of split ggufs are nicely named as `xxxxxx-00001-of-00003.gguf` then this is what your case. However, when the model is pulled from Internet, we only have the hash of the file. For example, this is the manifest of gpt-oss hosted on ollama registry (`https://registry.ollama.ai/v2/library/gpt-oss/manifests/latest`): ```json { "schemaVersion":2, "mediaType":"application/vnd.docker.distribution.manifest.v2+json", "config": { "mediaType":"application/vnd.docker.container.image.v1+json", "digest":"sha256:776beb3adb235076157cfea408b8ea2a2d25eae99d7f5da997f607f6b69fa0fa", "size":489 }, "layers":[ { "mediaType":"application/vnd.ollama.image.model", "digest":"sha256:e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb", "size":13793422144 }, { "mediaType":"application/vnd.ollama.image.template", "digest":"sha256:fa6710a93d78da62641e192361344be7a8c0a1c3737f139cf89f20ce1626b99c", "size":7240 }, { "mediaType":"application/vnd.ollama.image.license", "digest":"sha256:f60356777647e927149cbd4c0ec1314a90caba9400ad205ddc4ce47ed001c2d6", "size":11353 }, { "mediaType":"application/vnd.ollama.image.params", "digest":"sha256:d8ba2f9a17b3bbdeb5690efaa409b3fcb0b56296a777c7a69c78aa33bbddf182", "size":18 } ] } ``` In GGUF metadata of split GGUF, we have the split file information `split.no`, `split.tensors.count` and `split.count`. This is where llama.cpp check if the split file is provided in order: https://github.com/ollama/ollama/blob/584e2d646fb4d2f1643b4da81a096d01114f5b2b/llama/llama.cpp/src/llama-model-loader.cpp#L526-L573 In worst case scenario, we can cache the ordering from the metadata when the model is first created or pulled, then changes towards the config spec could be minimal.
Author
Owner

@shimmyshimmer commented on GitHub (Nov 21, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

It would be great to run this model in ollama!

We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf
Usually we do these non-sharded files for any model under 300B parameters or so. But it is very small and 1.77-bit ish

Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now.

Thanks for all your work.

Even though this is possible, it might not be the best idea, because if one file breaks or the internet gets cut off you'll need to redownload the hundreds of GB again. It might be fine if you have good internet, but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which.

<!-- gh-comment-id:3562150680 --> @shimmyshimmer commented on GitHub (Nov 21, 2025): > > > ``` > > > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > > > pulling manifest > > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It would be great to run this model in ollama! > > > > > > We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf > > Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish > > Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now. > > Thanks for all your work. Even though this is possible it might not be the best idea because if one file breaks or the internet gets cut off, you'll need to redownload the hundreds of GB again. It might be fine if you have good internet but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which
Author
Owner

@eurekin commented on GitHub (Dec 9, 2025):

This still a thing?

Happy birthday to the issue I guess

<!-- gh-comment-id:3634450547 --> @eurekin commented on GitHub (Dec 9, 2025): This still a thing? Happy birthday to the issue I guess
Author
Owner

@FearL0rd commented on GitHub (Dec 9, 2025):

This still a thing?

Happy birthday to the issue I guess

Now all the focus is on cloud

<!-- gh-comment-id:3634455565 --> @FearL0rd commented on GitHub (Dec 9, 2025): > This still a thing? > > Happy birthday to the issue I guess Now all the focus is on cloud
Author
Owner

@johnml1135 commented on GitHub (Dec 17, 2025):

I had a similar issue so I spun up my own tooling to adapt poor-fitting GGUF models into ollama by reworking the top layers - https://github.com/johnml1135/ollama-copilot-fixer.

<!-- gh-comment-id:3666498931 --> @johnml1135 commented on GitHub (Dec 17, 2025): I had a similar issue so I spun up my own tooling to adapt poor-fitting GGUF models into ollama by reworking the top layers - https://github.com/johnml1135/ollama-copilot-fixer.
Author
Owner

@giorgostheo commented on GitHub (Dec 23, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

It would be great to run this model in ollama!

We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf
Usually we do these non-sharded files for any model under 300B parameters or so. But it is very small and 1.77-bit ish

Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now.

Thanks for all your work.

Even though this is possible it might not be the best idea because if one file breaks or the internet gets cut off, you'll need to redownload the hundreds of GB again. It might be fine if you have good internet but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which

Hey,

With GLM 4.7 out, would it be possible for you to upload a single GGUF for the main config (Q4_K_M or whatever it is)? You could add some sort of id like "mono" or "single" to make sure users are not confused. I know it's not pretty, but it's an easy way to work around the complete lack of compatibility with Ollama and let a lot more people use the newest and best models!

Keep up the awesome work.

<!-- gh-comment-id:3687645425 --> @giorgostheo commented on GitHub (Dec 23, 2025): > > > > ``` > > > > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > > > > pulling manifest > > > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It would be great to run this model in ollama! > > > > > > > > > We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf > > > Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish > > > > Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now. > > > > Thanks for all your work. > > Even though this is possible it might not be the best idea because if one file breaks or the internet gets cut off, you'll need to redownload the hundreds of GB again. It might be fine if you have good internet but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which Hey, With GLM4.7 out, would it be possible that you upload a single gguf for the main config (q4_K_M or whatever it is)? You can add some sort of id like "mono" or "single" to make sure that users are not confused. I know it's not pretty, but it's an easy way to solve the complete lack of compatibility with ollama and allow tons more to use the newest and best models! Keep up the awesome work.
Author
Owner

@scorpion7slayer commented on GitHub (Jan 30, 2026):

I have this problem with Kimi K2.5. Will this be added in the future?

<!-- gh-comment-id:3824245062 --> @scorpion7slayer commented on GitHub (Jan 30, 2026): I have the problem with kimi k2.5 will this be added in the future?
Author
Owner

@boomam commented on GitHub (Feb 20, 2026):

It's amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p

<!-- gh-comment-id:3935070724 --> @boomam commented on GitHub (Feb 20, 2026): Its amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p
Author
Owner

@cvrunmin commented on GitHub (Feb 21, 2026):

Its amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p

While such error messages are more likely produced on the Hugging Face side and not by Ollama itself (the same error message can be triggered by trying to access the model's manifest file in a browser), it is still very amusing that more issues keep being marked as duplicates of this issue, meaning that some contributors know this issue exists, yet a pull request that claims to solve it is now about four months old with zero comments from any contributor. Neither "this solution looks good!" nor "this solution is not good".

<!-- gh-comment-id:3938350686 --> @cvrunmin commented on GitHub (Feb 21, 2026): > Its amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p While such error messages is more likely provided from Huggingface side and not from ollama itself (same error message can be triggered by trying to access the manifest file of the model in browsers), it is still very amusing that more issues have been marked duplicate of this issue, meaning that some contributors know that this issue exists, but an pull request that claims that can solve this issues is now about four months old with zero comments from any contributors. Neither "this solution looks good!", nor "this solution is not good".
Author
Owner

@OdinVex commented on GitHub (Feb 21, 2026):

Anyone know of an alternative to Ollama?

<!-- gh-comment-id:3938355550 --> @OdinVex commented on GitHub (Feb 21, 2026): Anyone know of an alternative to Ollama?
Author
Owner

@SvenMeyer commented on GitHub (Feb 22, 2026):

@OdinVex I switched to LM Studio, which does not have this problem and has a nice GUI as well.
Actually, I would not be surprised if you found it much better in every respect. LM Studio also continues to provide a solid basis for running AI models locally, which is the whole point of Ollama/LM Studio, while Ollama has now diverted into becoming just a proxy to online AI models.

<!-- gh-comment-id:3939810272 --> @SvenMeyer commented on GitHub (Feb 22, 2026): @OdinVex I switched to LMstudio which does not have this problem and has a nice GUI as well. Actually, I would not be surprised if you would find it much better in every respect. Also LMstudio contines to provide a solid basis to run AI models locally which is the whole point of ollama/LMstusio, while ollama now diverted to become just a proxy to online AI models.
Author
Owner

@OdinVex commented on GitHub (Feb 22, 2026):

@OdinVex I switched to LMstudio which does not have this problem and has a nice GUI as well. Actually, I would not be surprised if you would find it much better in every respect. Also LMstudio contines to provide a solid basis to run AI models locally which is the whole point of ollama/LMstusio, while ollama now diverted to become just a proxy to online AI models.

Doesn't appear to be an alternative at all, though. Ollama's use here, at the moment, is containerized, for network-based interactions.

<!-- gh-comment-id:3939813182 --> @OdinVex commented on GitHub (Feb 22, 2026): > [@OdinVex](https://github.com/OdinVex) I switched to LMstudio which does not have this problem and has a nice GUI as well. Actually, I would not be surprised if you would find it much better in every respect. Also LMstudio contines to provide a solid basis to run AI models locally which is the whole point of ollama/LMstusio, while ollama now diverted to become just a proxy to online AI models. Doesn't appear at all to be an alternative, though. Ollama's use at the moment is container-supported for network-based interactions.
Author
Owner

@elkay commented on GitHub (Mar 2, 2026):

How is this still an issue 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you download them manually, but it's really unclear how you would then manually add that file into Ollama. It makes no sense why the Ollama team is dragging their feet on just supporting the combine step in the internal download process itself.

<!-- gh-comment-id:3981389833 --> @elkay commented on GitHub (Mar 2, 2026): How is this an issue still 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you manually download them, but it's really unclear how you would then manually add that file into ollama manually. Makes no sense why the Ollama team is dragging their feet on just supporting the combine in the internal download process itself.
Author
Owner

@FearL0rd commented on GitHub (Mar 3, 2026):

How is this an issue still 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you manually download them, but it's really unclear how you would then manually add that file into ollama manually. Makes no sense why the Ollama team is dragging their feet on just supporting the combine in the internal download process itself.

looks like the focus today is Ollama Cloud

<!-- gh-comment-id:3993061190 --> @FearL0rd commented on GitHub (Mar 3, 2026): > How is this an issue still 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you manually download them, but it's really unclear how you would then manually add that file into ollama manually. Makes no sense why the Ollama team is dragging their feet on just supporting the combine in the internal download process itself. looks like the focus today is Ollama Cloud
Author
Owner

@alexanderjacuna commented on GitHub (Mar 10, 2026):

Running into this issue as well with: hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q8_K_XL

This bug has been open since June of 2024, but let's put in an error message that references this issue, with no traction after 2 years.

<!-- gh-comment-id:4031043708 --> @alexanderjacuna commented on GitHub (Mar 10, 2026): Running into this issue as well with: hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q8_K_XL This bug has been open since June of 2024, but lets put in a error message that references this issue with no traction after 2 years.
Author
Owner

@SvenMeyer commented on GitHub (Mar 10, 2026):

I found a solution and it is pretty easy; it also adds a lot of other features and usability at the same time: use LM Studio.

<!-- gh-comment-id:4031157387 --> @SvenMeyer commented on GitHub (Mar 10, 2026): I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.
Author
Owner

@OdinVex commented on GitHub (Mar 10, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.

<!-- gh-comment-id:4031607686 --> @OdinVex commented on GitHub (Mar 10, 2026): > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.
Author
Owner

@alexanderjacuna commented on GitHub (Mar 10, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

My setup doesn't allow for this unfortunately.

<!-- gh-comment-id:4031738203 --> @alexanderjacuna commented on GitHub (Mar 10, 2026): > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. My setup doesn't allow for this unfortunately.
Author
Owner

@SvenMeyer commented on GitHub (Mar 11, 2026):

@OdinVex @alexanderjacuna what software is so tightly coupled to Ollama that you cannot replace it with another inference service? In the end it should be just an IP and a port, and even those you could set the same way.
Also, if you prefer a CLI and do not need the GUI, just use llama.cpp

<!-- gh-comment-id:4036312272 --> @SvenMeyer commented on GitHub (Mar 11, 2026): @OdinVex @alexanderjacuna what software is so tightly coupled to ollama that you can not replace it with another inference service? At the end it should be just an IP and port and even that you could set the same way. Also, if you prefer CLI and do not need the GUI, just use llama.cpp
Author
Owner

@OdinVex commented on GitHub (Mar 11, 2026):

@OdinVex @alexanderjacuna what software is so tightly coupled to ollama that you can not replace it with another inference service? At the end it should be just an IP and port and even that you could set the same way. Also, if you prefer CLI and do not need the GUI, just use llama.cpp

Several, but most commonly Open-WebUI.

<!-- gh-comment-id:4036321873 --> @OdinVex commented on GitHub (Mar 11, 2026): > [@OdinVex](https://github.com/OdinVex) [@alexanderjacuna](https://github.com/alexanderjacuna) what software is so tightly coupled to ollama that you can not replace it with another inference service? At the end it should be just an IP and port and even that you could set the same way. Also, if you prefer CLI and do not need the GUI, just use llama.cpp Several, but most commonly Open-WebUI.
Author
Owner

@FearL0rd commented on GitHub (Mar 14, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.

I have a drop-in solution.
I've built a project called Ovllm. It's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm
Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.

<!-- gh-comment-id:4060816091 --> @FearL0rd commented on GitHub (Mar 14, 2026): > > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. > > Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama. I have a drop-in Solution. I've built a solution called Ovllm. it's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.
Author
Owner

@OdinVex commented on GitHub (Mar 14, 2026):

I found a solution and it is pretty easy; it also adds a lot of other features and usability at the same time: use LM Studio.

Not a viable solution for those needing a drop-in replacement for software that specifically depends upon Ollama.

I have a drop-in solution. I've built a solution called Ovllm. It's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face; just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.

If it doesn't have complete feature parity and speak the Ollama API so other software can integrate with it, then it's not a drop-in solution. Considering the README has enough spelling/grammar issues, I'm gravely concerned it's AI-generated, or at the very least unpolished. Good luck with your project, but it's not a drop-in solution.

Edit: Considering Ollama uses llama.cpp, and llama.cpp supports shards, I'd wager it'd be better to just PR it (at least for now, until Ollama is forked by someone who cares about shard support and more).

Author
Owner

@FearL0rd commented on GitHub (Mar 14, 2026):

I found a solution and it is pretty easy; it also adds a lot of other features and usability at the same time: use LM Studio.

Not a viable solution for those needing a drop-in replacement for software that specifically depends upon Ollama.

I have a drop-in solution. I've built a solution called Ovllm. It's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face; just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.

If it doesn't have complete feature parity and speak the Ollama API so other software can integrate with it, then it's not a drop-in solution. Considering the README has enough spelling/grammar issues, I'm gravely concerned it's AI-generated, or at the very least unpolished. Good luck with your project, but it's not a drop-in solution.

Edit: Considering Ollama uses llama.cpp, and llama.cpp supports shards, I'd wager it'd be better to just PR it (at least for now, until Ollama is forked by someone who cares about shard support and more).

Thanks. This is the first release, and it will become more mature over time. It works for my needs with Open WebUI and custom apps. It also merges the .gguf shards (it works with safetensors as well; just pass the HF location, e.g. google/gemma-7b-it).

Author
Owner

@Mikec78660 commented on GitHub (Mar 20, 2026):

llama.cpp in router mode should work exactly like Ollama now, and it can use multi-part GGUF files.

Author
Owner

@OdinVex commented on GitHub (Mar 20, 2026):

llama.cpp in router mode should work exactly like Ollama now, and it can use multi-part GGUF files.

So software like Open-WebUI can (without any changes except the IP address and port) speak to it as if it were Ollama, even with the Ollama-specific code? And there's an official container for it as well? Not seeing it at all, so... Edit: See my later post about how this went (failure; it does not at all work like Ollama).

Author
Owner

@Mikec78660 commented on GitHub (Mar 23, 2026):

@OdinVex yes.
A very minimal implementation is:
`llama-server --host [0.0.0.0, or hostname] --port 8080 --models-dir /mnt/AI`
If you do this and create a connection in Open WebUI to `[ip or dns name]:8080/v1`, it will give you any model in the /mnt/AI directory as an option in Open WebUI.

Even better is using a config.ini file where you can set the settings for each model:
`llama-server --host 0.0.0.0 --port 8080 --models-dir /mnt/AI --models-preset config.ini`
This will allow you to set a custom context size, KV cache settings, etc.
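
For anyone who prefers to wire this up by configuration rather than clicking through the UI, here is a minimal sketch of the Open WebUI side. It assumes Open WebUI's documented `OPENAI_API_BASE_URL` / `OPENAI_API_KEY` environment variables and its standard container image; `192.168.1.10` is just a placeholder for whatever host runs llama-server, and the llama-server invocation is the one quoted above, so the same caveats about its flags apply.

```bash
# Serve every GGUF under /mnt/AI via llama.cpp's OpenAI-compatible API
# (flags as described in the comment above).
llama-server --host 0.0.0.0 --port 8080 --models-dir /mnt/AI &

# Point Open WebUI at that /v1 endpoint. llama-server does not check the key
# by default, so any non-empty value is fine here.
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL="http://192.168.1.10:8080/v1" \
  -e OPENAI_API_KEY="none" \
  -v open-webui:/app/backend/data \
  --name open-webui ghcr.io/open-webui/open-webui:main
```

Note that this is an OpenAI-compatible connection, not an Ollama connection; that distinction is exactly what the follow-up below runs into.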

Author
Owner

@OdinVex commented on GitHub (Mar 24, 2026):

@OdinVex yes. A very minimal implementation is: `llama-server --host [0.0.0.0, or hostname] --port 8080 --models-dir /mnt/AI` If you do this and create a connection in Open WebUI to `[ip or dns name]:8080/v1`, it will give you any model in the /mnt/AI directory as an option in Open WebUI.

Even better is using a config.ini file where you can set the settings for each model: `llama-server --host 0.0.0.0 --port 8080 --models-dir /mnt/AI --models-preset config.ini` This will allow you to set a custom context size, KV cache settings, etc.

Edit: I see, you meant an OpenAI endpoint, not an Ollama endpoint. I'll try it and report back whether it works.
Edit: It does not work, unfortunately. Trying to download a model just results in the API reporting a 404.

Author
Owner

@raro42 commented on GitHub (Mar 24, 2026):

.

Author
Owner

@CleyFaye commented on GitHub (Mar 25, 2026):

We already know about the issue, and the general outline of what should be done.

I don't see the value of regurgitating the existing discussion, especially to end on the suggestion to "do what was proposed, then tests and docs".

Author
Owner

@raro42 commented on GitHub (Mar 25, 2026):

@CleyFaye sorry for the noise. I will edit the comment. Thanks for commenting.

Author
Owner

@lawcontinue commented on GitHub (Apr 17, 2026):

We ran into this exact limitation while building Hippo (https://github.com/lawcontinue/hippo), a lightweight local LLM manager that uses llama-cpp-python to load GGUF models.

**What we found:**

llama-cpp-python handles multi-file GGUF transparently — when you pass a path like model-00001-of-00005.gguf, it detects and loads all shards automatically. So the technical blocker isn't on the inference side; it's on Ollama's import/registry layer that expects a single file.
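
To make that concrete, the sketch below points stock llama.cpp tooling (the C++ CLI and server rather than the Python binding, but the loader behavior is the same) at only the first shard and lets it resolve the rest. The file names are placeholders, and it assumes the shards follow the standard `gguf-split` naming so the sibling files sit next to the first one.

```bash
# llama.cpp finds model-00002-of-00005.gguf ... model-00005-of-00005.gguf
# automatically from the directory of the first shard; no manual merge needed.
./llama-cli -m ./model-00001-of-00005.gguf -p "Hello" -n 32

# The same applies to the server:
./llama-server -m ./model-00001-of-00005.gguf --host 0.0.0.0 --port 8080
```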

**Workaround we use in Hippo:**

For our users, we document a pre-import step: concatenate sharded GGUF files using `gguf-split --merge`, then point Hippo at the merged file. It works but adds friction: a 70B model merge takes 2-3 minutes and temporarily doubles disk usage.

**Implementation thoughts for Ollama:**

The cleanest path would be to extend ollama create to accept glob patterns or a directory:

```bash
# Current (single file)
ollama create mymodel -f Modelfile  # FROM ./model.gguf

# Proposed (multi-file)
ollama create mymodel -f Modelfile  # FROM ./model-*.gguf
```

Under the hood, this would pass the first shard path to the GGUF loader (which already handles multi-file), or merge them during the blob creation step.

**One consideration:** sharded files from Hugging Face often follow naming conventions like `model-00001-of-00005.gguf`. Detecting and sorting these correctly matters: sort lexicographically on the zero-padded counter portion, not the full filename.

Happy to share more details from our Hippo implementation if helpful.
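
To make the sorting and completeness point concrete, here is a small, purely illustrative pre-import check one could run before handing the first shard to any loader. The glob pattern assumes the `model-NNNNN-of-NNNNN.gguf` convention described above; it is not an Ollama or Hippo feature.

```bash
# Hypothetical pre-flight check: find the first shard, read the declared total
# from its name, then verify that all sibling shards are present. Zero-padded
# counters mean a plain lexicographic sort already yields the correct order.
first_shard=$(ls ./*-00001-of-*.gguf 2>/dev/null | head -n 1)
[ -n "$first_shard" ] || { echo "no sharded GGUF found" >&2; exit 1; }

declared_total=$(echo "$first_shard" | sed -E 's/.*-of-0*([0-9]+)\.gguf$/\1/')
prefix=${first_shard%-00001-of-*}

shards=$(ls "${prefix}"-*-of-*.gguf | sort)
found=$(echo "$shards" | wc -l)

if [ "$found" -eq "$declared_total" ]; then
    echo "all $found shards present; first shard: $first_shard"
else
    echo "expected $declared_total shards, found $found" >&2
fi
```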


Reference: github-starred/ollama#29042