[GH-ISSUE #5245] Allow importing multi-file GGUF models #3282

New Issue

GiteaMirror · 2026-04-12T13:49:47-05:00

GiteaMirror commented

2026-04-12 13:49:47 -05:00

Originally created by @jmorganca on GitHub (Jun 23, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5245

What is the issue?

Currently Ollama can import GGUF files. However, larger models are sometimes split into separate files. Ollama should support loading multiple GGUF files similar to loading safetensor files.

OS

No response

GPU

No response

CPU

No response

Ollama version

No response

Originally created by @jmorganca on GitHub (Jun 23, 2024). Original GitHub issue: https://github.com/ollama/ollama/issues/5245 ### What is the issue? Currently Ollama can [import GGUF files](https://github.com/ollama/ollama/blob/main/docs/import.md). However, larger models are sometimes split into separate files. Ollama should support loading multiple GGUF files similar to loading safetensor files. ### OS _No response_ ### GPU _No response_ ### CPU _No response_ ### Ollama version _No response_

GiteaMirror added the bug label 2026-04-12 13:49:47 -05:00

GiteaMirror commented

2026-04-12 13:49:49 -05:00

@gsoul commented on GitHub (Aug 22, 2024):

Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to:

Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)
run command:
./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
for example
./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf

Hope this will help somebody.

@gsoul commented on GitHub (Aug 22, 2024): Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to: 1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual) 2. run command: `./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf` for example `./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf` Hope this will help somebody.

GiteaMirror commented

2026-04-12 13:49:51 -05:00

@nauen commented on GitHub (Aug 23, 2024):

Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to:

Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)

run command:
./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
for example
./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf

Hope this will help somebody.

yes it does <3

@nauen commented on GitHub (Aug 23, 2024): > Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to: > > 1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual) > 2. run command: > `./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf` > for example > `./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf` > > Hope this will help somebody. yes it does <3

GiteaMirror commented

2026-04-12 13:49:52 -05:00

@werruww commented on GitHub (Oct 17, 2024):

Does Obama support fragmented models? Important: They must be merged before running in Olama by modelfile

@werruww commented on GitHub (Oct 17, 2024): Does Obama support fragmented models? Important: They must be merged before running in Olama by modelfile

GiteaMirror commented

2026-04-12 13:49:52 -05:00

@werruww commented on GitHub (Oct 17, 2024):

lama cpp and llama cpp python can run multy part

ollama can run it???????????????

@werruww commented on GitHub (Oct 17, 2024): lama cpp and llama cpp python can run multy part ollama can run it???????????????

GiteaMirror commented

2026-04-12 13:49:53 -05:00

@werruww commented on GitHub (Oct 17, 2024):

/content# ollama run hf.co/goodasdgood/dracarys2-72b-instruct
pulling manifest
Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245
/content#

@werruww commented on GitHub (Oct 17, 2024): /content# ollama run hf.co/goodasdgood/dracarys2-72b-instruct pulling manifest Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245 /content#

GiteaMirror commented

2026-04-12 13:49:53 -05:00

@mitchross commented on GitHub (Oct 24, 2024):

https://x.com/reach_vb/status/1846545312548360319

@mitchross commented on GitHub (Oct 24, 2024): https://x.com/reach_vb/status/1846545312548360319

GiteaMirror commented

2026-04-12 13:49:54 -05:00

@ahmetkca commented on GitHub (Nov 11, 2024):

Having this issue with Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q8_0

@ahmetkca commented on GitHub (Nov 11, 2024): Having this issue with `Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q8_0`

GiteaMirror commented

2026-04-12 13:49:54 -05:00

@rar0n commented on GitHub (Nov 12, 2024):

Having this issue with Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q8_0

Try what gsoul above suggests!
I just did and it worked for qwen2.5-coder-14b-instruct-q5_k_m-00001-of-00002.gguf. Didn't have to specify any other input files.
(Thanks gsoul!)

Still, it'd be nice if ollama could do this natively ofc.

@rar0n commented on GitHub (Nov 12, 2024): > Having this issue with `Qwen/Qwen2.5-Coder-32B-Instruct-GGUF:Q8_0` Try what gsoul above suggests! I just did and it worked for qwen2.5-coder-14b-instruct-q5_k_m-00001-of-00002.gguf. Didn't have to specify any other input files. (Thanks gsoul!) Still, it'd be nice if ollama could do this natively ofc.

GiteaMirror commented

2026-04-12 13:49:55 -05:00

@Kamayuq commented on GitHub (Nov 14, 2024):

Even though the workaround works it is a very cumbersome especially if you have to update the model. And you also have to create a manifest AFAIK. Ollama should really do this locally completely transparent for the user.

@Kamayuq commented on GitHub (Nov 14, 2024): Even though the workaround works it is a very cumbersome especially if you have to update the model. And you also have to create a manifest AFAIK. Ollama should really do this locally completely transparent for the user.

GiteaMirror commented

2026-04-12 13:49:55 -05:00

@rotvaldi commented on GitHub (Nov 17, 2024):

ugh "the falling"

@rotvaldi commented on GitHub (Nov 17, 2024): ugh "the falling"

GiteaMirror commented

2026-04-12 13:49:56 -05:00

@DrewGalbraith commented on GitHub (Nov 29, 2024):

Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to:

Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)

run command:
./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
for example
./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf

Hope this will help somebody.

For anyone else wondering how this merges all the other parts of the model or if you have to run this command for each split of the file to merge it into the new one or something, this reddit post laid it out. The gguf-split utility simply figures out the rest of the shards to merge together given the name of the first. I assume it requires they be names basically the same with only numbers deviating.

@DrewGalbraith commented on GitHub (Nov 29, 2024): > Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to: > > 1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual) > 2. run command: > `./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf` > for example > `./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf` > > Hope this will help somebody. For anyone else wondering how this merges all the other parts of the model or if you have to run this command for each split of the file to merge it into the new one or something, this [reddit post](https://www.reddit.com/r/LocalLLaMA/comments/1cf6n18/comment/l1o0opp/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button) laid it out. The gguf-split utility simply figures out the rest of the shards to merge together given the name of the first. I assume it requires they be names basically the same with only numbers deviating.

GiteaMirror commented

2026-04-12 13:49:56 -05:00

@BornSupercharged commented on GitHub (Dec 17, 2024):

In case you want a way to easily handle this, add this to your .bashrc or .zshrc file:

add_gguf() {
    set -x
    args="$@"
    cd /Volumes/MP44/Ollama
    s="${args/ollama run /}"
    IFS=: read -r ggufUrl quant <<< "$s"
    IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl"
    echo "[$ggufUrl] [$gguf] [$quant]"
    ggufN="${gguf/-GGUF/}"
    wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00001-of-00002.gguf"
    wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00002-of-00002.gguf"
    bash -c "llama-gguf-split --merge ${ggufN}-${quant}-00001-of-00002.gguf ${ggufN}-${quant}.guff"
    echo "FROM /Volumes/MP44/Ollama/${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model"
    bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model"
    bash -c "rm -f ${ggufN}*"
}

Usage example:
add_gguf hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K

It also handles the edge case where you've copied the full command out of hugging face (strips "ollama run" out):
add_gguf ollama run hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K

What it does:

Changes directory to /Volumes/MP44/Ollama (you can customize this for your own external drive, or wherever you want to temporarily store the downloaded .guff files)
Downloads the two .gguf files for the model using wget
Executes llama-gguf-split to merge the .gguf files into a single .guff
Creates the .model file
Executes ollama create using the .model file
Cleans up the downloaded files

After executing the command, you can run your model like this:
ollama run EVA-LLaMA-3.33-70B-v0.1-Q6_K

Verify the model information by entering:
/show info

  Model
    architecture        llama
    parameters          70.6B
    context length      131072
    embedding length    8192
    quantization        Q6_K

To stop, enter:
/bye

@BornSupercharged commented on GitHub (Dec 17, 2024): In case you want a way to easily handle this, add this to your .bashrc or .zshrc file: ``` add_gguf() { set -x args="$@" cd /Volumes/MP44/Ollama s="${args/ollama run /}" IFS=: read -r ggufUrl quant <<< "$s" IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl" echo "[$ggufUrl] [$gguf] [$quant]" ggufN="${gguf/-GGUF/}" wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00001-of-00002.gguf" wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00002-of-00002.gguf" bash -c "llama-gguf-split --merge ${ggufN}-${quant}-00001-of-00002.gguf ${ggufN}-${quant}.guff" echo "FROM /Volumes/MP44/Ollama/${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model" bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model" bash -c "rm -f ${ggufN}*" } ``` Usage example: add_gguf hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K It also handles the edge case where you've copied the full command out of hugging face (strips "ollama run" out): add_gguf ollama run hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K What it does: 1. Changes directory to /Volumes/MP44/Ollama (you can customize this for your own external drive, or wherever you want to temporarily store the downloaded .guff files) 2. Downloads the two .gguf files for the model using wget 3. Executes llama-gguf-split to merge the .gguf files into a single .guff 4. Creates the .model file 5. Executes ollama create using the .model file 6. Cleans up the downloaded files After executing the command, you can run your model like this: ollama run EVA-LLaMA-3.33-70B-v0.1-Q6_K Verify the model information by entering: /show info ``` Model architecture llama parameters 70.6B context length 131072 embedding length 8192 quantization Q6_K ``` To stop, enter: /bye

GiteaMirror commented

2026-04-12 13:49:56 -05:00

@AlgorithmicKing737 commented on GitHub (Dec 27, 2024):

In case you want a way to easily handle this, add this to your .bashrc or .zshrc file:
add_gguf() {
    set -x
    args="$@"
    cd /Volumes/MP44/Ollama
    s="${args/ollama run /}"
    IFS=: read -r ggufUrl quant <<< "$s"
    IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl"
    echo "[$ggufUrl] [$gguf] [$quant]"
    ggufN="${gguf/-GGUF/}"
    wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00001-of-00002.gguf"
    wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00002-of-00002.gguf"
    bash -c "llama-gguf-split --merge ${ggufN}-${quant}-00001-of-00002.gguf ${ggufN}-${quant}.guff"
    echo "FROM /Volumes/MP44/Ollama/${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model"
    bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model"
    bash -c "rm -f ${ggufN}*"
}
Usage example: add_gguf hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K

It also handles the edge case where you've copied the full command out of hugging face (strips "ollama run" out): add_gguf ollama run hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K

What it does:

Changes directory to /Volumes/MP44/Ollama (you can customize this for your own external drive, or wherever you want to temporarily store the downloaded .guff files)

Downloads the two .gguf files for the model using wget

Executes llama-gguf-split to merge the .gguf files into a single .guff

Creates the .model file

Executes ollama create using the .model file

Cleans up the downloaded files

After executing the command, you can run your model like this: ollama run EVA-LLaMA-3.33-70B-v0.1-Q6_K

Verify the model information by entering: /show info
  Model
    architecture        llama
    parameters          70.6B
    context length      131072
    embedding length    8192
    quantization        Q6_K
To stop, enter: /bye

where are "bashrc or .zshrc file" located?

@AlgorithmicKing737 commented on GitHub (Dec 27, 2024): > In case you want a way to easily handle this, add this to your .bashrc or .zshrc file: > > ``` > add_gguf() { > set -x > args="$@" > cd /Volumes/MP44/Ollama > s="${args/ollama run /}" > IFS=: read -r ggufUrl quant <<< "$s" > IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl" > echo "[$ggufUrl] [$gguf] [$quant]" > ggufN="${gguf/-GGUF/}" > wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00001-of-00002.gguf" > wget "${ggufUrl}/resolve/main/${ggufN}-${quant}/${ggufN}-${quant}-00002-of-00002.gguf" > bash -c "llama-gguf-split --merge ${ggufN}-${quant}-00001-of-00002.gguf ${ggufN}-${quant}.guff" > echo "FROM /Volumes/MP44/Ollama/${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model" > bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model" > bash -c "rm -f ${ggufN}*" > } > ``` > > Usage example: add_gguf hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K > > It also handles the edge case where you've copied the full command out of hugging face (strips "ollama run" out): add_gguf ollama run hf.co/bartowski/EVA-LLaMA-3.33-70B-v0.1-GGUF:Q6_K > > What it does: > > 1. Changes directory to /Volumes/MP44/Ollama (you can customize this for your own external drive, or wherever you want to temporarily store the downloaded .guff files) > 2. Downloads the two .gguf files for the model using wget > 3. Executes llama-gguf-split to merge the .gguf files into a single .guff > 4. Creates the .model file > 5. Executes ollama create using the .model file > 6. Cleans up the downloaded files > > After executing the command, you can run your model like this: ollama run EVA-LLaMA-3.33-70B-v0.1-Q6_K > > Verify the model information by entering: /show info > > ``` > Model > architecture llama > parameters 70.6B > context length 131072 > embedding length 8192 > quantization Q6_K > ``` > > To stop, enter: /bye where are "bashrc or .zshrc file" located?

GiteaMirror commented

2026-04-12 13:49:57 -05:00

@BornSupercharged commented on GitHub (Dec 30, 2024):

@AlgorithmicKing in your user directory, i.e. ~/.bashrc

@BornSupercharged commented on GitHub (Dec 30, 2024): @AlgorithmicKing in your user directory, i.e. ~/.bashrc

GiteaMirror commented

2026-04-12 13:49:57 -05:00

@ngxson commented on GitHub (Jan 16, 2025):

Upstream llama.cpp added a new API called llama_model_load_from_splits that may help implementing this function on ollama. Let's hope that they will work on this on the next version!

@ngxson commented on GitHub (Jan 16, 2025): Upstream llama.cpp added a new API called `llama_model_load_from_splits` that may help implementing this function on ollama. Let's hope that they will work on this on the next version!

GiteaMirror commented

2026-04-12 13:49:57 -05:00

@mattapperson commented on GitHub (Jan 23, 2025):

Here is an updated version of @BornSupercharged's script that:

Accounts for splits that have an unknown number of parts.
Works on Mac
Does some pre-flight checks to ensure everything works as expected
Downloads files in parallel for faster pulls
Returns the user to their previous directory when done

add_gguf() {
    set -x
    args="$@"
 
    # Check if ollama is installed and get version
    version_output=$(ollama --version 2>&1)
    
    # Check if the output contains "Warning"
    if echo "$version_output" | grep -q "Warning"; then
        echo "Error: Ollama version check failed with warning: $version_output"
        echo "Please be sure that ollama client and server are installed, running, and on the same version."
        return 1
    fi

    # Check if llama-gguf-split is available in PATH
    if ! command -v llama-gguf-split &> /dev/null; then
        echo "Error: llama-gguf-split not found in PATH"
        echo "Please install the latest version of llama.cpp first"
        return 1
    fi

    # Save current directory to restore it later
    current_dir=$(pwd)
    
    # Use a more standard Mac location
    OLLAMA_DIR="$HOME/Library/Ollama"
    mkdir -p "$OLLAMA_DIR"
    cd "$OLLAMA_DIR" || return
    s="${args/ollama run /}"
    IFS=: read -r ggufUrl quant <<< "$s"
    IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl"
    echo "[$ggufUrl] [$gguf] [$quant]"
    ggufN="${gguf/-GGUF/}"

    git clone --no-checkout --depth 1 "https://${ggufUrl}"
    cd "${gguf}/.git"
    files=$(git ls-tree --full-name --name-only -r HEAD)
    cd ".."

    
    # Create array to store wget commands
    wget_commands=()
    
    while IFS= read -r file; do
      if [[ "$file" == *"$quant"* ]]; then
        if [[ -z "${firstFileName}" ]] && [[ "${file}" == *"01-of"* ]]; then
            firstFileName="${file}"
        fi
        if [[ -f "${file}" ]]; then
            echo "File ${file} already exists, skipping..."
            continue
        fi
        echo "Downloading https://${ggufUrl}/resolve/main/${file}"
        wget_commands+=("wget https://${ggufUrl}/resolve/main/${file}")
      fi
    done <<< "$files"
    
    # Execute all wget commands in parallel
    printf "%s\n" "${wget_commands[@]}" | xargs -P 8 -I {} bash -c "{}"


    bash -c "llama-gguf-split --merge ${firstFileName} ${ggufN}-${quant}.guff"
    echo "FROM ./${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model"
    bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model"
    bash -c "rm -f ${ggufN}*"

    cd "$current_dir" || return
}

@mattapperson commented on GitHub (Jan 23, 2025): Here is an updated version of @BornSupercharged's script that: - Accounts for splits that have an unknown number of parts. - Works on Mac - Does some pre-flight checks to ensure everything works as expected - Downloads files in parallel for faster pulls - Returns the user to their previous directory when done ``` add_gguf() { set -x args="$@" # Check if ollama is installed and get version version_output=$(ollama --version 2>&1) # Check if the output contains "Warning" if echo "$version_output" | grep -q "Warning"; then echo "Error: Ollama version check failed with warning: $version_output" echo "Please be sure that ollama client and server are installed, running, and on the same version." return 1 fi # Check if llama-gguf-split is available in PATH if ! command -v llama-gguf-split &> /dev/null; then echo "Error: llama-gguf-split not found in PATH" echo "Please install the latest version of llama.cpp first" return 1 fi # Save current directory to restore it later current_dir=$(pwd) # Use a more standard Mac location OLLAMA_DIR="$HOME/Library/Ollama" mkdir -p "$OLLAMA_DIR" cd "$OLLAMA_DIR" || return s="${args/ollama run /}" IFS=: read -r ggufUrl quant <<< "$s" IFS=/ read -r ggufD ggufU gguf <<< "$ggufUrl" echo "[$ggufUrl] [$gguf] [$quant]" ggufN="${gguf/-GGUF/}" git clone --no-checkout --depth 1 "https://${ggufUrl}" cd "${gguf}/.git" files=$(git ls-tree --full-name --name-only -r HEAD) cd ".." # Create array to store wget commands wget_commands=() while IFS= read -r file; do if [[ "$file" == *"$quant"* ]]; then if [[ -z "${firstFileName}" ]] && [[ "${file}" == *"01-of"* ]]; then firstFileName="${file}" fi if [[ -f "${file}" ]]; then echo "File ${file} already exists, skipping..." continue fi echo "Downloading https://${ggufUrl}/resolve/main/${file}" wget_commands+=("wget https://${ggufUrl}/resolve/main/${file}") fi done <<< "$files" # Execute all wget commands in parallel printf "%s\n" "${wget_commands[@]}" | xargs -P 8 -I {} bash -c "{}" bash -c "llama-gguf-split --merge ${firstFileName} ${ggufN}-${quant}.guff" echo "FROM ./${ggufN}-${quant}.guff" > "${ggufN}-${quant}.model" bash -c "ollama create ${ggufN}-${quant} -f ${ggufN}-${quant}.model" bash -c "rm -f ${ggufN}*" cd "$current_dir" || return } ```

GiteaMirror commented

2026-04-12 13:49:58 -05:00

@sakthi-geek commented on GitHub (Jan 23, 2025):

Why is this not natively supported yet? Is it being worked on for future updates?

@sakthi-geek commented on GitHub (Jan 23, 2025): Why is this not natively supported yet? Is it being worked on for future updates?

GiteaMirror commented

2026-04-12 13:49:58 -05:00

@renato-umeton commented on GitHub (Jan 26, 2025):

Please work on this guys <3 we love ollama and want to continue using it!

ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M

pulling manifest
Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245 (this page)

@renato-umeton commented on GitHub (Jan 26, 2025): Please work on this guys <3 we love ollama and want to continue using it! `ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M` pulling manifest **Error**: _pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: [https://github.com/ollama/ollama/issues/5245](https://github.com/ollama/ollama/issues/5245)_ (this page)

GiteaMirror commented

2026-04-12 13:49:59 -05:00

@AlgorithmicKing737 commented on GitHub (Jan 27, 2025):

Please work on this guys <3 we love ollama and want to continue using it!

ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M

pulling manifest Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: #5245 (this page)

I 100% agree with you but if you want to pull the full version of deepseek r1 it is already on ollama you can run ollama run deepseek-r1:671b

@AlgorithmicKing737 commented on GitHub (Jan 27, 2025): > Please work on this guys <3 we love ollama and want to continue using it! > > `ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M` > > pulling manifest **Error**: _pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: [#5245](https://github.com/ollama/ollama/issues/5245)_ (this page) I 100% agree with you but if you want to pull the full version of deepseek r1 it is already on [ollama](https://ollama.com/library/[deepseek-r1:671b](https://ollama.com/library/deepseek-r1:671b)) you can run `ollama run deepseek-r1:671b`

GiteaMirror commented

2026-04-12 13:50:00 -05:00

@LeiHao0 commented on GitHub (Jan 28, 2025):

Please work on this guys <3 we love ollama and want to continue using it!
ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M
pulling manifest Error: pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: #5245 (this page)

I 100% agree with you but if you want to pull the full version of deepseek r1 it is already on ollama you can run ollama run deepseek-r1:671b

Will ollama considering unsloth/DeepSeek-R1-GGUF, the 1.58-bit + 2-bit Dynamic Quants?

@LeiHao0 commented on GitHub (Jan 28, 2025): > > Please work on this guys <3 we love ollama and want to continue using it! > > `ollama run hf.co/unsloth/DeepSeek-R1-GGUF:Q4_K_M` > > pulling manifest **Error**: _pull model manifest: 400: The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: [#5245](https://github.com/ollama/ollama/issues/5245)_ (this page) > > I 100% agree with you but if you want to pull the full version of deepseek r1 it is already on [ollama](https://ollama.com/library/%5Bdeepseek-r1:671b%5D(https://ollama.com/library/deepseek-r1:671b)) you can run `ollama run deepseek-r1:671b` Will ollama considering unsloth/DeepSeek-R1-GGUF, the 1.58-bit + 2-bit Dynamic Quants?

GiteaMirror commented

2026-04-12 13:50:01 -05:00

@corticalstack commented on GitHub (Jan 28, 2025):

Watching this space for solutions to run lower quantized versions of DeepSeek-R1 that mere mortals can self host, e.g. 1.58-bit + 2-bit Dynamic Quants?

@corticalstack commented on GitHub (Jan 28, 2025): Watching this space for solutions to run lower quantized versions of DeepSeek-R1 that mere mortals can self host, e.g. 1.58-bit + 2-bit Dynamic Quants?

GiteaMirror commented

2026-04-12 13:50:02 -05:00

@Tanote650 commented on GitHub (Jan 29, 2025):

Implementing this in Ollama and giving a large number of less experienced users access to the models would be great!

@Tanote650 commented on GitHub (Jan 29, 2025): Implementing this in Ollama and giving a large number of less experienced users access to the models would be great!

GiteaMirror commented

2026-04-12 13:50:03 -05:00

@LeiHao0 commented on GitHub (Jan 29, 2025):

Good News:

I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on my M2 Ultra, using the instructions from https://unsloth.ai/blog/deepseekr1-dynamic.

Bad News:

Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to run. More importantly, even the successful 1.58 bits
model generate nonsensical output despite multiple attempts.

@LeiHao0 commented on GitHub (Jan 29, 2025): ## Good News: I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on my M2 Ultra, using the instructions from https://unsloth.ai/blog/deepseekr1-dynamic. ## Bad News: Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to run. More importantly, even the successful 1.58 bits model generate nonsensical output despite multiple attempts. <img width="577" alt="Image" src="https://github.com/user-attachments/assets/d68bf7fa-22f5-4845-9ed9-1d39cf51b98a" />

GiteaMirror commented

2026-04-12 13:50:04 -05:00

@corticalstack commented on GitHub (Jan 29, 2025):

Hoping some smaller DeepSeek quants / multi-file GGUF support provided soon, given the HUGE interest in this model right now.

@corticalstack commented on GitHub (Jan 29, 2025): Hoping some smaller DeepSeek quants / multi-file GGUF support provided soon, given the HUGE interest in this model right now.

GiteaMirror commented

2026-04-12 13:50:04 -05:00

@fserb commented on GitHub (Jan 29, 2025):

@LeiHao0 can you confirm you were able to run it with ollama? If so, can you share your Modelfile?

@fserb commented on GitHub (Jan 29, 2025): @LeiHao0 can you confirm you were able to run it with ollama? If so, can you share your Modelfile?

GiteaMirror commented

2026-04-12 13:50:05 -05:00

@LeiHao0 commented on GitHub (Jan 30, 2025):

@LeiHao0 can you confirm you were able to run it with ollama? If so, can you share your Modelfile?

You need to merge these split model files into a single file, then you can load it with ollama.
details in here https://unsloth.ai/blog/deepseekr1-dynamic

./llama.cpp/llama-gguf-split --merge \ DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \ merged_file.gguf

@LeiHao0 commented on GitHub (Jan 30, 2025): > [@LeiHao0](https://github.com/LeiHao0) can you confirm you were able to run it with ollama? If so, can you share your Modelfile? You need to merge these split model files into a single file, then you can load it with ollama. details in here https://unsloth.ai/blog/deepseekr1-dynamic `./llama.cpp/llama-gguf-split --merge \ DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \ merged_file.gguf`

GiteaMirror commented

2026-04-12 13:50:06 -05:00

@verygreen commented on GitHub (Jan 30, 2025):

@LeiHao0 don't forget to also implement this context fix if you want for the output to actually run to its completion: https://github.com/ollama/ollama/issues/5975#issuecomment-2295330804 (obviously don't use 24k context unless you have gobs of video RAM).

But overall I am having bad results with this particular quant in ollama, even thinking tags don't appear and the model seems to be rambling on endlessly until eventually cutting out

@verygreen commented on GitHub (Jan 30, 2025): @LeiHao0 don't forget to also implement this context fix if you want for the output to actually run to its completion: https://github.com/ollama/ollama/issues/5975#issuecomment-2295330804 (obviously don't use 24k context unless you have gobs of video RAM). But overall I am having bad results with this particular quant in ollama, even thinking tags don't appear and the model seems to be rambling on endlessly until eventually cutting out

GiteaMirror commented

2026-04-12 13:50:06 -05:00

@dmatora commented on GitHub (Feb 1, 2025):

What about TEMPLATE section for DeepSeek Modelfile?
Doesn't it need one?

@dmatora commented on GitHub (Feb 1, 2025): What about `TEMPLATE` section for DeepSeek Modelfile? Doesn't it need one?

GiteaMirror commented

2026-04-12 13:50:06 -05:00

@dmatora commented on GitHub (Feb 1, 2025):

Ok, found a tip at issue #8571

FROM merged_file.gguf
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<｜User｜>{{ .Content }}
{{- else if eq .Role "assistant" }}<｜Assistant｜>{{ .Content }}{{- if not $last }}<｜end▁of▁sentence｜>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<｜Assistant｜>{{- end }}
{{- end }}"""
PARAMETER stop <｜begin▁of▁sentence｜>
PARAMETER stop <｜end▁of▁sentence｜>
PARAMETER stop <｜User｜>
PARAMETER stop <｜Assistant｜>

@dmatora commented on GitHub (Feb 1, 2025): Ok, found a tip at issue #8571 ``` FROM merged_file.gguf TEMPLATE """{{- if .System }}{{ .System }}{{ end }} {{- range $i, $_ := .Messages }} {{- $last := eq (len (slice $.Messages $i)) 1}} {{- if eq .Role "user" }}<｜User｜>{{ .Content }} {{- else if eq .Role "assistant" }}<｜Assistant｜>{{ .Content }}{{- if not $last }}<｜end▁of▁sentence｜>{{- end }} {{- end }} {{- if and $last (ne .Role "assistant") }}<｜Assistant｜>{{- end }} {{- end }}""" PARAMETER stop <｜begin▁of▁sentence｜> PARAMETER stop <｜end▁of▁sentence｜> PARAMETER stop <｜User｜> PARAMETER stop <｜Assistant｜> ```

GiteaMirror commented

2026-04-12 13:50:07 -05:00

@mistrjirka commented on GitHub (Feb 6, 2025):

@LeiHao0 can you confirm you were able to run it with ollama? If so, can you share your Modelfile?

You need to merge these split model files into a single file, then you can load it with ollama. details in here https://unsloth.ai/blog/deepseekr1-dynamic

./llama.cpp/llama-gguf-split --merge \ DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \ merged_file.gguf

It seems that the ollama does not support IQ1 bit quantization. So for the 1bit quantization to work maybe some update to ollama is needed. But it is weird because llama.cpp supports the one bit quantization. The ollama errors out on wrong magic number.
Is there some technical explanation why implementing multi-file gguf is difficult? Or why does ollama does not suppport 1 bit quantization? Is that lack of developers having time for this or something more fundamental?

@mistrjirka commented on GitHub (Feb 6, 2025): > > [@LeiHao0](https://github.com/LeiHao0) can you confirm you were able to run it with ollama? If so, can you share your Modelfile? > > You need to merge these split model files into a single file, then you can load it with ollama. details in here https://unsloth.ai/blog/deepseekr1-dynamic > > `./llama.cpp/llama-gguf-split --merge \ DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \ merged_file.gguf` It seems that the ollama does not support IQ1 bit quantization. So for the 1bit quantization to work maybe some update to ollama is needed. But it is weird because llama.cpp supports the one bit quantization. The ollama errors out on wrong magic number. Is there some technical explanation why implementing multi-file gguf is difficult? Or why does ollama does not suppport 1 bit quantization? Is that lack of developers having time for this or something more fundamental?

GiteaMirror commented

2026-04-12 13:50:08 -05:00

@peng3502 commented on GitHub (Feb 18, 2025):

There are 10 files for DeepSeek
DeepSeek-R1-Q5_K_M-00001-of-00010.gguf DeepSeek-R1-Q5_K_M-00005-of-00010.gguf DeepSeek-R1-Q5_K_M-00009-of-00010.gguf
DeepSeek-R1-Q5_K_M-00002-of-00010.gguf DeepSeek-R1-Q5_K_M-00006-of-00010.gguf DeepSeek-R1-Q5_K_M-00010-of-00010.gguf
DeepSeek-R1-Q5_K_M-00003-of-00010.gguf DeepSeek-R1-Q5_K_M-00007-of-00010.gguf
DeepSeek-R1-Q5_K_M-00004-of-00010.gguf DeepSeek-R1-Q5_K_M-00008-of-00010.gguf

An error throwed out after llama merge.

llama-gguf-split --merge DeepSeek-R1-Q5_K_M-00001-of-00010.gguf DeepSeek-R1-Q5.gguf
gguf_merge: DeepSeek-R1-Q5_K_M-00001-of-00010.gguf -> DeepSeek-R1-Q5.gguf
terminate called after throwing an instance of 'std::__ios_failure'
what(): basic_ios::clear: iostream error

Does any one else have a solution?

@peng3502 commented on GitHub (Feb 18, 2025): There are 10 files for DeepSeek DeepSeek-R1-Q5_K_M-00001-of-00010.gguf DeepSeek-R1-Q5_K_M-00005-of-00010.gguf DeepSeek-R1-Q5_K_M-00009-of-00010.gguf DeepSeek-R1-Q5_K_M-00002-of-00010.gguf DeepSeek-R1-Q5_K_M-00006-of-00010.gguf DeepSeek-R1-Q5_K_M-00010-of-00010.gguf DeepSeek-R1-Q5_K_M-00003-of-00010.gguf DeepSeek-R1-Q5_K_M-00007-of-00010.gguf DeepSeek-R1-Q5_K_M-00004-of-00010.gguf DeepSeek-R1-Q5_K_M-00008-of-00010.gguf An error throwed out after llama merge. llama-gguf-split --merge DeepSeek-R1-Q5_K_M-00001-of-00010.gguf DeepSeek-R1-Q5.gguf gguf_merge: DeepSeek-R1-Q5_K_M-00001-of-00010.gguf -> DeepSeek-R1-Q5.gguf terminate called after throwing an instance of 'std::__ios_failure' what(): basic_ios::clear: iostream error Does any one else have a solution?

GiteaMirror commented

2026-04-12 13:50:08 -05:00

@MaoJianwei commented on GitHub (Feb 21, 2025):

Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to:

Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual)

run command:
./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf
for example
./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf

Hope this will help somebody.

This works for me! Thanks!

./llama-gguf-split --merge DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf DeepSeek-R1-Distill-Qwen-32B-F16.gguf

gguf_merge: DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf -> DeepSeek-R1-Distill-Qwen-32B-F16.gguf
gguf_merge: reading metadata DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf done
gguf_merge: reading metadata DeepSeek-R1-Distill-Qwen-32B-F16-00002-of-00002.gguf done
gguf_merge: writing tensors DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf done
gguf_merge: writing tensors DeepSeek-R1-Distill-Qwen-32B-F16-00002-of-00002.gguf done
gguf_merge: DeepSeek-R1-Distill-Qwen-32B-F16.gguf merged from 2 split with 771 tensors.

@MaoJianwei commented on GitHub (Feb 21, 2025): > Just in case someone would find this issue, like I did a few weeks ago, without knowing any workaround. Currently probably one of the easiest ways import multifile gguf into Ollama would be to: > > 1. Download pre-compiled binaries of llama.cpp: https://github.com/ggerganov/llama.cpp/releases (or install according to their manual) > 2. run command: > `./llama-gguf-split --merge mymodel-00001-of-00002.gguf out_file_name.gguf` > for example > `./llama-gguf-split --merge Mistral-Large-Instruct-2407-IQ4_XS-00001-of-00002.gguf outfile.gguf` > > Hope this will help somebody. This works for me! Thanks! ``` ./llama-gguf-split --merge DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf DeepSeek-R1-Distill-Qwen-32B-F16.gguf gguf_merge: DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf -> DeepSeek-R1-Distill-Qwen-32B-F16.gguf gguf_merge: reading metadata DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf done gguf_merge: reading metadata DeepSeek-R1-Distill-Qwen-32B-F16-00002-of-00002.gguf done gguf_merge: writing tensors DeepSeek-R1-Distill-Qwen-32B-F16-00001-of-00002.gguf done gguf_merge: writing tensors DeepSeek-R1-Distill-Qwen-32B-F16-00002-of-00002.gguf done gguf_merge: DeepSeek-R1-Distill-Qwen-32B-F16.gguf merged from 2 split with 771 tensors. ```

GiteaMirror commented

2026-04-12 13:50:09 -05:00

@thalesluoyx commented on GitHub (Mar 4, 2025):

Good News:

I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on my M2 Ultra, using the instructions from https://unsloth.ai/blog/deepseekr1-dynamic.

Bad News:

Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to run. More importantly, even the successful 1.58 bits model generate nonsensical output despite multiple attempts.

I face the smiliar issue, the the model generate nonsensical output. Even that there is only one .gguf file donwloaded. May I know if your problem had been solved? Thanks

@thalesluoyx commented on GitHub (Mar 4, 2025): > ## Good News: > I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on my M2 Ultra, using the instructions from https://unsloth.ai/blog/deepseekr1-dynamic. > > ## Bad News: > Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to run. More importantly, even the successful 1.58 bits model generate nonsensical output despite multiple attempts. > > <img alt="Image" width="577" src="https://private-user-images.githubusercontent.com/3134202/407746029-d68bf7fa-22f5-4845-9ed9-1d39cf51b98a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDEwODQ2NjAsIm5iZiI6MTc0MTA4NDM2MCwicGF0aCI6Ii8zMTM0MjAyLzQwNzc0NjAyOS1kNjhiZjdmYS0yMmY1LTQ4NDUtOWVkOS0xZDM5Y2Y1MWI5OGEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDMwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAzMDRUMTAzMjQwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDlkODViYTBjMTE1YWEwMDBlODMyMTFlMDY0MTgwMzAxZWQzOWFlMTBlZTQ3MWVjOGUzNDQ3M2IyY2I4YzI3YyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.uuXCaSJyOPCy7oKg7LDjMPBliG35hF_oIgGuKmAJrME"> I face the smiliar issue, the the model generate nonsensical output. Even that there is only one .gguf file donwloaded. May I know if your problem had been solved? Thanks ![Image](https://github.com/user-attachments/assets/31e94c61-96ee-4f5b-bc40-cb686c82c75b)

GiteaMirror commented

2026-04-12 13:50:09 -05:00

@renato-umeton commented on GitHub (Mar 4, 2025):

For prototyping, I ended up switching from Ollama to LM Studio
https://lmstudio.ai/

For prod, I'm still on Ollama

🤷

On Tue, Mar 4, 2025, 4:35 AM thalesluoyx @.***> wrote:

Good News:

I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on
my M2 Ultra, using the instructions from
https://unsloth.ai/blog/deepseekr1-dynamic.
Bad News:

Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to
run. More importantly, even the successful 1.58 bits model generate
nonsensical output despite multiple attempts.
[image: Image]
https://private-user-images.githubusercontent.com/3134202/407746029-d68bf7fa-22f5-4845-9ed9-1d39cf51b98a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDEwODQ2NjAsIm5iZiI6MTc0MTA4NDM2MCwicGF0aCI6Ii8zMTM0MjAyLzQwNzc0NjAyOS1kNjhiZjdmYS0yMmY1LTQ4NDUtOWVkOS0xZDM5Y2Y1MWI5OGEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDMwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAzMDRUMTAzMjQwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDlkODViYTBjMTE1YWEwMDBlODMyMTFlMDY0MTgwMzAxZWQzOWFlMTBlZTQ3MWVjOGUzNDQ3M2IyY2I4YzI3YyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.uuXCaSJyOPCy7oKg7LDjMPBliG35hF_oIgGuKmAJrME

I face the smiliar issue, the the model generate nonsensical output. Even
that there is only one .gguf file donwloaded. May I know if your problem
had been solved? Thanks

image.png (view on web)
https://github.com/user-attachments/assets/31e94c61-96ee-4f5b-bc40-cb686c82c75b

—
Reply to this email directly, view it on GitHub
https://github.com/ollama/ollama/issues/5245#issuecomment-2697010075,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AGBTFDPF7YIT2BTUP6OMQ6T2SV6XLAVCNFSM6AAAAABJYWW2BGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJXGAYTAMBXGU
.
You are receiving this because you commented.Message ID:
@.***>
[image: thalesluoyx]thalesluoyx left a comment (ollama/ollama#5245)
https://github.com/ollama/ollama/issues/5245#issuecomment-2697010075

Good News:

I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on
my M2 Ultra, using the instructions from
https://unsloth.ai/blog/deepseekr1-dynamic.
Bad News:

Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to
run. More importantly, even the successful 1.58 bits model generate
nonsensical output despite multiple attempts.
[image: Image]
https://private-user-images.githubusercontent.com/3134202/407746029-d68bf7fa-22f5-4845-9ed9-1d39cf51b98a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDEwODQ2NjAsIm5iZiI6MTc0MTA4NDM2MCwicGF0aCI6Ii8zMTM0MjAyLzQwNzc0NjAyOS1kNjhiZjdmYS0yMmY1LTQ4NDUtOWVkOS0xZDM5Y2Y1MWI5OGEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDMwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAzMDRUMTAzMjQwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDlkODViYTBjMTE1YWEwMDBlODMyMTFlMDY0MTgwMzAxZWQzOWFlMTBlZTQ3MWVjOGUzNDQ3M2IyY2I4YzI3YyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.uuXCaSJyOPCy7oKg7LDjMPBliG35hF_oIgGuKmAJrME

I face the smiliar issue, the the model generate nonsensical output. Even
that there is only one .gguf file donwloaded. May I know if your problem
had been solved? Thanks

image.png (view on web)
https://github.com/user-attachments/assets/31e94c61-96ee-4f5b-bc40-cb686c82c75b

—
Reply to this email directly, view it on GitHub
https://github.com/ollama/ollama/issues/5245#issuecomment-2697010075,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/AGBTFDPF7YIT2BTUP6OMQ6T2SV6XLAVCNFSM6AAAAABJYWW2BGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJXGAYTAMBXGU
.
You are receiving this because you commented.Message ID:
@.***>

@renato-umeton commented on GitHub (Mar 4, 2025): For prototyping, I ended up switching from Ollama to LM Studio https://lmstudio.ai/ For prod, I'm still on Ollama 🤷 On Tue, Mar 4, 2025, 4:35 AM thalesluoyx ***@***.***> wrote: > Good News: > > I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on > my M2 Ultra, using the instructions from > https://unsloth.ai/blog/deepseekr1-dynamic. > Bad News: > > Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to > run. More importantly, even the successful 1.58 bits model generate > nonsensical output despite multiple attempts. > [image: Image] > <https://private-user-images.githubusercontent.com/3134202/407746029-d68bf7fa-22f5-4845-9ed9-1d39cf51b98a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDEwODQ2NjAsIm5iZiI6MTc0MTA4NDM2MCwicGF0aCI6Ii8zMTM0MjAyLzQwNzc0NjAyOS1kNjhiZjdmYS0yMmY1LTQ4NDUtOWVkOS0xZDM5Y2Y1MWI5OGEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDMwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAzMDRUMTAzMjQwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDlkODViYTBjMTE1YWEwMDBlODMyMTFlMDY0MTgwMzAxZWQzOWFlMTBlZTQ3MWVjOGUzNDQ3M2IyY2I4YzI3YyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.uuXCaSJyOPCy7oKg7LDjMPBliG35hF_oIgGuKmAJrME> > > I face the smiliar issue, the the model generate nonsensical output. Even > that there is only one .gguf file donwloaded. May I know if your problem > had been solved? Thanks > > image.png (view on web) > <https://github.com/user-attachments/assets/31e94c61-96ee-4f5b-bc40-cb686c82c75b> > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/5245#issuecomment-2697010075>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AGBTFDPF7YIT2BTUP6OMQ6T2SV6XLAVCNFSM6AAAAABJYWW2BGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJXGAYTAMBXGU> > . > You are receiving this because you commented.Message ID: > ***@***.***> > [image: thalesluoyx]*thalesluoyx* left a comment (ollama/ollama#5245) > <https://github.com/ollama/ollama/issues/5245#issuecomment-2697010075> > > Good News: > > I've successfully quantized and run the DeepSeek-R1 model at 1.58 bits on > my M2 Ultra, using the instructions from > https://unsloth.ai/blog/deepseekr1-dynamic. > Bad News: > > Higher quantization levels of 1.73 bits, 2.22 bits and 2.51 bits failed to > run. More importantly, even the successful 1.58 bits model generate > nonsensical output despite multiple attempts. > [image: Image] > <https://private-user-images.githubusercontent.com/3134202/407746029-d68bf7fa-22f5-4845-9ed9-1d39cf51b98a.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NDEwODQ2NjAsIm5iZiI6MTc0MTA4NDM2MCwicGF0aCI6Ii8zMTM0MjAyLzQwNzc0NjAyOS1kNjhiZjdmYS0yMmY1LTQ4NDUtOWVkOS0xZDM5Y2Y1MWI5OGEucG5nP1gtQW16LUFsZ29yaXRobT1BV1M0LUhNQUMtU0hBMjU2JlgtQW16LUNyZWRlbnRpYWw9QUtJQVZDT0RZTFNBNTNQUUs0WkElMkYyMDI1MDMwNCUyRnVzLWVhc3QtMSUyRnMzJTJGYXdzNF9yZXF1ZXN0JlgtQW16LURhdGU9MjAyNTAzMDRUMTAzMjQwWiZYLUFtei1FeHBpcmVzPTMwMCZYLUFtei1TaWduYXR1cmU9NDlkODViYTBjMTE1YWEwMDBlODMyMTFlMDY0MTgwMzAxZWQzOWFlMTBlZTQ3MWVjOGUzNDQ3M2IyY2I4YzI3YyZYLUFtei1TaWduZWRIZWFkZXJzPWhvc3QifQ.uuXCaSJyOPCy7oKg7LDjMPBliG35hF_oIgGuKmAJrME> > > I face the smiliar issue, the the model generate nonsensical output. Even > that there is only one .gguf file donwloaded. May I know if your problem > had been solved? Thanks > > image.png (view on web) > <https://github.com/user-attachments/assets/31e94c61-96ee-4f5b-bc40-cb686c82c75b> > > — > Reply to this email directly, view it on GitHub > <https://github.com/ollama/ollama/issues/5245#issuecomment-2697010075>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/AGBTFDPF7YIT2BTUP6OMQ6T2SV6XLAVCNFSM6AAAAABJYWW2BGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDMOJXGAYTAMBXGU> > . > You are receiving this because you commented.Message ID: > ***@***.***> >

GiteaMirror commented

2026-04-12 13:50:10 -05:00

@PaulGilmartin commented on GitHub (Mar 12, 2025):

Hi all,

I am attempting to merge the gguf files from a hugging face DeepSeek V3 download. Using llama-gguf-split as follows:

llama.cpp/llama-gguf-split --merge DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/temp-merged.gguf

I encounter the following error:

gguf_merge: reading metadata DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf ...gguf_init_from_file_impl: invalid magic characters: 'vers', expected 'GGUF'

gguf_merge:  failed to load input GGUF from DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf

The files were downloaded by cloning https://huggingface.co/unsloth/DeepSeek-V3-GGUF. This is the content of the DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L folder:

~/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L # ls
DeepSeek-V3-Q2_K_L-00001-of-00005.gguf
DeepSeek-V3-Q2_K_L-00002-of-00005.gguf
DeepSeek-V3-Q2_K_L-00003-of-00005.gguf
DeepSeek-V3-Q2_K_L-00004-of-00005.gguf
DeepSeek-V3-Q2_K_L-00005-of-00005.gguf

Does anyone know what's going wrong here/ how to fix this? Thanks in advance!

@PaulGilmartin commented on GitHub (Mar 12, 2025): Hi all, I am attempting to merge the gguf files from a hugging face DeepSeek V3 download. Using llama-gguf-split as follows: ``` llama.cpp/llama-gguf-split --merge DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/temp-merged.gguf ``` I encounter the following error: ``` gguf_merge: reading metadata DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf ...gguf_init_from_file_impl: invalid magic characters: 'vers', expected 'GGUF' gguf_merge: failed to load input GGUF from DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L/DeepSeek-V3-Q2_K_L-00001-of-00005.gguf ``` The files were downloaded by cloning https://huggingface.co/unsloth/DeepSeek-V3-GGUF. This is the content of the DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L folder: ``` ~/DeepSeek-V3-GGUF/DeepSeek-V3-Q2_K_L # ls DeepSeek-V3-Q2_K_L-00001-of-00005.gguf DeepSeek-V3-Q2_K_L-00002-of-00005.gguf DeepSeek-V3-Q2_K_L-00003-of-00005.gguf DeepSeek-V3-Q2_K_L-00004-of-00005.gguf DeepSeek-V3-Q2_K_L-00005-of-00005.gguf ``` Does anyone know what's going wrong here/ how to fix this? Thanks in advance!

GiteaMirror commented

2026-04-12 13:50:10 -05:00

@MrEdigital commented on GitHub (Mar 27, 2025):

More and more models are going the sharded route, exclusively. This needs to be addressed soon.

@MrEdigital commented on GitHub (Mar 27, 2025): More and more models are going the sharded route, exclusively. This needs to be addressed soon.

GiteaMirror commented

2026-04-12 13:50:11 -05:00

@mcDandy commented on GitHub (Apr 5, 2025):

It is highly needed. I want to use a custom gemma 3 but vision tower is separate. Llama.cpp does not help to merge vision into the model.

@mcDandy commented on GitHub (Apr 5, 2025): It is highly needed. I want to use a custom gemma 3 but vision tower is separate. Llama.cpp does not help to merge vision into the model.

GiteaMirror commented

2026-04-12 13:50:12 -05:00

@1472583610 commented on GitHub (Apr 17, 2025):

I plus-one this. Ollama needs to support this natively. Aside from it being an unreasonably large amount of work to constantly download and manually merge models, there seem to be issues when running the merged models on multiple GPUs.

We have a 2x A6000 AI server and it doesn't load merged models larger than what fits on a single card (48 GB).

Support for sharded models is becoming a must.

@1472583610 commented on GitHub (Apr 17, 2025): I plus-one this. Ollama needs to support this natively. Aside from it being an unreasonably large amount of work to constantly download and manually merge models, there seem to be issues when running the merged models on multiple GPUs. We have a 2x A6000 AI server and it doesn't load merged models larger than what fits on a single card (48 GB). Support for sharded models is becoming a must.

GiteaMirror commented

2026-04-12 13:50:12 -05:00

@mNandhu commented on GitHub (May 21, 2025):

+1 - There's no mention about the incompatibility in ModelFile Docs either

@mNandhu commented on GitHub (May 21, 2025): +1 - There's no mention about the incompatibility in ModelFile Docs either

GiteaMirror commented

2026-04-12 13:50:13 -05:00

@lknight commented on GitHub (May 28, 2025):

+1 Its almost impossible to do something with DeepSeek (multiple files model) in Ollama and multiple A6000.

@lknight commented on GitHub (May 28, 2025): +1 Its almost impossible to do something with DeepSeek (multiple files model) in Ollama and multiple A6000.

GiteaMirror commented

2026-04-12 13:50:13 -05:00

@LastMinuteStudio commented on GitHub (Jun 8, 2025):

Relatively new to LLMs so pardon my ignorance. I've tried merging and running the split GGUFs especially on the latest R1 from unsloth https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-IQ1_S.

The way I run it is through their recommended command
./llama-server --port 10000 --ctx-size 1024 --n-gpu-layers 40 --model a:/Deepseek-R1-0528-UD-IQ1_S-Merged.gguf
./llama-server --port 10000 --ctx-size 1024 --n-gpu-layers 40 --model a:/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf

They work both merged and split in llama.cpp using llama-server.exe.
I've then tried creating a Modelfile for the merged gguf in ollama

FROM a:\Deepseek-R1-0528-UD-IQ1_S-Merged.gguf
TEMPLATE """{{- if .System }}{{ .System }}{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1}}
{{- if eq .Role "user" }}<｜User｜>{{ .Content }}
{{- else if eq .Role "assistant" }}<｜Assistant｜>
  {{- if and $.IsThinkSet (and $last .Thinking) -}}
<think>
{{ .Thinking }}
</think>
{{- end }}{{ .Content }}{{- if not $last }}<｜end▁of▁sentence｜>{{- end }}
{{- end }}
{{- if and $last (ne .Role "assistant") }}<｜Assistant｜>
{{- if and $.IsThinkSet (not $.Think) -}}
<think>

</think>

{{ end }}
{{- end -}}
{{- end }}"""
PARAMETER min_p 0.01
PARAMETER repeat_penalty 1
PARAMETER top_p 0.95
PARAMETER num_predict 16384
PARAMETER num_ctx 1024
PARAMETER num_gpu 40
PARAMETER stop <｜begin▁of▁sentence｜>
PARAMETER stop <｜end▁of▁sentence｜>
PARAMETER stop <｜User｜>
PARAMETER stop <｜Assistant｜>
PARAMETER temperature 0.6

I create the model using ollama create DeepseekR1 -f Modelfile
However when I run it through ollama, I keep getting an error telling me that the memory required is insufficient even though llama.cpp has no issues with loading it slowly.
Error: model requires more system memory (164.9 GiB) than is available (94.4 GiB)
Is this due to the gguf being split or some other reason? The merged gguf works fine in llama.cpp though

edit: I've increased my swap size just to get past the memory error but now I get the following
llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer

@LastMinuteStudio commented on GitHub (Jun 8, 2025): Relatively new to LLMs so pardon my ignorance. I've tried merging and running the split GGUFs especially on the latest R1 from unsloth https://huggingface.co/unsloth/DeepSeek-R1-0528-GGUF/tree/main/UD-IQ1_S. The way I run it is through their recommended command `./llama-server --port 10000 --ctx-size 1024 --n-gpu-layers 40 --model a:/Deepseek-R1-0528-UD-IQ1_S-Merged.gguf` `./llama-server --port 10000 --ctx-size 1024 --n-gpu-layers 40 --model a:/DeepSeek-R1-0528-UD-IQ1_S-00001-of-00004.gguf` They work both merged and split in llama.cpp using llama-server.exe. I've then tried creating a Modelfile for the merged gguf in ollama ``` FROM a:\Deepseek-R1-0528-UD-IQ1_S-Merged.gguf TEMPLATE """{{- if .System }}{{ .System }}{{ end }} {{- range $i, $_ := .Messages }} {{- $last := eq (len (slice $.Messages $i)) 1}} {{- if eq .Role "user" }}<｜User｜>{{ .Content }} {{- else if eq .Role "assistant" }}<｜Assistant｜> {{- if and $.IsThinkSet (and $last .Thinking) -}} <think> {{ .Thinking }} </think> {{- end }}{{ .Content }}{{- if not $last }}<｜end▁of▁sentence｜>{{- end }} {{- end }} {{- if and $last (ne .Role "assistant") }}<｜Assistant｜> {{- if and $.IsThinkSet (not $.Think) -}} <think> </think> {{ end }} {{- end -}} {{- end }}""" PARAMETER min_p 0.01 PARAMETER repeat_penalty 1 PARAMETER top_p 0.95 PARAMETER num_predict 16384 PARAMETER num_ctx 1024 PARAMETER num_gpu 40 PARAMETER stop <｜begin▁of▁sentence｜> PARAMETER stop <｜end▁of▁sentence｜> PARAMETER stop <｜User｜> PARAMETER stop <｜Assistant｜> PARAMETER temperature 0.6 ``` I create the model using `ollama create DeepseekR1 -f Modelfile` However when I run it through ollama, I keep getting an error telling me that the memory required is insufficient even though llama.cpp has no issues with loading it slowly. `Error: model requires more system memory (164.9 GiB) than is available (94.4 GiB)` Is this due to the gguf being split or some other reason? The merged gguf works fine in llama.cpp though edit: I've increased my swap size just to get past the memory error but now I get the following `llama runner process has terminated: error loading model: unable to allocate CUDA0 buffer`

GiteaMirror commented

2026-04-12 13:50:14 -05:00

@MrEdigital commented on GitHub (Jul 22, 2025):

This doesn't appear to be given the weight it deserves. It's now been more than a year since this was raised.

@MrEdigital commented on GitHub (Jul 22, 2025): This doesn't appear to be given the weight it deserves. It's now been more than a year since this was raised.

GiteaMirror commented

2026-04-12 13:50:14 -05:00

@giorgostheo commented on GitHub (Jul 22, 2025):

This needs to be prioritized IMO. Shared GGUFs are the norm now. No support for them in probably the most used platform for local LLMs is bonkers...

@giorgostheo commented on GitHub (Jul 22, 2025): This needs to be prioritized IMO. Shared GGUFs are the norm now. No support for them in probably the most used platform for local LLMs is bonkers...

GiteaMirror commented

2026-04-12 13:50:15 -05:00

@kappa8219 commented on GitHub (Jul 23, 2025):

Qwen Coder is coming :) Also sharded...

@kappa8219 commented on GitHub (Jul 23, 2025): Qwen Coder is coming :) Also sharded...

GiteaMirror commented

2026-04-12 13:50:15 -05:00

@giorgostheo commented on GitHub (Jul 24, 2025):

I think its fair to say that this should become priority No.1 for the dev team at this point. Not having shared GGUF support will soon make ollama unusable...

@giorgostheo commented on GitHub (Jul 24, 2025): I think its fair to say that this should become priority No.1 for the dev team at this point. Not having shared GGUF support will soon make ollama unusable...

GiteaMirror commented

2026-04-12 13:50:16 -05:00

@tolysz commented on GitHub (Jul 27, 2025):

If llama.cpp has llama_model_load_from_splits we are almost half way... I wonder what is the desired solution... currently all the models are stored with hashed names, naive implementation could have some virtual to the top level folder and inside all files symlinked to the hashed files, filenames should follow the splits pattern... otherwise, the model could accept the list of splits and skip the symlinking
edit:
We just need to provide all the hashed files as list... the filename itself is irrelevant...

    // Load the model from a file
    // If the file is split into multiple parts, the file name must follow this pattern: <name>-%05d-of-%05d.gguf
    // If the split file name does not follow this pattern, use llama_model_load_from_splits
    LLAMA_API struct llama_model * llama_model_load_from_file(
                             const char * path_model,
              struct llama_model_params   params);

    // Load the model from multiple splits (support custom naming scheme)
    // The paths must be in the correct order
    LLAMA_API struct llama_model * llama_model_load_from_splits(
                             const char ** paths,
                                 size_t    n_paths,
              struct llama_model_params    params);

@tolysz commented on GitHub (Jul 27, 2025): If `llama.cpp` has `llama_model_load_from_splits` ~we are almost half way... I wonder what is the desired solution... currently all the models are stored with hashed names, naive implementation could have some `virtual` to the top level folder and inside all files symlinked to the hashed files, filenames should follow the splits pattern... otherwise, the model could accept the list of splits and skip the symlinking~ edit: We just need to provide all the hashed files as list... the filename itself is irrelevant... ``` // Load the model from a file // If the file is split into multiple parts, the file name must follow this pattern: <name>-%05d-of-%05d.gguf // If the split file name does not follow this pattern, use llama_model_load_from_splits LLAMA_API struct llama_model * llama_model_load_from_file( const char * path_model, struct llama_model_params params); // Load the model from multiple splits (support custom naming scheme) // The paths must be in the correct order LLAMA_API struct llama_model * llama_model_load_from_splits( const char ** paths, size_t n_paths, struct llama_model_params params); ```

GiteaMirror commented

2026-04-12 13:50:17 -05:00

@xNefas commented on GitHub (Aug 10, 2025):

Just gonna add to the "noise" and say this should be a priority, it's really rough being unable to load sharded GGUFs with Ollama.

@xNefas commented on GitHub (Aug 10, 2025): Just gonna add to the "noise" and say this should be a priority, it's really rough being unable to load sharded GGUFs with Ollama.

GiteaMirror commented

2026-04-12 13:50:18 -05:00

@Likkkez commented on GitHub (Aug 16, 2025):

Is it really this hard to just add two files together automatically? Is this one of those delusional ideological stances or what?

@Likkkez commented on GitHub (Aug 16, 2025): Is it really this hard to just add two files together automatically? Is this one of those delusional ideological stances or what?

GiteaMirror commented

2026-04-12 13:50:19 -05:00

@giorgostheo commented on GitHub (Aug 16, 2025):

Honestly at this point Im not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for ollama turbo, it seems that this will be another one of those os projects that remains os mostly for show... I really hope Im wrong but tbh Im already transitioning to llama-cpp cause of this. I suggest that others do the same.

@giorgostheo commented on GitHub (Aug 16, 2025): Honestly at this point Im not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for ollama turbo, it seems that this will be another one of those os projects that remains os mostly for show... I really hope Im wrong but tbh Im already transitioning to llama-cpp cause of this. I suggest that others do the same.

GiteaMirror commented

2026-04-12 13:50:20 -05:00

@OdinVex commented on GitHub (Aug 21, 2025):

Honestly at this point Im not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for ollama turbo, it seems that this will be another one of those os projects that remains os mostly for show... I really hope Im wrong but tbh Im already transitioning to llama-cpp cause of this. I suggest that others do the same.

? Turbo? What is that, some upsell or something? Is it time to fork Ollama?

@OdinVex commented on GitHub (Aug 21, 2025): > Honestly at this point Im not even sure this will ever be implemented. After all the fuss with GPT-OSS and the push for ollama turbo, it seems that this will be another one of those os projects that remains os mostly for show... I really hope Im wrong but tbh Im already transitioning to llama-cpp cause of this. I suggest that others do the same. ? Turbo? What is that, some upsell or something? Is it time to fork Ollama?

GiteaMirror commented

2026-04-12 13:50:21 -05:00

@kappa8219 commented on GitHub (Sep 1, 2025):

"The Little Engine That Could" (c)

@kappa8219 commented on GitHub (Sep 1, 2025): "The Little Engine That Could" (c)

GiteaMirror commented

2026-04-12 13:50:22 -05:00

@bokkob556644-coder commented on GitHub (Oct 22, 2025):

ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M

https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF

Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

@bokkob556644-coder commented on GitHub (Oct 22, 2025): ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

GiteaMirror commented

2026-04-12 13:50:23 -05:00

@kappa8219 commented on GitHub (Oct 27, 2025):

ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M

https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF

Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

Welcome to the club

@kappa8219 commented on GitHub (Oct 27, 2025): > ollama run hf.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF:Q4_K_M > > https://huggingface.co/asdgad/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M-GGUF > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: [https://github.com/ollama/ollama/issues/5245"}](https://github.com/ollama/ollama/issues/5245%22%7D) Welcome to the club

GiteaMirror commented

2026-04-12 13:50:24 -05:00

@SvenMeyer commented on GitHub (Nov 8, 2025):

$ ollama pull hf.co/unsloth/MiniMax-M2-GGUF:Q3_K_XL
pulling manifest 
Error: pull model manifest: 400: {"error":"The specified tag is a sharded GGUF. Ollama does not support this yet. Please use another tag or \"latest\". Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

nobody looking into this ?

@SvenMeyer commented on GitHub (Nov 8, 2025): ```bash $ ollama pull hf.co/unsloth/MiniMax-M2-GGUF:Q3_K_XL pulling manifest Error: pull model manifest: 400: {"error":"The specified tag is a sharded GGUF. Ollama does not support this yet. Please use another tag or \"latest\". Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} ``` nobody looking into this ?

GiteaMirror commented

2026-04-12 13:50:25 -05:00

@slenderq commented on GitHub (Nov 9, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

It would be great to run this model in ollama!

@slenderq commented on GitHub (Nov 9, 2025): ``` ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S pulling manifest Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} ``` It would be great to run this model in ollama!

GiteaMirror commented

2026-04-12 13:50:25 -05:00

@cvrunmin commented on GitHub (Nov 13, 2025):

not a golang user tho, from my glance on the code, there is only one place which actually call model load function using llama.cpp here:
8a75d8b015/llama/llama.go (L259-L309)
at L303, C.llama_model_load_from_file is called to load model, which the implementation is here:
8a75d8b015/llama/llama.cpp/src/llama.cpp (L304-L325)
Yes, there is a function named llama_model_load_from_splits that should be able to load split files!
This function requires a list of split file paths as the parameter. Thus we have to know the correct order of the model split files (assume the function that actually load them don't guess how they are split). We might need to add fields in Modelfile to provide this information. This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

@cvrunmin commented on GitHub (Nov 13, 2025): not a golang user tho, from my glance on the code, there is only one place which actually call model load function using llama.cpp here: https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.go#L259-L309 at L303, `C.llama_model_load_from_file` is called to load model, which the implementation is here: https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.cpp/src/llama.cpp#L304-L325 Yes, there is a function named `llama_model_load_from_splits` that should be able to load split files! This function requires a list of split file paths as the parameter. Thus we have to know the correct order of the model split files (assume the function that actually load them don't guess how they are split). We might need to add fields in Modelfile to provide this information. This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

GiteaMirror commented

2026-04-12 13:50:25 -05:00

@tolysz commented on GitHub (Nov 13, 2025):

The challenge is in the config files, as the files are renamed to some-hash... the config file needs to support storing a list of the filenames... like the type of file is not a string but a list of them.

@tolysz commented on GitHub (Nov 13, 2025): The challenge is in the config files, as the files are renamed to some-hash... the config file needs to support storing a list of the filenames... like the type of file is not a string but a list of them.

GiteaMirror commented

2026-04-12 13:50:26 -05:00

@FearL0rd commented on GitHub (Nov 14, 2025):

Looks like Ollama has more focus on their cloud instead of working on this

@FearL0rd commented on GitHub (Nov 14, 2025): Looks like Ollama has more focus on their cloud instead of working on this

GiteaMirror commented

2026-04-12 13:50:26 -05:00

@OdinVex commented on GitHub (Nov 14, 2025):

So it seems the only thing holding it back is a config-representation of split-files and a simple 'if split call load-splits' instead?

@OdinVex commented on GitHub (Nov 14, 2025): So it seems the only thing holding it back is a config-representation of split-files and a simple 'if split call load-splits' instead?

GiteaMirror commented

2026-04-12 13:50:27 -05:00

@rdeforest commented on GitHub (Nov 14, 2025):

Looks like Ollama has more focus on their cloud instead of working on this

One of the many great things about open-source projects is that everyone gets to work on whatever they want to work on. I bet if you put together a quality PR to address this issue, the team would consider merging it. Or if you don't want to wait you could just maintain your own fork.

If you don't want to help, that's fine too. Just don't complain about the priorities of volunteers please?

@rdeforest commented on GitHub (Nov 14, 2025): > Looks like Ollama has more focus on their cloud instead of working on this One of the many great things about open-source projects is that everyone gets to work on whatever they want to work on. I bet if you put together a quality PR to address this issue, the team would consider merging it. Or if you don't want to wait you could just maintain your own fork. If you don't want to help, that's fine too. Just don't complain about the priorities of volunteers please?

GiteaMirror commented

2026-04-12 13:50:27 -05:00

@OdinVex commented on GitHub (Nov 14, 2025):

I think extension of Ollama manifests to describe split GGUFs (and their order) is necessary, first. Perhaps any Ollama developer could chime in for that?

@OdinVex commented on GitHub (Nov 14, 2025): I think extension of Ollama manifests to describe split GGUFs (and their order) is necessary, first. Perhaps any Ollama developer could chime in for that?

GiteaMirror commented

2026-04-12 13:50:28 -05:00

@Mikec78660 commented on GitHub (Nov 14, 2025):

EDIT: Leaving this in case anyone else has this problem. Seems from the bash script that was posted you can then run:
ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model
And voilà, the model is showing up in ollama, no need to use the import in openweui which doesn't seem to work.

I use the llama.cpp method to combine my model:

GLM-4.5-Air-Q4_0-00001-of-00002.gguf
GLM-4.5-Air-Q4_0-00002-of-00002.gguf

And it seemed to work:

./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors.

But trying to import "GLM-4.5-Air-Q4_0.gguf" into ollama after a minute or so I get an error saying error parsing the body. Any idea what I am doing wrong?

@Mikec78660 commented on GitHub (Nov 14, 2025): EDIT: Leaving this in case anyone else has this problem. Seems from the bash script that was posted you can then run: `ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model` And voilà, the model is showing up in ollama, no need to use the import in openweui which doesn't seem to work. ******* I use the llama.cpp method to combine my model: ``` GLM-4.5-Air-Q4_0-00001-of-00002.gguf GLM-4.5-Air-Q4_0-00002-of-00002.gguf ``` And it seemed to work: ``` ./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors. ``` But trying to import "GLM-4.5-Air-Q4_0.gguf" into ollama after a minute or so I get an error saying error parsing the body. Any idea what I am doing wrong?

GiteaMirror commented

2026-04-12 13:50:29 -05:00

@OdinVex commented on GitHub (Nov 14, 2025):

EDIT: Leaving this in case anyone else has this problem. Seems from the bash script that was posted you can then run: ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model And voilà, the model is showing up in ollama, no need to use the import in openweui which doesn't seem to work.

I use the llama.cpp method to combine my model:
GLM-4.5-Air-Q4_0-00001-of-00002.gguf
GLM-4.5-Air-Q4_0-00002-of-00002.gguf
And it seemed to work:
./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done
gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done
gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors.
But trying to import "GLM-4.5-Air-Q4_0.gguf" into ollama after a minute or so I get an error saying error parsing the body. Any idea what I am doing wrong?

This Issue is about importing shard GGUF files, not about llama.cpp. But on a side note, I've never had llama.cpp produce a file that didn't end up in gibberish or corrupt output.

@OdinVex commented on GitHub (Nov 14, 2025): > EDIT: Leaving this in case anyone else has this problem. Seems from the bash script that was posted you can then run: `ollama create GLM-4.5-Air-Q4_0.gguf -f GLM-4.5-Air-Q4_0.gguf.model` And voilà, the model is showing up in ollama, no need to use the import in openweui which doesn't seem to work. > > I use the llama.cpp method to combine my model: > > ``` > GLM-4.5-Air-Q4_0-00001-of-00002.gguf > GLM-4.5-Air-Q4_0-00002-of-00002.gguf > ``` > > And it seemed to work: > > ``` > ./llama-gguf-split --merge /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf /mnt/AI/GLM-4.5-Air-Q4_0.gguf > gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf -> /mnt/AI/GLM-4.5-Air-Q4_0.gguf > gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done > gguf_merge: reading metadata /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done > gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00001-of-00002.gguf done > gguf_merge: writing tensors /mnt/AI/GLM-4.5-Air-Q4_0-00002-of-00002.gguf done > gguf_merge: /mnt/AI/GLM-4.5-Air-Q4_0.gguf merged from 2 split with 803 tensors. > ``` > > But trying to import "GLM-4.5-Air-Q4_0.gguf" into ollama after a minute or so I get an error saying error parsing the body. Any idea what I am doing wrong? This Issue is about importing shard GGUF files, not about llama.cpp. But on a side note, I've never had llama.cpp produce a file that didn't end up in gibberish or corrupt output.

GiteaMirror commented

2026-04-12 13:50:30 -05:00

@shimmyshimmer commented on GitHub (Nov 14, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}

It would be great to run this model in ollama!

We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf

Usually we do these non-sharded files for any model under 300B parameters or so. But it is very small and 1.77-bit ish

@shimmyshimmer commented on GitHub (Nov 14, 2025): > ``` > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > pulling manifest > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > ``` > > It would be great to run this model in ollama! We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf Usually we do these non-sharded files for any model under 300B parameters or so. But it is very small and 1.77-bit ish

GiteaMirror commented

2026-04-12 13:50:30 -05:00

@cvrunmin commented on GitHub (Nov 17, 2025):

not a golang user tho, from my glance on the code, there is only one place which actually call model load function using llama.cpp here:

ollama/llama/llama.go

Lines 259 to 309 in 8a75d8b

func LoadModelFromFile(modelPath string, params ModelParams) (*Model, error) {
cparams := C.llama_model_default_params()
cparams.n_gpu_layers = C.int(params.NumGpuLayers)
cparams.main_gpu = C.int32_t(params.MainGpu)
cparams.use_mmap = C.bool(params.UseMmap)
cparams.vocab_only = C.bool(params.VocabOnly)

var devices []C.ggml_backend_dev_t
for _, llamaID := range params.Devices {
devices = append(devices, C.ggml_backend_dev_get(C.size_t(llamaID)))
}
if len(devices) > 0 {
devices = append(devices, C.ggml_backend_dev_t(C.NULL))
devicesData := &devices[0]
  var devicesPin runtime.Pinner 
  devicesPin.Pin(devicesData) 
  defer devicesPin.Unpin() 

  cparams.devices = devicesData 
}

if len(params.TensorSplit) > 0 {
tensorSplitData := &params.TensorSplit[0]
  var tensorSplitPin runtime.Pinner 
  tensorSplitPin.Pin(tensorSplitData) 
  defer tensorSplitPin.Unpin() 

  cparams.tensor_split = (*C.float)(unsafe.Pointer(tensorSplitData)) 
}

if params.Progress != nil {
handle := cgo.NewHandle(params.Progress)
defer handle.Delete()
  var handlePin runtime.Pinner 
  handlePin.Pin(&handle) 
  defer handlePin.Unpin() 

  cparams.progress_callback = C.llama_progress_callback(C.llamaProgressCallback) 
  cparams.progress_callback_user_data = unsafe.Pointer(&handle) 
}

m := Model{c: C.llama_model_load_from_file(C.CString(modelPath), cparams)}
if m.c == nil {
return nil, fmt.Errorf("unable to load model: %s", modelPath)
}

return &m, nil
}

at L303, C.llama_model_load_from_file is called to load model, which the implementation is here:
ollama/llama/llama.cpp/src/llama.cpp

Lines 304 to 325 in 8a75d8b

struct llama_model * llama_model_load_from_file(
const char * path_model,
struct llama_model_params params) {
std::vectorstd::string splits = {};
return llama_model_load_from_file_impl(path_model, splits, params);
}

struct llama_model * llama_model_load_from_splits(
const char ** paths,
size_t n_paths,
struct llama_model_params params) {
std::vectorstd::string splits;
if (n_paths == 0) {
LLAMA_LOG_ERROR("%s: list of splits is empty\n", func);
return nullptr;
}
splits.reserve(n_paths);
for (size_t i = 0; i < n_paths; ++i) {
splits.push_back(paths[i]);
}
return llama_model_load_from_file_impl(splits.front(), splits, params);
}

Yes, there is a function named llama_model_load_from_splits that should be able to load split files!
This function requires a list of split file paths as the parameter. Thus we have to know the correct order of the model split files (assume the function that actually load them don't guess how they are split). We might need to add fields in Modelfile to provide this information. This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

I only focused on llamarunner and didn't recognize that ollama has its own runner (ollamarunner) which is a different story. This thing will load models at ml/backend/ggml/ggml.go which really not supported multi-file gguf.
Anyway the modification of config format to support file path of sharded model in correct order is still necessary.

@cvrunmin commented on GitHub (Nov 17, 2025): > not a golang user tho, from my glance on the code, there is only one place which actually call model load function using llama.cpp here: > > [ollama/llama/llama.go](https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.go#L259-L309) > > Lines 259 to 309 in [8a75d8b](/ollama/ollama/commit/8a75d8b0154511d2bafe16f230e9268ee7a511da) > > func LoadModelFromFile(modelPath string, params ModelParams) (*Model, error) { > cparams := C.llama_model_default_params() > cparams.n_gpu_layers = C.int(params.NumGpuLayers) > cparams.main_gpu = C.int32_t(params.MainGpu) > cparams.use_mmap = C.bool(params.UseMmap) > cparams.vocab_only = C.bool(params.VocabOnly) > > var devices []C.ggml_backend_dev_t > for _, llamaID := range params.Devices { > devices = append(devices, C.ggml_backend_dev_get(C.size_t(llamaID))) > } > if len(devices) > 0 { > devices = append(devices, C.ggml_backend_dev_t(C.NULL)) > devicesData := &devices[0] > > var devicesPin runtime.Pinner > devicesPin.Pin(devicesData) > defer devicesPin.Unpin() > > cparams.devices = devicesData > } > > if len(params.TensorSplit) > 0 { > tensorSplitData := &params.TensorSplit[0] > > var tensorSplitPin runtime.Pinner > tensorSplitPin.Pin(tensorSplitData) > defer tensorSplitPin.Unpin() > > cparams.tensor_split = (*C.float)(unsafe.Pointer(tensorSplitData)) > } > > if params.Progress != nil { > handle := cgo.NewHandle(params.Progress) > defer handle.Delete() > > var handlePin runtime.Pinner > handlePin.Pin(&handle) > defer handlePin.Unpin() > > cparams.progress_callback = C.llama_progress_callback(C.llamaProgressCallback) > cparams.progress_callback_user_data = unsafe.Pointer(&handle) > } > > m := Model{c: C.llama_model_load_from_file(C.CString(modelPath), cparams)} > if m.c == nil { > return nil, fmt.Errorf("unable to load model: %s", modelPath) > } > > return &m, nil > } > > at L303, `C.llama_model_load_from_file` is called to load model, which the implementation is here: > [ollama/llama/llama.cpp/src/llama.cpp](https://github.com/ollama/ollama/blob/8a75d8b0154511d2bafe16f230e9268ee7a511da/llama/llama.cpp/src/llama.cpp#L304-L325) > > Lines 304 to 325 in [8a75d8b](/ollama/ollama/commit/8a75d8b0154511d2bafe16f230e9268ee7a511da) > > struct llama_model * llama_model_load_from_file( > const char * path_model, > struct llama_model_params params) { > std::vector<std::string> splits = {}; > return llama_model_load_from_file_impl(path_model, splits, params); > } > > struct llama_model * llama_model_load_from_splits( > const char ** paths, > size_t n_paths, > struct llama_model_params params) { > std::vector<std::string> splits; > if (n_paths == 0) { > LLAMA_LOG_ERROR("%s: list of splits is empty\n", __func__); > return nullptr; > } > splits.reserve(n_paths); > for (size_t i = 0; i < n_paths; ++i) { > splits.push_back(paths[i]); > } > return llama_model_load_from_file_impl(splits.front(), splits, params); > } > > Yes, there is a function named `llama_model_load_from_splits` that should be able to load split files! > This function requires a list of split file paths as the parameter. Thus we have to know the correct order of the model split files (assume the function that actually load them don't guess how they are split). We might need to add fields in Modelfile to provide this information. This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik) I only focused on llamarunner and didn't recognize that ollama has its own runner (ollamarunner) which is a different story. This thing will load models at `ml/backend/ggml/ggml.go` which really not supported multi-file gguf. Anyway the modification of config format to support file path of sharded model in correct order is still necessary.

GiteaMirror commented

2026-04-12 13:50:31 -05:00

@giorgostheo commented on GitHub (Nov 17, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}
It would be great to run this model in ollama!
We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf

Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish

Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now.

Thanks for all your work.

@giorgostheo commented on GitHub (Nov 17, 2025): > > ``` > > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > > pulling manifest > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > > ``` > > > > > > > > > > > > > > > > > > > > > > > > It would be great to run this model in ollama! > > We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf > > Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now. Thanks for all your work.

GiteaMirror commented

2026-04-12 13:50:32 -05:00

@OdinVex commented on GitHub (Nov 17, 2025):

... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of splitNumber-totalSplits. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine.

@OdinVex commented on GitHub (Nov 17, 2025): > ... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik) To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of `splitNumber-totalSplits`. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine.

GiteaMirror commented

2026-04-12 13:50:33 -05:00

@cvrunmin commented on GitHub (Nov 18, 2025):

... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik)

To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of splitNumber-totalSplits. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine.

If the multi-file GGUF model is created by user using ollama create -f Modelfile and the filenames of split ggufs are nicely named as xxxxxx-00001-of-00003.gguf then this is what your case. However, when the model is pulled from Internet, we only have the hash of the file.
For example, this is the manifest of gpt-oss hosted on ollama registry (https://registry.ollama.ai/v2/library/gpt-oss/manifests/latest):

{
  "schemaVersion":2,
  "mediaType":"application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType":"application/vnd.docker.container.image.v1+json",
    "digest":"sha256:776beb3adb235076157cfea408b8ea2a2d25eae99d7f5da997f607f6b69fa0fa",
    "size":489
  },
  "layers":[
    {
      "mediaType":"application/vnd.ollama.image.model",
      "digest":"sha256:e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb",
      "size":13793422144
    },
    {
      "mediaType":"application/vnd.ollama.image.template",
      "digest":"sha256:fa6710a93d78da62641e192361344be7a8c0a1c3737f139cf89f20ce1626b99c",
      "size":7240
    },
    {
      "mediaType":"application/vnd.ollama.image.license",
      "digest":"sha256:f60356777647e927149cbd4c0ec1314a90caba9400ad205ddc4ce47ed001c2d6",
      "size":11353
    },
    {
      "mediaType":"application/vnd.ollama.image.params",
      "digest":"sha256:d8ba2f9a17b3bbdeb5690efaa409b3fcb0b56296a777c7a69c78aa33bbddf182",
      "size":18
    }
  ]
}

In GGUF metadata of split GGUF, we have the split file information split.no, split.tensors.count and split.count. This is where llama.cpp check if the split file is provided in order:
584e2d646f/llama/llama.cpp/src/llama-model-loader.cpp (L526-L573)
In worst case scenario, we can cache the ordering from the metadata when the model is first created or pulled, then changes towards the config spec could be minimal.

@cvrunmin commented on GitHub (Nov 18, 2025): > > ... This information should be provided by HF too when ollama is ready for multifile gguf (they only provide a list of hash-named blob with blob type afaik) > > To the best of my knowledge splits always have their filenames suffixed (before extension) with a format of `splitNumber-totalSplits`. That's probably the only assumption that could be made about order. Maybe the backend doesn't care about order and loads them fine. If the multi-file GGUF model is created by user using `ollama create -f Modelfile` and the filenames of split ggufs are nicely named as `xxxxxx-00001-of-00003.gguf` then this is what your case. However, when the model is pulled from Internet, we only have the hash of the file. For example, this is the manifest of gpt-oss hosted on ollama registry (`https://registry.ollama.ai/v2/library/gpt-oss/manifests/latest`): ```json { "schemaVersion":2, "mediaType":"application/vnd.docker.distribution.manifest.v2+json", "config": { "mediaType":"application/vnd.docker.container.image.v1+json", "digest":"sha256:776beb3adb235076157cfea408b8ea2a2d25eae99d7f5da997f607f6b69fa0fa", "size":489 }, "layers":[ { "mediaType":"application/vnd.ollama.image.model", "digest":"sha256:e7b273f9636059a689e3ddcab3716e4f65abe0143ac978e46673ad0e52d09efb", "size":13793422144 }, { "mediaType":"application/vnd.ollama.image.template", "digest":"sha256:fa6710a93d78da62641e192361344be7a8c0a1c3737f139cf89f20ce1626b99c", "size":7240 }, { "mediaType":"application/vnd.ollama.image.license", "digest":"sha256:f60356777647e927149cbd4c0ec1314a90caba9400ad205ddc4ce47ed001c2d6", "size":11353 }, { "mediaType":"application/vnd.ollama.image.params", "digest":"sha256:d8ba2f9a17b3bbdeb5690efaa409b3fcb0b56296a777c7a69c78aa33bbddf182", "size":18 } ] } ``` In GGUF metadata of split GGUF, we have the split file information `split.no`, `split.tensors.count` and `split.count`. This is where llama.cpp check if the split file is provided in order: https://github.com/ollama/ollama/blob/584e2d646fb4d2f1643b4da81a096d01114f5b2b/llama/llama.cpp/src/llama-model-loader.cpp#L526-L573 In worst case scenario, we can cache the ordering from the metadata when the model is first created or pulled, then changes towards the config spec could be minimal.

GiteaMirror commented

2026-04-12 13:50:33 -05:00

@shimmyshimmer commented on GitHub (Nov 21, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}
It would be great to run this model in ollama!
We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf
Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish
Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now.

Thanks for all your work.

Even though this is possible it might not be the best idea because if one file breaks or the internet gets cut off, you'll need to redownload the hundreds of GB again. It might be fine if you have good internet but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which

@shimmyshimmer commented on GitHub (Nov 21, 2025): > > > ``` > > > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > > > pulling manifest > > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It would be great to run this model in ollama! > > > > > > We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf > > Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish > > Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now. > > Thanks for all your work. Even though this is possible it might not be the best idea because if one file breaks or the internet gets cut off, you'll need to redownload the hundreds of GB again. It might be fine if you have good internet but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which

GiteaMirror commented

2026-04-12 13:50:34 -05:00

@eurekin commented on GitHub (Dec 9, 2025):

This still a thing?

Happy birthday to the issue I guess

@eurekin commented on GitHub (Dec 9, 2025): This still a thing? Happy birthday to the issue I guess

GiteaMirror commented

2026-04-12 13:50:35 -05:00

@FearL0rd commented on GitHub (Dec 9, 2025):

This still a thing?

Happy birthday to the issue I guess

Now all the focus is on cloud

@FearL0rd commented on GitHub (Dec 9, 2025): > This still a thing? > > Happy birthday to the issue I guess Now all the focus is on cloud

GiteaMirror commented

2026-04-12 13:50:36 -05:00

@johnml1135 commented on GitHub (Dec 17, 2025):

I had a similar issue so I spun up my own tooling to adapt poor-fitting GGUF models into ollama by reworking the top layers - https://github.com/johnml1135/ollama-copilot-fixer.

@johnml1135 commented on GitHub (Dec 17, 2025): I had a similar issue so I spun up my own tooling to adapt poor-fitting GGUF models into ollama by reworking the top layers - https://github.com/johnml1135/ollama-copilot-fixer.

GiteaMirror commented

2026-04-12 13:50:36 -05:00

@giorgostheo commented on GitHub (Dec 23, 2025):

ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S
pulling manifest
Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"}
It would be great to run this model in ollama!
We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf
Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish
Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now.

Thanks for all your work.
Even though this is possible it might not be the best idea because if one file breaks or the internet gets cut off, you'll need to redownload the hundreds of GB again. It might be fine if you have good internet but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which

Hey,

With GLM4.7 out, would it be possible that you upload a single gguf for the main config (q4_K_M or whatever it is)? You can add some sort of id like "mono" or "single" to make sure that users are not confused. I know it's not pretty, but it's an easy way to solve the complete lack of compatibility with ollama and allow tons more to use the newest and best models!

Keep up the awesome work.

@giorgostheo commented on GitHub (Dec 23, 2025): > > > > ``` > > > > ollama run hf.co/unsloth/Kimi-K2-Thinking-GGUF:IQ1_S > > > > pulling manifest > > > > Error: pull model manifest: 400: {"error":"The specified repository contains sharded GGUF. Ollama does not support this yet. Follow this issue for more info: https://github.com/ollama/ollama/issues/5245"} > > > > ``` > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It would be great to run this model in ollama! > > > > > > > > > We make smaller versions specifically for Ollama, see: https://huggingface.co/unsloth/MiniMax-M2-GGUF/blob/main/MiniMax-M2-UD-TQ1_0.gguf > > > Usually we do these non-sharded files for any model under 300M parameters or so. But it is very small and 1.77-bit ish > > > > Hi Michael. Since ollama teams seems to really not care about the shared GGUF thing, it would be great if we got more of those "merged" exports for larger quants. For GLM4.6 for example, something like Q3 would be great. I understand that it is stretching it size-wise, but for us ollama users its the only way to go for now. > > > > Thanks for all your work. > > Even though this is possible it might not be the best idea because if one file breaks or the internet gets cut off, you'll need to redownload the hundreds of GB again. It might be fine if you have good internet but around 50% of people have very slow internet :( But we'll see what we can do - it will be confusing for users to navigate which is which Hey, With GLM4.7 out, would it be possible that you upload a single gguf for the main config (q4_K_M or whatever it is)? You can add some sort of id like "mono" or "single" to make sure that users are not confused. I know it's not pretty, but it's an easy way to solve the complete lack of compatibility with ollama and allow tons more to use the newest and best models! Keep up the awesome work.

GiteaMirror commented

2026-04-12 13:50:36 -05:00

@scorpion7slayer commented on GitHub (Jan 30, 2026):

I have the problem with kimi k2.5 will this be added in the future?

@scorpion7slayer commented on GitHub (Jan 30, 2026): I have the problem with kimi k2.5 will this be added in the future?

GiteaMirror commented

2026-04-12 13:50:37 -05:00

@boomam commented on GitHub (Feb 20, 2026):

Its amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p

@boomam commented on GitHub (Feb 20, 2026): Its amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p

GiteaMirror commented

2026-04-12 13:50:38 -05:00

@cvrunmin commented on GitHub (Feb 21, 2026):

Its amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p

While such error messages is more likely provided from Huggingface side and not from ollama itself (same error message can be triggered by trying to access the manifest file of the model in browsers), it is still very amusing that more issues have been marked duplicate of this issue, meaning that some contributors know that this issue exists, but an pull request that claims that can solve this issues is now about four months old with zero comments from any contributors. Neither "this solution looks good!", nor "this solution is not good".

@cvrunmin commented on GitHub (Feb 21, 2026): > Its amusing that they went to the effort of changing the output of the Ollama error to reference this issue. :-p While such error messages is more likely provided from Huggingface side and not from ollama itself (same error message can be triggered by trying to access the manifest file of the model in browsers), it is still very amusing that more issues have been marked duplicate of this issue, meaning that some contributors know that this issue exists, but an pull request that claims that can solve this issues is now about four months old with zero comments from any contributors. Neither "this solution looks good!", nor "this solution is not good".

GiteaMirror commented

2026-04-12 13:50:39 -05:00

@OdinVex commented on GitHub (Feb 21, 2026):

Anyone know of an alternative to Ollama?

@OdinVex commented on GitHub (Feb 21, 2026): Anyone know of an alternative to Ollama?

GiteaMirror commented

2026-04-12 13:50:39 -05:00

@SvenMeyer commented on GitHub (Feb 22, 2026):

@OdinVex I switched to LMstudio which does not have this problem and has a nice GUI as well.
Actually, I would not be surprised if you would find it much better in every respect. Also LMstudio contines to provide a solid basis to run AI models locally which is the whole point of ollama/LMstusio, while ollama now diverted to become just a proxy to online AI models.

@SvenMeyer commented on GitHub (Feb 22, 2026): @OdinVex I switched to LMstudio which does not have this problem and has a nice GUI as well. Actually, I would not be surprised if you would find it much better in every respect. Also LMstudio contines to provide a solid basis to run AI models locally which is the whole point of ollama/LMstusio, while ollama now diverted to become just a proxy to online AI models.

GiteaMirror commented

2026-04-12 13:50:40 -05:00

@OdinVex commented on GitHub (Feb 22, 2026):

@OdinVex I switched to LMstudio which does not have this problem and has a nice GUI as well. Actually, I would not be surprised if you would find it much better in every respect. Also LMstudio contines to provide a solid basis to run AI models locally which is the whole point of ollama/LMstusio, while ollama now diverted to become just a proxy to online AI models.

Doesn't appear at all to be an alternative, though. Ollama's use at the moment is container-supported for network-based interactions.

@OdinVex commented on GitHub (Feb 22, 2026): > [@OdinVex](https://github.com/OdinVex) I switched to LMstudio which does not have this problem and has a nice GUI as well. Actually, I would not be surprised if you would find it much better in every respect. Also LMstudio contines to provide a solid basis to run AI models locally which is the whole point of ollama/LMstusio, while ollama now diverted to become just a proxy to online AI models. Doesn't appear at all to be an alternative, though. Ollama's use at the moment is container-supported for network-based interactions.

GiteaMirror commented

2026-04-12 13:50:40 -05:00

@elkay commented on GitHub (Mar 2, 2026):

How is this an issue still 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you manually download them, but it's really unclear how you would then manually add that file into ollama manually. Makes no sense why the Ollama team is dragging their feet on just supporting the combine in the internal download process itself.

@elkay commented on GitHub (Mar 2, 2026): How is this an issue still 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you manually download them, but it's really unclear how you would then manually add that file into ollama manually. Makes no sense why the Ollama team is dragging their feet on just supporting the combine in the internal download process itself.

GiteaMirror commented

2026-04-12 13:50:41 -05:00

@FearL0rd commented on GitHub (Mar 3, 2026):

How is this an issue still 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you manually download them, but it's really unclear how you would then manually add that file into ollama manually. Makes no sense why the Ollama team is dragging their feet on just supporting the combine in the internal download process itself.

looks like the focus today is Ollama Cloud

@FearL0rd commented on GitHub (Mar 3, 2026): > How is this an issue still 2 years later? Isn't it as simple as combining the split files and using the single file after download? I know you can use llama.cpp to combine the files if you manually download them, but it's really unclear how you would then manually add that file into ollama manually. Makes no sense why the Ollama team is dragging their feet on just supporting the combine in the internal download process itself. looks like the focus today is Ollama Cloud

GiteaMirror commented

2026-04-12 13:50:41 -05:00

@alexanderjacuna commented on GitHub (Mar 10, 2026):

Running into this issue as well with: hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q8_K_XL

This bug has been open since June of 2024, but lets put in a error message that references this issue with no traction after 2 years.

@alexanderjacuna commented on GitHub (Mar 10, 2026): Running into this issue as well with: hf.co/unsloth/Qwen3-Coder-Next-GGUF:UD-Q8_K_XL This bug has been open since June of 2024, but lets put in a error message that references this issue with no traction after 2 years.

GiteaMirror commented

2026-04-12 13:50:43 -05:00

@SvenMeyer commented on GitHub (Mar 10, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

@SvenMeyer commented on GitHub (Mar 10, 2026): I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

GiteaMirror commented

2026-04-12 13:50:44 -05:00

@OdinVex commented on GitHub (Mar 10, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.

@OdinVex commented on GitHub (Mar 10, 2026): > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.

GiteaMirror commented

2026-04-12 13:50:44 -05:00

@alexanderjacuna commented on GitHub (Mar 10, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

My setup doesn't allow for this unfortunately.

@alexanderjacuna commented on GitHub (Mar 10, 2026): > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. My setup doesn't allow for this unfortunately.

GiteaMirror commented

2026-04-12 13:50:44 -05:00

@SvenMeyer commented on GitHub (Mar 11, 2026):

@OdinVex @alexanderjacuna what software is so tightly coupled to ollama that you can not replace it with another inference service? At the end it should be just an IP and port and even that you could set the same way.
Also, if you prefer CLI and do not need the GUI, just use llama.cpp

@SvenMeyer commented on GitHub (Mar 11, 2026): @OdinVex @alexanderjacuna what software is so tightly coupled to ollama that you can not replace it with another inference service? At the end it should be just an IP and port and even that you could set the same way. Also, if you prefer CLI and do not need the GUI, just use llama.cpp

GiteaMirror commented

2026-04-12 13:50:44 -05:00

@OdinVex commented on GitHub (Mar 11, 2026):

@OdinVex @alexanderjacuna what software is so tightly coupled to ollama that you can not replace it with another inference service? At the end it should be just an IP and port and even that you could set the same way. Also, if you prefer CLI and do not need the GUI, just use llama.cpp

Several, but most commonly Open-WebUI.

@OdinVex commented on GitHub (Mar 11, 2026): > [@OdinVex](https://github.com/OdinVex) [@alexanderjacuna](https://github.com/alexanderjacuna) what software is so tightly coupled to ollama that you can not replace it with another inference service? At the end it should be just an IP and port and even that you could set the same way. Also, if you prefer CLI and do not need the GUI, just use llama.cpp Several, but most commonly Open-WebUI.

GiteaMirror commented

2026-04-12 13:50:45 -05:00

@FearL0rd commented on GitHub (Mar 14, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.

I have a drop-in Solution.
I've built a solution called Ovllm. it's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm
Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.

@FearL0rd commented on GitHub (Mar 14, 2026): > > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. > > Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama. I have a drop-in Solution. I've built a solution called Ovllm. it's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.

GiteaMirror commented

2026-04-12 13:50:45 -05:00

@OdinVex commented on GitHub (Mar 14, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.

I have a drop-in Solution. I've built a solution called Ovllm. it's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.

If it doesn't have complete feature-parity and speak Ollama so other software would integrate then it's not a drop-in solution. Considering the README has enough spelling/grammar issues I'm gravely concerned it's AI, or at the very least unpolished. Good luck with your project, but it's not a drop-in solution.

Edit: Considering Ollama uses llama and llama supports shards I'd wager it'd be better to probably PR it (at least for now, until Ollama is forked by someone that cares about shard-support and more).

@OdinVex commented on GitHub (Mar 14, 2026): > > > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. > > > > > > Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama. > > I have a drop-in Solution. I've built a solution called Ovllm. it's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM. If it doesn't have complete feature-parity and speak Ollama so other software would integrate then it's not a drop-in solution. Considering the README has enough spelling/grammar issues I'm gravely concerned it's AI, or at the very least unpolished. Good luck with your project, but it's not a drop-in solution. Edit: Considering Ollama uses llama and llama supports shards I'd wager it'd be better to probably PR it (at least for now, until Ollama is forked by someone that cares about shard-support and more).

GiteaMirror commented

2026-04-12 13:50:46 -05:00

@FearL0rd commented on GitHub (Mar 14, 2026):

I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio.

Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama.

I have a drop-in Solution. I've built a solution called Ovllm. it's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM.

If it doesn't have complete feature-parity and speak Ollama so other software would integrate then it's not a drop-in solution. Considering the README has enough spelling/grammar issues I'm gravely concerned it's AI, or at the very least unpolished. Good luck with your project, but it's not a drop-in solution.

Edit: Considering Ollama uses llama and llama supports shards I'd wager it'd be better to probably PR it (at least for now, until Ollama is forked by someone that cares about shard-support and more).

Thx. This is the first release, and it will become more mature over time. It works for my needs with OpenWebUI and custom apps. It also merges the .gguf (it works with safetensors also, only pass the hf location ie. google/gemma-7b-it)

@FearL0rd commented on GitHub (Mar 14, 2026): > > > > I found a solution and it is pretty easy, also adds at lot of other features and usability at the same time : use LMstudio. > > > > > > > > > Not a viable solution to those needing a drop-in replacement for software that specifically depends upon Ollama. > > > > > > I have a drop-in Solution. I've built a solution called Ovllm. it's essentially an Ollama-style wrapper, but for vLLM instead of llama.cpp. It's still a work in progress, but the core downloading feature is live. Instead of pulling from a custom registry, it downloads models directly from Hugging Face. Just make sure to set your HF_TOKEN environment variable with your API key. Check it out: https://github.com/FearL0rd/Ovllm Ovllm is an Ollama-inspired wrapper designed to simplify working with vLLM. > > If it doesn't have complete feature-parity and speak Ollama so other software would integrate then it's not a drop-in solution. Considering the README has enough spelling/grammar issues I'm gravely concerned it's AI, or at the very least unpolished. Good luck with your project, but it's not a drop-in solution. > > Edit: Considering Ollama uses llama and llama supports shards I'd wager it'd be better to probably PR it (at least for now, until Ollama is forked by someone that cares about shard-support and more). Thx. This is the first release, and it will become more mature over time. It works for my needs with OpenWebUI and custom apps. It also merges the .gguf (it works with safetensors also, only pass the hf location ie. google/gemma-7b-it)

GiteaMirror commented

2026-04-12 13:50:47 -05:00

@Mikec78660 commented on GitHub (Mar 20, 2026):

llama.cpp in router mode should works exactly like ollama now. and can use multi part gguf files.

@Mikec78660 commented on GitHub (Mar 20, 2026): llama.cpp in router mode should works exactly like ollama now. and can use multi part gguf files.

GiteaMirror commented

2026-04-12 13:50:48 -05:00

@OdinVex commented on GitHub (Mar 20, 2026):

llama.cpp in router mode should works exactly like ollama now. and can use multi part gguf files.

So software like Open-WebUI can (without any changes except IP address and port) speak to it as if it were Ollama, even with the Ollama-specific code? And there's an official container for it as well? Not seeing it at all, so... Edit: See my lower post about how this went (failure, does not at all work like Ollama).

@OdinVex commented on GitHub (Mar 20, 2026): > llama.cpp in router mode should works exactly like ollama now. and can use multi part gguf files. So software like Open-WebUI can (without any changes except IP address and port) speak to it as if it were Ollama, even with the Ollama-specific code? And there's an official container for it as well? Not seeing it at all, so... Edit: See my lower post about how this went (failure, does not at all work like Ollama).

GiteaMirror commented

2026-04-12 13:50:49 -05:00

@Mikec78660 commented on GitHub (Mar 23, 2026):

@OdinVex yes.
A very minimal implementation is:
llama-server --host [0.0.0.0, or hostname] --port 8080 --models-dir /mnt/AI
if you do this and create a connection in openwebui to [ip or dns name]:8080/v1, it will give you any model in the /mnt/AI directory as an option in openwebui.

Even better is using a config.ini file where you can set the setting for each model
llama-server --host 0.0.0.0 --port 8080 --models-dir /mnt/AI --models-preset config.ini
This will allow you to set a custom ctx size, kv cache setting, etc.

@Mikec78660 commented on GitHub (Mar 23, 2026): @OdinVex yes. A very minimal implementation is: `llama-server --host [0.0.0.0, or hostname] --port 8080 --models-dir /mnt/AI` if you do this and create a connection in openwebui to [ip or dns name]:8080/v1, it will give you any model in the /mnt/AI directory as an option in openwebui. Even better is using a config.ini file where you can set the setting for each model `llama-server --host 0.0.0.0 --port 8080 --models-dir /mnt/AI --models-preset config.ini` This will allow you to set a custom ctx size, kv cache setting, etc.

GiteaMirror commented

2026-04-12 13:50:49 -05:00

@OdinVex commented on GitHub (Mar 24, 2026):

@OdinVex yes. A very minimal implementation is: llama-server --host [0.0.0.0, or hostname] --port 8080 --models-dir /mnt/AI if you do this and create a connection in openwebui to [ip or dns name]:8080/v1, it will give you any model in the /mnt/AI directory as an option in openwebui.

Even better is using a config.ini file where you can set the setting for each model llama-server --host 0.0.0.0 --port 8080 --models-dir /mnt/AI --models-preset config.ini This will allow you to set a custom ctx size, kv cache setting, etc.

Edit: I see, you meant an OpenAI endpoint, not an Ollama endpoint. Will try and report back if it works.
Edit: It does not work, unfortunately. Trying to download just results in the API reporting 404.

@OdinVex commented on GitHub (Mar 24, 2026): > [@OdinVex](https://github.com/OdinVex) yes. A very minimal implementation is: `llama-server --host [0.0.0.0, or hostname] --port 8080 --models-dir /mnt/AI` if you do this and create a connection in openwebui to [ip or dns name]:8080/v1, it will give you any model in the /mnt/AI directory as an option in openwebui. > > Even better is using a config.ini file where you can set the setting for each model `llama-server --host 0.0.0.0 --port 8080 --models-dir /mnt/AI --models-preset config.ini` This will allow you to set a custom ctx size, kv cache setting, etc. Edit: I see, you meant an OpenAI endpoint, not an *Ollama* endpoint. Will try and report back if it works. Edit: It does *not* work, unfortunately. Trying to download just results in the API reporting 404.

GiteaMirror commented

2026-04-12 13:50:51 -05:00

@raro42 commented on GitHub (Mar 24, 2026):

.

@raro42 commented on GitHub (Mar 24, 2026): .

GiteaMirror commented

2026-04-12 13:50:52 -05:00

@CleyFaye commented on GitHub (Mar 25, 2026):

We already know about the issue, and the general outline of what should be done.

I don't see the value of regurgitating the existing discussion, especially to end on the suggestion to "do what was proposed, then tests and docs".

@CleyFaye commented on GitHub (Mar 25, 2026): We already know about the issue, and the general outline of what should be done. I don't see the value of regurgitating the existing discussion, especially to end on the suggestion to "do what was proposed, then tests and docs".

GiteaMirror commented

2026-04-12 13:50:54 -05:00

@raro42 commented on GitHub (Mar 25, 2026):

@CleyFaye sorry for disturbing. Will edit the comment. Thanks for commenting.

@raro42 commented on GitHub (Mar 25, 2026): @CleyFaye sorry for disturbing. Will edit the comment. Thanks for commenting.

GiteaMirror referenced this issue

2026-04-12 23:21:26 -05:00

[PR #3282] [MERGED] Add docs for GPU selection and nvidia uvm workaround #11107

GiteaMirror referenced this issue

2026-04-16 05:27:30 -05:00

[PR #3282] [MERGED] Add docs for GPU selection and nvidia uvm workaround #16378

GiteaMirror referenced this issue

2026-04-19 15:45:59 -05:00

[PR #3282] [MERGED] Add docs for GPU selection and nvidia uvm workaround #21647

GiteaMirror referenced this issue

2026-04-22 21:38:14 -05:00

[PR #3282] [MERGED] Add docs for GPU selection and nvidia uvm workaround #36980

GiteaMirror referenced this issue

2026-04-24 22:07:29 -05:00

[PR #3282] [MERGED] Add docs for GPU selection and nvidia uvm workaround #42355

GiteaMirror referenced this issue

2026-04-29 12:32:09 -05:00

[PR #3282] [MERGED] Add docs for GPU selection and nvidia uvm workaround #57804

GiteaMirror referenced this issue

2026-05-05 05:13:30 -05:00

[PR #3282] [MERGED] Add docs for GPU selection and nvidia uvm workaround #73401

Sign in to join this conversation.

Branches Tags

main

dhiltgen/ci

hoyyeva/editor-config-repair

parth-launch-codex-app

hoyyeva/fix-codex-model-metadata-warning

hoyyeva/qwen

hoyyeva/launch-backup-ux

parth/hide-claude-desktop-till-release

hoyyeva/opencode-image-modality

parth-mlx-decode-checkpoints

parth-add-claude-code-autoinstall

release_v0.22.0

pdevine/manifest-list

codex/fix-codex-model-metadata-warning

pdevine/addressable-manifest

brucemacd/launch-fetch-reccomended

jmorganca/llama-compat

launch-copilot-cli

hoyyeva/opencode-thinking

release_v0.20.7

parth-auto-save-backup

parth-test

jmorganca/gemma4-audio-replacements

fix-manifest-digest-on-pull

hoyyeva/vscode-improve

brucemacd/install-server-wait

brucemacd/download-before-remove

parth/update-claude-docs

parth-anthropic-reference-images-path

brucemac/start-ap-install

pdevine/mlx-update

pdevine/qwen35_vision

drifkin/api-show-fallback

mintlify/image-generation-1773352582

hoyyeva/server-context-length-local-config

jmorganca/faster-reptition-penalties

jmorganca/convert-nemotron

parth-pi-thinking

pdevine/sampling-penalties

jmorganca/fix-create-quantization-memory

dongchen/resumable_transfer_fix

pdevine/sampling-cache-error

jessegross/mlx-usage

hoyyeva/openclaw-config

hoyyeva/app-html

pdevine/qwen3next

brucemacd/sign-sh-install

brucemacd/tui-update

brucemacd/usage-api

jmorganca/launch-empty

fix-app-dist-embed

mxyng/mlx-compile

mxyng/mlx-quant

mxyng/mlx-glm4.7

mxyng/mlx

brucemacd/simplify-model-picker

jmorganca/qwen3-concurrent

fix-glm-4.7-flash-mla-config

drifkin/qwen3-coder-opening-tag

brucemacd/usage-cli

fix-cuda12-fattn-shmem

ollama-imagegen-docs

parth/fix-multiline-inputs

brucemacd/config-docs

mxyng/model-files

mxyng/simple-execute

fix-imagegen-ollama-models

mxyng/async-upload

jmorganca/lazy-no-dtype-changes

imagegen-auto-detect-create

parth/decrease-concurrent-download-hf

fix-mlx-quantize-init

jmorganca/x-cleanup

usage

imagegen-readme

jmorganca/glm-image

mlx-gpu-cd

jmorganca/imagegen-modelfile

parth/agent-skills

parth/agent-allowlist

parth/signed-in-offline

parth/agents

parth/fix-context-chopping

improve-cloud-flow

parth/add-models-websearch

parth/prompt-renderer-mcp

jmorganca/native-settings

jmorganca/download-stream-hash

jmorganca/client2-rebased

brucemacd/oai-chat-req-multipart

jessegross/multi_chunk_reserve

grace/additional-omit-empty

grace/mistral-3-large

mxyng/tokenizer2

mxyng/tokenizer

jessegross/flash

hoyyeva/windows-nacked-app

mxyng/cleanup-attention

grace/deepseek-parser

hoyyeva/remember-unsent-prompt

parth/add-lfs-pointer-error-conversion

parth/olmo2-test2

hoyyeva/ollama-launchagent-plist

nicole/olmo-model

parth/olmo-test

mxyng/remove-embedded

parth/render-template

jmorganca/intellect-3

parth/remove-prealloc-linter

jmorganca/cmd-eval

nicole/nomic-embed-text-fix

mxyng/lint-2

hoyyeva/add-gemini-3-pro-preview

hoyyeva/load-model-list

mxyng/expand-path

mxyng/environ-2

hoyyeva/deeplink-json-encoding

parth/improve-tool-calling-tests

hoyyeva/conversation

hoyyeva/assistant-edit-response

hoyyeva/thinking

origin/brucemacd/invalid-char-i-err

parth/improve-tool-calling

jmorganca/required-omitempty

grace/qwen3-vl-tests

mxyng/iter-client

parth/docs-readme

nicole/embed-test

pdevine/integration-benchstat

parth/remove-generate-cmd

parth/add-toolcall-id

mxyng/server-tests

jmorganca/glm-4.6

jmorganca/gin-h-compat

drifkin/stable-tool-args

pdevine/qwen3-more-thinking

parth/add-websearch-client

nicole/websearch_local

jmorganca/qwen3-coder-updates

grace/deepseek-v3-migration-tests

mxyng/fix-create

jmorganca/cloud-errors

pdevine/parser-tidy

revert-12233-parth/simplify-entrypoints-runner

parth/enable-so-gpt-oss

brucemacd/qwen3vl

jmorganca/readme-simplify

parth/gpt-oss-structured-outputs

revert-12039-jmorganca/tools-braces

mxyng/embeddings

mxyng/gguf

mxyng/benchmark

mxyng/types-null

parth/move-parsing

mxyng/gemma2

jmorganca/docs

mxyng/16-bit

mxyng/create-stdin

pdevine/authorizedkeys

mxyng/quant

parth/opt-in-error-context-window

brucemacd/cache-models

brucemacd/runner-completion

jmorganca/llama-update-6

brucemacd/benchmark-list

brucemacd/partial-read-caps

parth/deepseek-r1-tools

mxyng/omit-array

parth/tool-prefix-temp

brucemacd/runner-test

jmorganca/qwen25vl

brucemacd/model-forward-test-ext

parth/python-function-parsing

jmorganca/cuda-compression-none

drifkin/num-parallel

drifkin/chat-truncation-fix

jmorganca/sync

parth/python-tools-calling

drifkin/array-head-count

brucemacd/create-no-loop

parth/server-enable-content-stream-with-tools

qwen25omni

mxyng/v3

brucemacd/ropeconfig

jmorganca/silence-tokenizer

parth/sample-so-test

parth/sampling-structured-outputs

brucemacd/doc-go-engine

parth/constrained-sampling-json

jmorganca/mistral-wip

brucemacd/mistral-small-convert

parth/sample-unmarshal-json-for-params

brucemacd/jomorganca/mistral

pdevine/bfloat16

jmorganca/mistral

brucemacd/mistral

pdevine/logging

parth/sample-correctness-fix

parth/sample-fix-sorting

jmorgan/sample-fix-sorting-extras

jmorganca/temp-0-images

brucemacd/parallel-embed-models

brucemacd/shim-grammar

jmorganca/fix-gguf-error

bmizerany/nameswork

jmorganca/faster-releases

bmizerany/validatenames

brucemacd/err-no-vocab

brucemacd/rope-config

brucemacd/err-hint

brucemacd/qwen2_5

brucemacd/logprobs

brucemacd/new_runner_graph_bench

progress-flicker

brucemacd/forward-test

brucemacd/go_qwen2

pdevine/gemma2

jmorganca/add-missing-symlink-eval

mxyng/next-debug

parth/set-context-size-openai

brucemacd/next-bpe-bench

brucemacd/next-bpe-test

brucemacd/new_runner_e2e

brucemacd/new_runner_qwen2

pdevine/convert-cohere2

brucemacd/convert-cli

parth/log-probs

mxyng/next-mlx

mxyng/cmd-history

parth/templating

parth/tokenize-detokenize

brucemacd/check-key-register

bmizerany/grammar

jmorganca/vendor-081b29bd

mxyng/func-checks

jmorganca/fix-null-format

parth/fix-default-to-warn-json

jmorganca/qwen2vl

jmorganca/no-concat

parth/cmd-cleanup-SO

brucemacd/check-key-register-structured-err

parth/openai-stream-usage

parth/fix-referencing-so

stream-tools-stop

jmorganca/degin-1

brucemacd/install-path-clean

brucemacd/push-name-validation

brucemacd/browser-key-register

jmorganca/openai-fix-first-message

jmorganca/fix-proxy

jessegross/sample

parth/disallow-streaming-tools

dhiltgen/remove_submodule

jmorganca/ga

jmorganca/mllama

pdevine/newlines

pdevine/geems-2b

jmorganca/llama-bump

mxyng/modelname-7

mxyng/gin-slog

mxyng/modelname-6

jyan/convert-prog

jyan/quant5

paligemma-support

pdevine/import-docs

jmorganca/openai-context

jyan/paligemma

jyan/p2

jyan/palitest

bmizerany/embedspeedup

jmorganca/llama-vit

brucemacd/allow-ollama

royh/ep-methods

royh/whisper

mxyng/api-models

mxyng/fix-memory

jyan/q4_4/8

jyan/ollama-v

royh/stream-tools

roy-embed-parallel

bmizerany/hrm

revert-5963-revert-5924-mxyng/llama3.1-rope

royh/embed-viz

jyan/local2

jyan/auth

jyan/local

jyan/parse-temp

jmorganca/template-mistral

jyan/reord-g

royh-openai-suffixdocs

royh-imgembed

royh-embed-parallel

jyan/quant4

royh-precision

jyan/progress

pdevine/fix-template

jyan/quant3

pdevine/ggla

mxyng/update-registry-domain

jmorganca/ggml-static

mxyng/create-context

jyan/v0.146

mxyng/layers-from-files

build_dist

bmizerany/noseek

royh-ls

royh-name

timeout

mxyng/server-timestamp

bmizerany/nosillyggufslurps

royh-params

jmorganca/llama-cpp-7c26775

royh-openai-delete

royh-show-rigid

jmorganca/enable-fa

jmorganca/no-error-template

jyan/format

royh-testdelete

bmizerany/fastverify

language_support

pdevine/ps-glitches

brucemacd/tokenize

bruce/iq-quants

bmizerany/filepathwithcoloninhost

mxyng/split-bin

bmizerany/client-registry

jmorganca/if-none-match

native

jmorganca/native

jmorganca/batch-embeddings

jmorganca/initcmake

jmorganca/mm

pdevine/showggmlinfo

modenameenforcealphanum

bmizerany/modenameenforcealphanum

jmorganca/done-reason

jmorganca/llama-cpp-8960fe8

ollama.com

bmizerany/filepathnobuild

bmizerany/types/model/defaultfix

rmdisplaylong

nogogen

bmizerany/x

modelfile-readme

bmizerany/replacecolon

jmorganca/limit

jmorganca/execstack

jmorganca/replace-assets

mxyng/tune-concurrency

jmorganca/testing

whitespace-detection

jmorganca/options

upgrade-all

scratch

cuda-search

mattw/airenamer

mattw/allmodelsonhuggingface

mattw/quantcontext

mattw/whatneedstorun

brucemacd/llama-mem-calc

mattw/faq-context

mattw/communitylinks

mattw/noprune

mattw/python-functioncalling

rename

mxyng/install

pulse

remove-first

editor

mattw/selfqueryingretrieval

cgo

mattw/howtoquant

api

matt/streamingapi

format-config

mxyng/extra-args

shell

update-nous-hermes

cp-model

upload-progress

fix-unknown-model

fix-model-names

delete-fix

insecure-registry

ls

deletemodels

progressbar

readme-updates

license-layers

skip-list

list-models

modelpath

matt/examplemodelfiles

distribution

go-opts

1 Participants

Notifications

Due Date

No due date set.

Dependencies

No dependencies set.

Reference: github-starred/ollama#3282