[GH-ISSUE #14854] WSL2 + Intel Arc 140T: Vulkan runner hangs after completion request on /dev/dxg while llama.cpp works #56094

Open
opened 2026-04-29 10:15:23 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @oldeucryptoboi on GitHub (Mar 14, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14854

What is the issue?

On WSL2 (Ubuntu 24.04.4, kernel 6.6.87.2-microsoft-standard-WSL2) with an Intel Arc 140T iGPU exposed through /dev/dxg, Ollama 0.18.0 detects the GPU, fully loads Qwen3.5-9B-Q8_0.gguf on Vulkan, and then hangs after completion request before generating the first token.

This does not look like a general machine/configuration issue because:

  • vulkaninfo sees Microsoft Direct3D12 (Intel(R) Arc(TM) 140T GPU (16GB)) via Dozen
  • clinfo sees the GPU
  • the same model works on the same machine with official llama.cpp b8348 over Vulkan (--device Vulkan0 --gpu-layers all)
  • while Ollama is hung, the WSL kernel logs /dev/dxg sync-object failures

Possibly related to #13097 and #13621, but this repro is specific to WSL2 + Dozen + /dev/dxg, and with the new engine it hangs rather than immediately panicking.

How to reproduce

  1. Create a model from the public GGUF unsloth/Qwen3.5-9B-GGUF / Qwen3.5-9B-Q8_0.gguf
  2. Start Ollama with Vulkan forced:
export OLLAMA_HOST=127.0.0.1:11443
export OLLAMA_LLM_LIBRARY=vulkan
export OLLAMA_VULKAN=1
export OLLAMA_NEW_ENGINE=true
export OLLAMA_DEBUG=1
/usr/local/bin/ollama serve
  3. Run:
OLLAMA_HOST=127.0.0.1:11443 ollama run qwen35-9b-q8-local "Reply with exactly OK."
  4. The model loads on Vulkan0, then hangs before the first token.

I also reproduced with /api/generate and with OLLAMA_FLASH_ATTENTION=false, so this does not look specific to the CLI or to flash attention.
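
For completeness, an equivalent /api/generate repro as a small Go client. The host, model tag, and prompt match the steps above; model, prompt, and stream are standard /api/generate request fields. On this stack the POST stalls with no response instead of returning a completion.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "net/http"
)

func main() {
    // Same host and model tag as in the repro steps above.
    body, _ := json.Marshal(map[string]any{
        "model":  "qwen35-9b-q8-local",
        "prompt": "Reply with exactly OK.",
        "stream": false,
    })

    // On the failing stack this request never produces a token; it stalls
    // until the client or server times out.
    resp, err := http.Post("http://127.0.0.1:11443/api/generate", "application/json", bytes.NewReader(body))
    if err != nil {
        fmt.Println("request error:", err)
        return
    }
    defer resp.Body.Close()

    out, _ := io.ReadAll(resp.Body)
    fmt.Println(string(out))
}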

Relevant log output

time=2026-03-14T18:35:46.347-04:00 level=INFO source=runner.go:1284 msg=load request="{Operation:fit ... FlashAttention:Disabled ... GPULayers:33[...] }"
ggml_vulkan: 0 = Microsoft Direct3D12 (Intel(R) Arc(TM) 140T GPU (16GB)) (Dozen)
time=2026-03-14T18:35:53.243-04:00 level=INFO source=ggml.go:494 msg="offloaded 33/33 layers to GPU"
time=2026-03-14T18:36:40.132-04:00 level=INFO source=server.go:1388 msg="llama runner started in 55.60 seconds"
time=2026-03-14T18:36:40.169-04:00 level=DEBUG source=server.go:1536 msg="completion request" images=0 prompt=22 format=""
time=2026-03-14T18:36:40.198-04:00 level=DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5

At that point the request stalls and no first token is produced.

The WSL kernel logs at the same time show the /dev/dxg sync path failing:

misc dxg: dxgk: dxgvmb_send_sync_msg: wait_for_completion failed: fffffe00
misc dxg: dxgk: dxgkio_wait_sync_object_cpu: Ioctl failed: -512
misc dxg: dxgk: process_completion_packet: did not find packet to complete

For completeness: with the old engine on the same machine, I also hit the older crash path instead of the hang:

llama_sampler_dist_apply(...): Assertion 'found' failed.
SIGABRT: abort

Control case

llama.cpp b8348 on the same WSL2 machine, same GPU, same GGUF, works over Vulkan and generates text successfully. So the GPU/Vulkan path is functional enough for inference outside Ollama; the failure seems specific to Ollama's Vulkan runner on this stack.

OS

Linux (WSL2 Ubuntu 24.04.4)

GPU

Intel Arc 140T iGPU via Dozen (Mesa 25.3.6) / /dev/dxg

CPU

Intel Core Ultra 7 255H

Ollama version

0.18.0

GiteaMirror added the intel, vulkan labels 2026-04-29 10:15:23 -05:00
Author
Owner

@oldeucryptoboi commented on GitHub (Mar 14, 2026):

I dug into the v0.18.0 source locally and found two code-level problems that look like plausible root causes for this WSL2/Vulkan hang.

  1. ComputeWithNotify violates its own contract

ml/backend.go says:

ComputeWithNotify(func(), ...Tensor) // notify callback once compute has begun

But ml/backend/ggml/ggml.go currently does:

func (c *Context) ComputeWithNotify(cb func(), tensors ...ml.Tensor) {
    c.b.schedMu.Lock()
    defer c.b.schedMu.Unlock()
    if cb != nil {
        go cb()
    }

    if status := C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph); status != C.GGML_STATUS_SUCCESS {
        panic(...)
    }
    C.ggml_backend_sched_reset(c.b.sched)
    ...
}

So the callback fires before ggml_backend_sched_graph_compute_async() is even called.

That matters because runner/ollamarunner/runner.go uses that callback to let forwardBatch() start setting up the next batch:

// If we have a pending batch still processing, wait until Compute has started
if pendingBatch.ctx != nil {
    <-pendingBatch.computeStartedCh
    nextBatch.inputsReadyCh = pendingBatch.outputsReadyCh
}

In other words, the new engine thinks the previous compute has started when in fact it may not even have been submitted yet.
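
For concreteness, the reordering I have in mind (and that I later patched locally, see below) only fires the notification after the graph has actually been submitted; a sketch that keeps the elisions from the snippet above:

func (c *Context) ComputeWithNotify(cb func(), tensors ...ml.Tensor) {
    c.b.schedMu.Lock()
    defer c.b.schedMu.Unlock()

    // Submit the graph first...
    if status := C.ggml_backend_sched_graph_compute_async(c.b.sched, c.graph); status != C.GGML_STATUS_SUCCESS {
        panic(...)
    }

    // ...and only then notify, so computeStartedCh really means
    // "the previous compute has been submitted".
    if cb != nil {
        go cb()
    }

    C.ggml_backend_sched_reset(c.b.sched)
    ...
}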

  2. The new runner enables async pipelining without checking backend async capability

In runner/ollamarunner/runner.go:

supportsAsync := pooling.Type(s.model.Backend().Config().Uint("pooling_type")) == pooling.TypeNone

So async execution is enabled for all non-pooling models, regardless of backend capability.

But Ollama's vendored Vulkan backend currently advertises the opposite in ml/backend/ggml/ggml/src/ggml-vulkan/ggml-vulkan.cpp:

props->caps = {
    /* .async                 = */ false,
    /* .host_buffer           = */ true,
    /* .buffer_from_host_ptr  = */ false,
    /* .events                = */ false,
};

So the runner is pipelining batches as if the backend were async-safe, while the backend explicitly says it is not.
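
A sketch of the guard I have in mind for this second point; backendSupportsAsync is a hypothetical stand-in (not an existing helper) for however the backend's async capability would be surfaced to the runner:

// Hypothetical: gate pipelining on the backend caps as well as the model type.
supportsAsync := pooling.Type(s.model.Backend().Config().Uint("pooling_type")) == pooling.TypeNone &&
    backendSupportsAsync(s.model.Backend())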

Why I think this is relevant to this issue specifically:

  • the hang happens after completion request, before first token
  • llama.cpp on the same machine/model works
  • the WSL kernel logs show /dev/dxg fence/sync failures while Ollama is stuck
  • the old engine hits a different failure family (llama_sampler_dist_apply(...): Assertion 'found' failed), which suggests the new engine's batching/scheduling path is part of the problem

My best current hypothesis is:

  • the new engine is overlapping or advancing work too early on the Vulkan path
  • WSL2/Dozen is more sensitive to this than the plain llama.cpp CLI path
  • the immediate thing to try is:
    • move the cb() call in ComputeWithNotify to after successful ggml_backend_sched_graph_compute_async()
    • disable runner-side async pipelining when the selected backend reports async = false

I patched the first change locally in a source clone, but I did not get far enough to run a full patched Vulkan repro because my local build tree was not packaging the runtime libraries the same way as the installed binary.

Author
Owner

@oldeucryptoboi commented on GitHub (Mar 14, 2026):

Follow-up after controlled A/B testing on the same machine.

The earlier ComputeWithNotify callback-order theory has not been isolated as the cause. I re-ran the tests against the unmodified stock 0.18.0 binary and found a narrower reproducer:

  • Same machine: WSL2 + Intel Arc 140T + Dozen Vulkan + /dev/dxg
  • Same model: qwen35-9b-q8-local
  • Same binary: /usr/local/bin/ollama 0.18.0
  • Same env except for flash-attention handling:
    • OLLAMA_LLM_LIBRARY=vulkan
    • OLLAMA_VULKAN=1
    • OLLAMA_NEW_ENGINE=true

Results:

  1. OLLAMA_FLASH_ATTENTION=false exported:

    • model loads on 100% GPU
    • request completes successfully
    • ollama run qwen35-9b-q8-local "Reply with exactly OK." returns OK
  2. OLLAMA_FLASH_ATTENTION unset:

    • Ollama logs enabling flash attention
    • load request shows FlashAttention:Enabled
    • model still fully loads on GPU
    • runner reaches completion request and loading cache slot
    • then it stalls until client timeout, with no token produced

Relevant log evidence from the failing run:

INFO source=server.go:246 msg="enabling flash attention"
INFO source=runner.go:1284 msg=load request="{... FlashAttention:Enabled KvSize:4096 ... GPULayers:33 ...}"
INFO source=server.go:1388 msg="llama runner started in 95.81 seconds"
DEBUG source=server.go:1536 msg="completion request" images=0 prompt=22 format=""
DEBUG source=cache.go:151 msg="loading cache slot" id=0 cache=0 prompt=5 used=0 remaining=5

At the same time, dmesg shows the same WSL dxg sync-object failure pattern:

misc dxg: dxgk: dxgvmb_send_sync_msg: wait_for_completion failed: fffffe00
misc dxg: dxgk: dxgkio_wait_sync_object_cpu: Ioctl failed: -512
misc dxg: dxgk: process_completion_packet: did not find packet to complete

This lines up with the source too:

  • qwen35 defaults flash attention on when not explicitly overridden:
    • fs/ggml/ggml.go FlashAttention() returns true for qwen35
  • the engine enables it automatically when the env var is unset:
    • llm/server.go
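
A rough paraphrase of that decision path as I read it (not the literal source; modelDefaultsFlashAttention is a hypothetical stand-in for the fs/ggml/ggml.go check):

// Paraphrase, not the actual llm/server.go code: when OLLAMA_FLASH_ATTENTION
// is unset, the model's own default wins, and qwen35 defaults it to true.
useFA := modelDefaultsFlashAttention(arch) // true for "qwen35"
if v, ok := os.LookupEnv("OLLAMA_FLASH_ATTENTION"); ok {
    useFA = v == "1" || v == "true"
}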

So the narrowed conclusion is:

  • General Ollama Vulkan on this machine is not broken
  • qwen35 on the WSL2/Dozen flash-attention path is broken
  • forcing OLLAMA_FLASH_ATTENTION=false is a working local workaround here

If useful, I can test whether this is specific to qwen35 or any Vulkan run that uses FA on this WSL/Dozen stack.

Reference: github-starred/ollama#56094