[GH-ISSUE #11135] Qwen3 - Tool Call hallucination #33104

Closed
opened 2026-04-22 15:23:44 -05:00 by GiteaMirror · 7 comments

Originally created by @Notbici on GitHub (Jun 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11135

Originally assigned to: @jmorganca on GitHub.

What is the issue?

Hello..

After installing 0.9.2, my Qwen3-32b-8bit bot has been hallucinating tool calls with the Native tool call setting in OpenWebUI.

I can see the <think> breakdown mention the correct tool, but ever since upgrading from 0.9 it hallucinates the wrong one. Is 0.9.2 doing anything that could interfere with the LLM's ability to decide which tool to run? It's so weird, since this change was meant to improve tool reliability: my bot says it should call the 'xyz' tool, then picks a completely wrong tool with missing args..

Downgrading to 0.9 worked perfectly again.

Relevant log output

idk

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.9.2

GiteaMirror added the bug label 2026-04-22 15:23:44 -05:00

@jmorganca commented on GitHub (Jun 19, 2025):

@Notbici sorry you hit this. Do you have an example you can share of the tool and what it's hallucinating instead?


@Notbici commented on GitHub (Jun 19, 2025):

Hi @jmorganca

I made an MVP instead; it's not the original tool, but I was able to replicate the issue.

Model

  • qwen3:32b-q8_0 (fresh from ollama).
  • Function calling mode (might be OpenWebUI-specific): Native
  • Model context is 15k (if it matters); everything else is default.
  • No system prompt, though my original report did have one (so that's ruled out at least).

Here's the tool:

class Tools:
    def __init__(self):
        pass

    def unlock_action(self) -> str:
        """
        Unlocks an action immediately.

        :return: A confirmation message.
        """
        return "another example idk"

    def lock_action(self, duration: str) -> str:
        """
        Locks an action for a set duration.

        :param duration: How long to lock the action for.
        :return: A confirmation message.
        """
        return "I had to make an example"
I load it in OpenWebUI: open the web UI -> Workspace -> Tools -> Import

and here's my conversation:
Me: Hi
Bot: Hello! How can I assist you today? 😊
Me: I need you to lock actions for 20 mins
Bot: calls the lock_action function correctly.
Actions have been locked for 20 minutes. Let me know if you need anything else! 🔒
Me: alright great, time is up.. time to unlock the actions
Bot: <think> Okay, the user wants to unlock the actions after the 20-minute lock. Let me check the available tools. There's an "unlock_action" function with no required parameters. Since the lock was set for 20 minutes and the user says time is up, I should call the unlock function. No arguments needed, just trigger it. Make sure the response confirms the unlock. </think> ...then it calls the lock command instead, with incorrect parameters: "Tools.lock_action() missing 1 required positional argument: 'duration'"

(Screenshot attached.) It then just flat-out loops: the thinking guesses the right tool, then it fails by calling the wrong one.

As soon as I downgrade to ollama 0.9.0, the issue goes away with zero other changes. Just swap the new version for the old, and it's gone.

(Screenshot attached.)

Does this help?
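
For reference, the same two tools can be exercised directly against Ollama's /api/chat endpoint, taking OpenWebUI out of the loop. This is a minimal sketch: the tool JSON below is hand-written to mirror the Python methods above, and the host, model tag, and prompt are illustrative, not from the original report.

import json

import requests  # third-party; assumes `requests` is installed

# Hand-written tool schema mirroring the two OpenWebUI methods above.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lock_action",
            "description": "Locks an action for a set duration.",
            "parameters": {
                "type": "object",
                "properties": {
                    "duration": {
                        "type": "string",
                        "description": "How long to lock the action for.",
                    }
                },
                "required": ["duration"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "unlock_action",
            "description": "Unlocks an action immediately.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:32b-q8_0",
        "messages": [
            {"role": "user", "content": "alright great, time is up.. time to unlock the actions"}
        ],
        "tools": TOOLS,
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()

# A correct run emits unlock_action with no arguments; the bug reported
# here is lock_action being emitted instead, with missing arguments.
for call in resp.json()["message"].get("tool_calls", []):
    print(call["function"]["name"], json.dumps(call["function"]["arguments"]))

Running this repeatedly on the affected version versus 0.9.0 would show whether the wrong tool name comes back from the API itself, which would rule out the OpenWebUI layer.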


@eliciel0513 commented on GitHub (Jun 20, 2025):

I'm having an issue with tools as well. I've tried Qwen3 8B, 14B, and 32B. I'm using n8n with an n8n MCP for Google Tasks; it used to work fine with any of the Qwen3 models to add and delete tasks on different Google task lists. But now the model just responds with the thinking and doesn't follow through with using the tool from the MCP. It seems not to recognize that the tools are available: it just skips the tool and ends the process without actually doing anything, even though it used to work. Btw, I've tried switching to the latest DeepSeek R1, since it was updated to use tools, but it doesn't work at all, and I've tried Magistral, but Magistral doesn't recognize the tools either, just like Qwen3. It seems to be an issue with Ollama tool use.


@Notbici commented on GitHub (Jun 20, 2025):

> I'm having an issue with tools as well. I've tried Qwen3 8B, 14B, and 32B. [...] It seems to be an issue with Ollama tool use.

@eliciel0513 I'd say I've had similar issues spanning back before 0.9.0 as well, but with Command-A, the 111B model. It prefers to just say it did the action rather than actually doing it, and it's a tool-supporting model.

@jmorganca if it also helps, I have free resources to test on: at the moment 4x 5090s and 2x RTX 6000 Adas sitting idle.


@mlaihk commented on GitHub (Jun 20, 2025):

Something has not been right with Ollama for the past couple of releases. I compared the same models served from LM Studio instead of Ollama (0.9.1), using OpenWebUI and getting the models to call tools. Ollama-generated tool calls result in errors well over 80% of the time, yet the same models with the same settings served from LM Studio have a success rate over 80%.

I thought something was wrong with both of my laptops...

With the recent problematic Ollama releases, I am seriously considering moving to llama-swap with llama.cpp instead...


@jmorganca commented on GitHub (Jul 1, 2025):

Hi there, this should be fixed now as of 0.9.4 (https://github.com/ollama/ollama/releases/tag/v0.9.4) and later. Sorry for the issue. Let me know if you're still seeing any issues with tool calling.


@Notbici commented on GitHub (Jul 2, 2025):

> Hi there, this should be fixed now as of 0.9.4 and later. Sorry for the issue. Let me know if you're still seeing any issues with tool calling.

@jmorganca I believe this fix caused another bug: going from 0.9.3 to 0.9.4 instantly caused my models to load onto the CPU instead of the GPU. Ollama sees the GPUs in the logs but assigns the layers to the CPU.

I've got 4x 5090s running Qwen3-32b 4-bit, with no changes other than trying this update to see whether it helped with the tool-call issue.
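
A quick way to confirm where the weights actually landed is Ollama's /api/ps endpoint, which reports each loaded model's total size alongside its VRAM-resident size. A minimal sketch follows; the host is illustrative and the field names are from Ollama's documented API:

import requests  # third-party; assumes `requests` is installed

# Ask the local Ollama server which models are loaded and how much of
# each lives in VRAM. A fully GPU-resident model has size_vram == size;
# size_vram == 0 means the layers were assigned to the CPU.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m["size"]
    vram = m.get("size_vram", 0)
    pct = 100 * vram / size if size else 0.0
    print(f"{m['name']}: {pct:.0f}% of {size / 2**30:.1f} GiB in VRAM")

Full server log from the 0.9.4 startup: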

time=2025-07-02T07:21:00.813Z level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG-4 OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://<redacted>:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:true ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-07-02T07:21:00.818Z level=INFO source=images.go:476 msg="total blobs: 86"
time=2025-07-02T07:21:00.819Z level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-07-02T07:21:00.819Z level=INFO source=routes.go:1288 msg="Listening on <redacted>:11434 (version 0.9.4)"
time=2025-07-02T07:21:00.819Z level=DEBUG source=sched.go:108 msg="starting llm scheduler"
time=2025-07-02T07:21:00.819Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-07-02T07:21:00.842Z level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-07-02T07:21:00.842Z level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-07-02T07:21:00.842Z level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-07-02T07:21:00.847Z level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.570.169 /usr/lib/x86_64-linux-gnu/libcuda.so.570.153.02]"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:01.471Z level=DEBUG source=gpu.go:125 msg="detected GPUs" count=4 library=/usr/lib/x86_64-linux-gnu/libcuda.so.570.169
[GPU-855b345a-d321-1304-e06d-5100e5f1d9bc] CUDA totalMem 32119mb
[GPU-855b345a-d321-1304-e06d-5100e5f1d9bc] CUDA freeMem 31613mb
[GPU-855b345a-d321-1304-e06d-5100e5f1d9bc] Compute Capability 12.0
[GPU-85a20cf6-234c-7d11-0c17-e985189c028f] CUDA totalMem 32119mb
[GPU-85a20cf6-234c-7d11-0c17-e985189c028f] CUDA freeMem 31613mb
[GPU-85a20cf6-234c-7d11-0c17-e985189c028f] Compute Capability 12.0
[GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1] CUDA totalMem 32119mb
[GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1] CUDA freeMem 31613mb
[GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1] Compute Capability 12.0
[GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db] CUDA totalMem 32119mb
[GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db] CUDA freeMem 31613mb
[GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db] Compute Capability 12.0
time=2025-07-02T07:21:02.418Z level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-85a20cf6-234c-7d11-0c17-e985189c028f library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"


time=2025-07-02T07:21:24.537Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.6 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:24.734Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:24.899Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.052Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.206Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:25.216Z level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-02T07:21:25.258Z level=DEBUG source=sched.go:228 msg="loading first model" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-07-02T07:21:25.258Z level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=4 available="[30.9 GiB 30.9 GiB 30.9 GiB 30.9 GiB]"
time=2025-07-02T07:21:25.258Z level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-07-02T07:21:25.259Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.5 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:25.463Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.706Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.950Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.186Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:26.187Z level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 library=cuda parallel=1 required="45.3 GiB"
time=2025-07-02T07:21:26.187Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.5 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:26.337Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.484Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.628Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.772Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:26.772Z level=INFO source=server.go:135 msg="system memory" total="251.5 GiB" free="247.5 GiB" free_swap="0 B"
time=2025-07-02T07:21:26.772Z level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=4 available="[30.9 GiB 30.9 GiB 30.9 GiB 30.9 GiB]"
time=2025-07-02T07:21:26.772Z level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-07-02T07:21:26.772Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.5 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:26.916Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:27.060Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:27.204Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:27.349Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:27.349Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split=17,16,16,16 memory.available="[30.9 GiB 30.9 GiB 30.9 GiB 30.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="45.3 GiB" memory.required.partial="45.3 GiB" memory.required.kv="3.8 GiB" memory.required.allocations="[11.8 GiB 11.2 GiB 11.2 GiB 11.2 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="5.0 GiB" memory.graph.partial="5.0 GiB"
time=2025-07-02T07:21:27.349Z level=INFO source=server.go:218 msg="enabling flash attention"
time=2025-07-02T07:21:27.349Z level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-02T07:21:27.515Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 --ctx-size 30720 --batch-size 2048 --n-gpu-layers 65 --threads 16 --flash-attn --kv-cache-type q8_0 --parallel 1 --tensor-split 17,16,16,16 --port 46343"
time=2025-07-02T07:21:27.515Z level=DEBUG source=server.go:439 msg=subprocess OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_SCHED_SPREAD=1 OLLAMA_HOST=<redacted> OLLAMA_NUM_PARALLEL=1 OLLAMA_DEBUG=2 PATH=/root/.local/bin:/oob/text-generation-webui-main/installer_files/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_KEEP_ALIVE=1h OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc,GPU-85a20cf6-234c-7d11-0c17-e985189c028f,GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1,GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db
time=2025-07-02T07:21:27.515Z level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-02T07:21:27.515Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-02T07:21:27.515Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-02T07:21:27.526Z level=INFO source=runner.go:815 msg="starting go runner"
time=2025-07-02T07:21:27.526Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
time=2025-07-02T07:21:27.526Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-02T07:21:27.548Z level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:46343"
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 25600
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: layer  29 assigned to device CPU, is_swa = 0
load_tensors: layer  30 assigned to device CPU, is_swa = 0
load_tensors: layer  31 assigned to device CPU, is_swa = 0
load_tensors: layer  32 assigned to device CPU, is_swa = 0
load_tensors: layer  33 assigned to device CPU, is_swa = 0
load_tensors: layer  34 assigned to device CPU, is_swa = 0
load_tensors: layer  35 assigned to device CPU, is_swa = 0
load_tensors: layer  36 assigned to device CPU, is_swa = 0
load_tensors: layer  37 assigned to device CPU, is_swa = 0
load_tensors: layer  38 assigned to device CPU, is_swa = 0
load_tensors: layer  39 assigned to device CPU, is_swa = 0
load_tensors: layer  40 assigned to device CPU, is_swa = 0
load_tensors: layer  41 assigned to device CPU, is_swa = 0
load_tensors: layer  42 assigned to device CPU, is_swa = 0
load_tensors: layer  43 assigned to device CPU, is_swa = 0
load_tensors: layer  44 assigned to device CPU, is_swa = 0
load_tensors: layer  45 assigned to device CPU, is_swa = 0
load_tensors: layer  46 assigned to device CPU, is_swa = 0
load_tensors: layer  47 assigned to device CPU, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: layer  49 assigned to device CPU, is_swa = 0
load_tensors: layer  50 assigned to device CPU, is_swa = 0
load_tensors: layer  51 assigned to device CPU, is_swa = 0
load_tensors: layer  52 assigned to device CPU, is_swa = 0
load_tensors: layer  53 assigned to device CPU, is_swa = 0
load_tensors: layer  54 assigned to device CPU, is_swa = 0
load_tensors: layer  55 assigned to device CPU, is_swa = 0
load_tensors: layer  56 assigned to device CPU, is_swa = 0
load_tensors: layer  57 assigned to device CPU, is_swa = 0
load_tensors: layer  58 assigned to device CPU, is_swa = 0
load_tensors: layer  59 assigned to device CPU, is_swa = 0
load_tensors: layer  60 assigned to device CPU, is_swa = 0
load_tensors: layer  61 assigned to device CPU, is_swa = 0
load_tensors: layer  62 assigned to device CPU, is_swa = 0
load_tensors: layer  63 assigned to device CPU, is_swa = 0
load_tensors: layer  64 assigned to device CPU, is_swa = 0
time=2025-07-02T07:21:27.766Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_tensors:   CPU_Mapped model buffer size = 19259.71 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 30720
llama_context: n_ctx_per_seq = 30720
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (30720) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.60 MiB
create_memory: n_ctx = 30720 (padded)
llama_kv_cache_unified: kv_size = 30720, type_k = 'q8_0', type_v = 'q8_0', n_layer = 64, can_shift = 1, padding = 256
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified: layer  26: dev = CPU
llama_kv_cache_unified: layer  27: dev = CPU
llama_kv_cache_unified: layer  28: dev = CPU
llama_kv_cache_unified: layer  29: dev = CPU
llama_kv_cache_unified: layer  30: dev = CPU
llama_kv_cache_unified: layer  31: dev = CPU
llama_kv_cache_unified: layer  32: dev = CPU
llama_kv_cache_unified: layer  33: dev = CPU
llama_kv_cache_unified: layer  34: dev = CPU
llama_kv_cache_unified: layer  35: dev = CPU
llama_kv_cache_unified: layer  36: dev = CPU
llama_kv_cache_unified: layer  37: dev = CPU
llama_kv_cache_unified: layer  38: dev = CPU
llama_kv_cache_unified: layer  39: dev = CPU
llama_kv_cache_unified: layer  40: dev = CPU
llama_kv_cache_unified: layer  41: dev = CPU
llama_kv_cache_unified: layer  42: dev = CPU
llama_kv_cache_unified: layer  43: dev = CPU
llama_kv_cache_unified: layer  44: dev = CPU
llama_kv_cache_unified: layer  45: dev = CPU
llama_kv_cache_unified: layer  46: dev = CPU
llama_kv_cache_unified: layer  47: dev = CPU
llama_kv_cache_unified: layer  48: dev = CPU
llama_kv_cache_unified: layer  49: dev = CPU
llama_kv_cache_unified: layer  50: dev = CPU
llama_kv_cache_unified: layer  51: dev = CPU
llama_kv_cache_unified: layer  52: dev = CPU
llama_kv_cache_unified: layer  53: dev = CPU
llama_kv_cache_unified: layer  54: dev = CPU
llama_kv_cache_unified: layer  55: dev = CPU
llama_kv_cache_unified: layer  56: dev = CPU
llama_kv_cache_unified: layer  57: dev = CPU
llama_kv_cache_unified: layer  58: dev = CPU
llama_kv_cache_unified: layer  59: dev = CPU
llama_kv_cache_unified: layer  60: dev = CPU
llama_kv_cache_unified: layer  61: dev = CPU
llama_kv_cache_unified: layer  62: dev = CPU
llama_kv_cache_unified: layer  63: dev = CPU
time=2025-07-02T07:21:29.020Z level=DEBUG source=server.go:643 msg="model load progress 1.00"
time=2025-07-02T07:21:29.271Z level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_unified:        CPU KV buffer size =  4080.00 MiB
llama_kv_cache_unified: KV self size  = 4080.00 MiB, K (q8_0): 2040.00 MiB, V (q8_0): 2040.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:        CPU compute buffer size =   316.75 MiB
llama_context: graph nodes  = 2183
llama_context: graph splits = 1
time=2025-07-02T07:21:30.526Z level=INFO source=server.go:637 msg="llama runner started in 3.01 seconds"
time=2025-07-02T07:21:30.526Z level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720
time=2025-07-02T07:21:30.526Z level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=52 format=""
time=2025-07-02T07:21:30.526Z level=TRACE source=server.go:737 msg="completion request" prompt="<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n"
time=2025-07-02T07:21:30.528Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=9 used=0 remaining=9
[GIN] 2025/07/02 - 07:23:14 | 200 |         1m49s |  <redacted> | POST     "/api/chat"
time=2025-07-02T07:23:14.028Z level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-07-02T07:23:14.028Z level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720 duration=1h0m0s
time=2025-07-02T07:23:14.028Z level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720 refCount=0
time=2025-07-02T07:23:14.139Z level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-07-02T07:23:14.140Z level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=1353 format=""
time=2025-07-02T07:23:14.140Z level=TRACE source=server.go:737 msg="completion request" prompt="<|im_start|>user\n### Task:\nGenerate a concise, 3-5 word title with an emoji summarizing the chat history.\n### Guidelines:\n- The title should clearly represent the main theme or subject of the conversation.\n- Use emojis that enhance understanding of the topic, but avoid quotation marks or special formatting.\n- Write the title in the chat's primary language; default to English if multilingual.\n- Prioritize accuracy over excessive creativity; keep it clear and simple.\n- Your entire response must consist solely of the JSON object, without any introductory or concluding text.\n- The output must be a single, raw JSON object, without any markdown code fences or other encapsulating text.\n- Ensure no conversational text, affirmations, or explanations precede or follow the raw JSON output, as this will cause direct parsing failure.\n### Output:\nJSON format: { \"title\": \"your concise title here\" }\n### Examples:\n- { \"title\": \"📉 Stock Market Trends\" },\n- { \"title\": \"🍪 Perfect Chocolate Chip Recipe\" },\n- { \"title\": \"Evolution of Music Streaming\" },\n- { \"title\": \"Remote Work Productivity Tips\" },\n- { \"title\": \"Artificial Intelligence in Healthcare\" },\n- { \"title\": \"🎮 Video Game Development Insights\" }\n### Chat History:\n<chat_history>\nUSER: hi\nASSISTANT: Hello! 😊 How can I assist you today?\n</chat_history><|im_end|>\n<|im_start|>assistant\n"
time=2025-07-02T07:23:14.147Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=94 prompt=297 used=3 remaining=294
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720 refCount=1
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720