[GH-ISSUE #11135] Qwen3 - Tool Call hallucination #33104

Closed
opened 2026-04-22 15:23:44 -05:00 by GiteaMirror · 7 comments

Originally created by @Notbici on GitHub (Jun 19, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/11135

Originally assigned to: @jmorganca on GitHub.

What is the issue?

Hello..

After installing 0.9.2, my Qwen3-32b-8bit bot has been hallucinating tool calls with the Native tool call setting in OpenWebUI.

I can see the <think> breakdown mention the correct tool, but ever since upgrading from 0.9 it hallucinates the wrong one. Is 0.9.2 doing anything that could interfere with the LLM's ability to decide which tool to run? It's so weird, since this change was meant to improve tool reliability: my bot says it should call the 'xyz' tool, then picks a completely wrong tool with missing args..

Downgrading to 0.9 worked perfectly again.

Relevant log output

idk

OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.9.2

GiteaMirror added the bug label 2026-04-22 15:23:44 -05:00

@jmorganca commented on GitHub (Jun 19, 2025):

@Notbici sorry you hit this. Do you have an example you can share of the tool and what it's hallucinating instead?


@Notbici commented on GitHub (Jun 19, 2025):

Hi @jmorganca

I made an MVP instead; it's not the original tool, but I was able to replicate the issue.

Model

  • qwen3:32b-q8_0 (fresh from ollama).
  • Function calling mode (might be OpenWebUI-specific): Native
  • Model context is 15k (if it matters); everything else is default.
  • No system prompt, though my original report did have one (so that's ruled out at least).

Here's the tool:

class Tools:
    def __init__(self):
        pass

    def unlock_action(self) -> str:
        """
        Unlocks an action immediately.

        :return: A confirmation message.
        """
        return "another example idk"

    def lock_action(self, duration: str) -> str:
        """
        Locks an action for a set duration.

        :param duration: How long to lock the action for.
        :return: A confirmation message.
        """
        return "I had to make an example"
I load it in OpenWebUI: open the web UI -> Workspace -> Tools -> Import

and here's my conversation:
Me: Hi
Bot: Hello! How can I assist you today? 😊
Me: I need you to lock actions for 20 mins
Bot: calls the lock_action function correctly.
Actions have been locked for 20 minutes. Let me know if you need anything else! 🔒
Me: alright great, time is up.. time to unlock the actions
Bot: <think> Okay, the user wants to unlock the actions after the 20-minute lock. Let me check the available tools. There's an "unlock_action" function with no required parameters. Since the lock was set for 20 minutes and the user says time is up, I should call the unlock function. No arguments needed, just trigger it. Make sure the response confirms the unlock. </think> ...then it calls the lock command instead, with incorrect parameters: "Tools.lock_action() missing 1 required positional argument: 'duration'"

(Screenshot attached.) It then just flat-out loops: the thinking guesses the right tool, then it fails by calling the wrong one.

As soon as I downgrade to ollama 0.9.0, the issue goes away with zero other changes. Just swap the new version for the old, and it's gone.

(Screenshot attached.)

Does this help?
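
For reference, the same two tools can be exercised directly against Ollama's /api/chat endpoint, taking OpenWebUI out of the loop. This is a minimal sketch: the tool JSON below is hand-written to mirror the Python methods above, and the host, model tag, and prompt are illustrative, not from the original report.

import json

import requests  # third-party; assumes `requests` is installed

# Hand-written tool schema mirroring the two OpenWebUI methods above.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lock_action",
            "description": "Locks an action for a set duration.",
            "parameters": {
                "type": "object",
                "properties": {
                    "duration": {
                        "type": "string",
                        "description": "How long to lock the action for.",
                    }
                },
                "required": ["duration"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "unlock_action",
            "description": "Unlocks an action immediately.",
            "parameters": {"type": "object", "properties": {}},
        },
    },
]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen3:32b-q8_0",
        "messages": [
            {"role": "user", "content": "alright great, time is up.. time to unlock the actions"}
        ],
        "tools": TOOLS,
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()

# A correct run emits unlock_action with no arguments; the bug reported
# here is lock_action being emitted instead, with missing arguments.
for call in resp.json()["message"].get("tool_calls", []):
    print(call["function"]["name"], json.dumps(call["function"]["arguments"]))

Running this repeatedly on the affected version versus 0.9.0 would show whether the wrong tool name comes back from the API itself, which would rule out the OpenWebUI layer.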


@eliciel0513 commented on GitHub (Jun 20, 2025):

I'm having an issue with tools as well. I've tried Qwen3 8B, 14B, and 32B. I'm using n8n with an n8n MCP for Google Tasks; it used to work fine with any of the Qwen3 models to add and delete tasks on different Google task lists. But now the model just responds with the thinking and doesn't follow through with using the tool from the MCP. It seems not to recognize that the tools are available: it just skips the tool and ends the process without actually doing anything, even though it used to work. Btw, I've tried switching to the latest DeepSeek R1, since it was updated to use tools, but it doesn't work at all, and I've tried Magistral, but Magistral doesn't recognize the tools either, just like Qwen3. It seems to be an issue with Ollama tool use.


@Notbici commented on GitHub (Jun 20, 2025):

> I'm having an issue with tools as well. I've tried Qwen3 8B, 14B, and 32B. [...] It seems to be an issue with Ollama tool use.

@eliciel0513 I'd say I've had similar issues spanning back before 0.9.0 as well, but with Command-A, the 111B model. It prefers to just say it did the action rather than actually doing it, and it's a tool-supporting model.

@jmorganca if it also helps, I have free resources to test on: at the moment 4x 5090s and 2x RTX 6000 Adas sitting idle.


@mlaihk commented on GitHub (Jun 20, 2025):

Something has not been right with Ollama for the past couple of releases. I compared the same models served from LM Studio instead of Ollama (0.9.1), using OpenWebUI and getting the models to call tools. Ollama-generated tool calls result in errors well over 80% of the time, yet the same models with the same settings served from LM Studio have a success rate over 80%.

I thought something was wrong with both of my laptops...

With the recent problematic Ollama releases, I am seriously considering moving to llama-swap with llama.cpp instead...


@jmorganca commented on GitHub (Jul 1, 2025):

Hi there, this should be fixed now as of 0.9.4 (https://github.com/ollama/ollama/releases/tag/v0.9.4) and later. Sorry for the issue. Let me know if you're still seeing any issues with tool calling.


@Notbici commented on GitHub (Jul 2, 2025):

> Hi there, this should be fixed now as of 0.9.4 and later. Sorry for the issue. Let me know if you're still seeing any issues with tool calling.

@jmorganca I believe this fix caused another bug: going from 0.9.3 to 0.9.4 instantly caused my models to load onto the CPU instead of the GPU. Ollama sees the GPUs in the logs but assigns the layers to the CPU.

I've got 4x 5090s running Qwen3-32b 4-bit, with no changes other than trying this update to see whether it helped with the tool-call issue.
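
A quick way to confirm where the weights actually landed is Ollama's /api/ps endpoint, which reports each loaded model's total size alongside its VRAM-resident size. A minimal sketch follows; the host is illustrative and the field names are from Ollama's documented API:

import requests  # third-party; assumes `requests` is installed

# Ask the local Ollama server which models are loaded and how much of
# each lives in VRAM. A fully GPU-resident model has size_vram == size;
# size_vram == 0 means the layers were assigned to the CPU.
resp = requests.get("http://localhost:11434/api/ps", timeout=10)
resp.raise_for_status()

for m in resp.json().get("models", []):
    size = m["size"]
    vram = m.get("size_vram", 0)
    pct = 100 * vram / size if size else 0.0
    print(f"{m['name']}: {pct:.0f}% of {size / 2**30:.1f} GiB in VRAM")

Full server log from the 0.9.4 startup: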

time=2025-07-02T07:21:00.813Z level=INFO source=routes.go:1235 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG-4 OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://<redacted>:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:1h0m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:1 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:true ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-07-02T07:21:00.818Z level=INFO source=images.go:476 msg="total blobs: 86"
time=2025-07-02T07:21:00.819Z level=INFO source=images.go:483 msg="total unused blobs removed: 0"
time=2025-07-02T07:21:00.819Z level=INFO source=routes.go:1288 msg="Listening on <redacted>:11434 (version 0.9.4)"
time=2025-07-02T07:21:00.819Z level=DEBUG source=sched.go:108 msg="starting llm scheduler"
time=2025-07-02T07:21:00.819Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-07-02T07:21:00.842Z level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
time=2025-07-02T07:21:00.842Z level=DEBUG source=gpu.go:501 msg="Searching for GPU library" name=libcuda.so*
time=2025-07-02T07:21:00.842Z level=DEBUG source=gpu.go:525 msg="gpu library search" globs="[/usr/local/lib/ollama/libcuda.so* /libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2025-07-02T07:21:00.847Z level=DEBUG source=gpu.go:558 msg="discovered GPU libraries" paths="[/usr/lib/x86_64-linux-gnu/libcuda.so.570.169 /usr/lib/x86_64-linux-gnu/libcuda.so.570.153.02]"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:01.471Z level=DEBUG source=gpu.go:125 msg="detected GPUs" count=4 library=/usr/lib/x86_64-linux-gnu/libcuda.so.570.169
[GPU-855b345a-d321-1304-e06d-5100e5f1d9bc] CUDA totalMem 32119mb
[GPU-855b345a-d321-1304-e06d-5100e5f1d9bc] CUDA freeMem 31613mb
[GPU-855b345a-d321-1304-e06d-5100e5f1d9bc] Compute Capability 12.0
[GPU-85a20cf6-234c-7d11-0c17-e985189c028f] CUDA totalMem 32119mb
[GPU-85a20cf6-234c-7d11-0c17-e985189c028f] CUDA freeMem 31613mb
[GPU-85a20cf6-234c-7d11-0c17-e985189c028f] Compute Capability 12.0
[GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1] CUDA totalMem 32119mb
[GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1] CUDA freeMem 31613mb
[GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1] Compute Capability 12.0
[GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db] CUDA totalMem 32119mb
[GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db] CUDA freeMem 31613mb
[GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db] Compute Capability 12.0
time=2025-07-02T07:21:02.418Z level=DEBUG source=amd_linux.go:419 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-85a20cf6-234c-7d11-0c17-e985189c028f library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"
time=2025-07-02T07:21:02.418Z level=INFO source=types.go:130 msg="inference compute" id=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db library=cuda variant=v12 compute=12.0 driver=12.8 name="NVIDIA GeForce RTX 5090" total="31.4 GiB" available="30.9 GiB"


time=2025-07-02T07:21:24.537Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.6 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:24.734Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:24.899Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.052Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.206Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:25.216Z level=DEBUG source=ggml.go:206 msg="key with type not found" key=general.alignment default=32
time=2025-07-02T07:21:25.258Z level=DEBUG source=sched.go:228 msg="loading first model" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-07-02T07:21:25.258Z level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=4 available="[30.9 GiB 30.9 GiB 30.9 GiB 30.9 GiB]"
time=2025-07-02T07:21:25.258Z level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-07-02T07:21:25.259Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.5 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:25.463Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.706Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:25.950Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.186Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:26.187Z level=INFO source=sched.go:804 msg="new model will fit in available VRAM, loading" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 library=cuda parallel=1 required="45.3 GiB"
time=2025-07-02T07:21:26.187Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.5 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:26.337Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.484Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.628Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:26.772Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:26.772Z level=INFO source=server.go:135 msg="system memory" total="251.5 GiB" free="247.5 GiB" free_swap="0 B"
time=2025-07-02T07:21:26.772Z level=DEBUG source=memory.go:111 msg=evaluating library=cuda gpu_count=4 available="[30.9 GiB 30.9 GiB 30.9 GiB 30.9 GiB]"
time=2025-07-02T07:21:26.772Z level=DEBUG source=ggml.go:206 msg="key with type not found" key=qwen3.vision.block_count default=0
time=2025-07-02T07:21:26.772Z level=DEBUG source=gpu.go:391 msg="updating system memory data" before.total="251.5 GiB" before.free="247.5 GiB" before.free_swap="0 B" now.total="251.5 GiB" now.free="247.5 GiB" now.free_swap="0 B"
initializing /usr/lib/x86_64-linux-gnu/libcuda.so.570.169
dlsym: cuInit - 0x7f7113e48a60
dlsym: cuDriverGetVersion - 0x7f7113e48a80
dlsym: cuDeviceGetCount - 0x7f7113e48ac0
dlsym: cuDeviceGet - 0x7f7113e48aa0
dlsym: cuDeviceGetAttribute - 0x7f7113e48ba0
dlsym: cuDeviceGetUuid - 0x7f7113e48b00
dlsym: cuDeviceGetName - 0x7f7113e48ae0
dlsym: cuCtxCreate_v3 - 0x7f7113e48d80
dlsym: cuMemGetInfo_v2 - 0x7f7113e69140
dlsym: cuCtxDestroy - 0x7f7113ea7a60
calling cuInit
calling cuDriverGetVersion
raw version 0x2f30
CUDA driver version: 12.8
calling cuDeviceGetCount
device count 4
time=2025-07-02T07:21:26.916Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:27.060Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-85a20cf6-234c-7d11-0c17-e985189c028f name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:27.204Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1 name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
time=2025-07-02T07:21:27.349Z level=DEBUG source=gpu.go:441 msg="updating cuda memory data" gpu=GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db name="NVIDIA GeForce RTX 5090" overhead="0 B" before.total="31.4 GiB" before.free="30.9 GiB" now.total="31.4 GiB" now.free="30.9 GiB" now.used="506.2 MiB"
releasing cuda driver library
time=2025-07-02T07:21:27.349Z level=INFO source=server.go:175 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split=17,16,16,16 memory.available="[30.9 GiB 30.9 GiB 30.9 GiB 30.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="45.3 GiB" memory.required.partial="45.3 GiB" memory.required.kv="3.8 GiB" memory.required.allocations="[11.8 GiB 11.2 GiB 11.2 GiB 11.2 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="17.8 GiB" memory.weights.nonrepeating="608.6 MiB" memory.graph.full="5.0 GiB" memory.graph.partial="5.0 GiB"
time=2025-07-02T07:21:27.349Z level=INFO source=server.go:218 msg="enabling flash attention"
time=2025-07-02T07:21:27.349Z level=DEBUG source=server.go:291 msg="compatible gpu libraries" compatible=[]
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 1
print_info: model type       = ?B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-07-02T07:21:27.515Z level=INFO source=server.go:438 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 --ctx-size 30720 --batch-size 2048 --n-gpu-layers 65 --threads 16 --flash-attn --kv-cache-type q8_0 --parallel 1 --tensor-split 17,16,16,16 --port 46343"
time=2025-07-02T07:21:27.515Z level=DEBUG source=server.go:439 msg=subprocess OLLAMA_MAX_LOADED_MODELS=1 OLLAMA_SCHED_SPREAD=1 OLLAMA_HOST=<redacted> OLLAMA_NUM_PARALLEL=1 OLLAMA_DEBUG=2 PATH=/root/.local/bin:/oob/text-generation-webui-main/installer_files/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_KEEP_ALIVE=1h OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 OLLAMA_LIBRARY_PATH=/usr/local/lib/ollama LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama CUDA_VISIBLE_DEVICES=GPU-855b345a-d321-1304-e06d-5100e5f1d9bc,GPU-85a20cf6-234c-7d11-0c17-e985189c028f,GPU-047af8ac-3926-d40d-aa93-0bd45f8115f1,GPU-2f61d2bb-98a4-3ca8-1395-3d1853b512db
time=2025-07-02T07:21:27.515Z level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-07-02T07:21:27.515Z level=INFO source=server.go:598 msg="waiting for llama runner to start responding"
time=2025-07-02T07:21:27.515Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server not responding"
time=2025-07-02T07:21:27.526Z level=INFO source=runner.go:815 msg="starting go runner"
time=2025-07-02T07:21:27.526Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/local/lib/ollama
time=2025-07-02T07:21:27.526Z level=INFO source=ggml.go:104 msg=system CPU.0.LLAMAFILE=1 compiler=cgo(gcc)
time=2025-07-02T07:21:27.548Z level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:46343"
llama_model_loader: loaded meta data with 27 key-value pairs and 707 tensors from /root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen3 32B
llama_model_loader: - kv   3:                           general.basename str              = Qwen3
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                          qwen3.block_count u32              = 64
llama_model_loader: - kv   6:                       qwen3.context_length u32              = 40960
llama_model_loader: - kv   7:                     qwen3.embedding_length u32              = 5120
llama_model_loader: - kv   8:                  qwen3.feed_forward_length u32              = 25600
llama_model_loader: - kv   9:                 qwen3.attention.head_count u32              = 64
llama_model_loader: - kv  10:              qwen3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  11:                       qwen3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  12:     qwen3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  13:                 qwen3.attention.key_length u32              = 128
llama_model_loader: - kv  14:               qwen3.attention.value_length u32              = 128
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  17:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  19:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  22:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  23:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  24:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  25:               general.quantization_version u32              = 2
llama_model_loader: - kv  26:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  257 tensors
llama_model_loader: - type  f16:   64 tensors
llama_model_loader: - type q4_K:  353 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 18.81 GiB (4.93 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: special tokens cache size = 26
load: token to piece cache size = 0.9311 MB
print_info: arch             = qwen3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 40960
print_info: n_embd           = 5120
print_info: n_layer          = 64
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_swa_pattern    = 1
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 25600
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 40960
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 32B
print_info: model params     = 32.76 B
print_info: general.name     = Qwen3 32B
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151643 '<|endoftext|>'
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device CPU, is_swa = 0
load_tensors: layer   1 assigned to device CPU, is_swa = 0
load_tensors: layer   2 assigned to device CPU, is_swa = 0
load_tensors: layer   3 assigned to device CPU, is_swa = 0
load_tensors: layer   4 assigned to device CPU, is_swa = 0
load_tensors: layer   5 assigned to device CPU, is_swa = 0
load_tensors: layer   6 assigned to device CPU, is_swa = 0
load_tensors: layer   7 assigned to device CPU, is_swa = 0
load_tensors: layer   8 assigned to device CPU, is_swa = 0
load_tensors: layer   9 assigned to device CPU, is_swa = 0
load_tensors: layer  10 assigned to device CPU, is_swa = 0
load_tensors: layer  11 assigned to device CPU, is_swa = 0
load_tensors: layer  12 assigned to device CPU, is_swa = 0
load_tensors: layer  13 assigned to device CPU, is_swa = 0
load_tensors: layer  14 assigned to device CPU, is_swa = 0
load_tensors: layer  15 assigned to device CPU, is_swa = 0
load_tensors: layer  16 assigned to device CPU, is_swa = 0
load_tensors: layer  17 assigned to device CPU, is_swa = 0
load_tensors: layer  18 assigned to device CPU, is_swa = 0
load_tensors: layer  19 assigned to device CPU, is_swa = 0
load_tensors: layer  20 assigned to device CPU, is_swa = 0
load_tensors: layer  21 assigned to device CPU, is_swa = 0
load_tensors: layer  22 assigned to device CPU, is_swa = 0
load_tensors: layer  23 assigned to device CPU, is_swa = 0
load_tensors: layer  24 assigned to device CPU, is_swa = 0
load_tensors: layer  25 assigned to device CPU, is_swa = 0
load_tensors: layer  26 assigned to device CPU, is_swa = 0
load_tensors: layer  27 assigned to device CPU, is_swa = 0
load_tensors: layer  28 assigned to device CPU, is_swa = 0
load_tensors: layer  29 assigned to device CPU, is_swa = 0
load_tensors: layer  30 assigned to device CPU, is_swa = 0
load_tensors: layer  31 assigned to device CPU, is_swa = 0
load_tensors: layer  32 assigned to device CPU, is_swa = 0
load_tensors: layer  33 assigned to device CPU, is_swa = 0
load_tensors: layer  34 assigned to device CPU, is_swa = 0
load_tensors: layer  35 assigned to device CPU, is_swa = 0
load_tensors: layer  36 assigned to device CPU, is_swa = 0
load_tensors: layer  37 assigned to device CPU, is_swa = 0
load_tensors: layer  38 assigned to device CPU, is_swa = 0
load_tensors: layer  39 assigned to device CPU, is_swa = 0
load_tensors: layer  40 assigned to device CPU, is_swa = 0
load_tensors: layer  41 assigned to device CPU, is_swa = 0
load_tensors: layer  42 assigned to device CPU, is_swa = 0
load_tensors: layer  43 assigned to device CPU, is_swa = 0
load_tensors: layer  44 assigned to device CPU, is_swa = 0
load_tensors: layer  45 assigned to device CPU, is_swa = 0
load_tensors: layer  46 assigned to device CPU, is_swa = 0
load_tensors: layer  47 assigned to device CPU, is_swa = 0
load_tensors: layer  48 assigned to device CPU, is_swa = 0
load_tensors: layer  49 assigned to device CPU, is_swa = 0
load_tensors: layer  50 assigned to device CPU, is_swa = 0
load_tensors: layer  51 assigned to device CPU, is_swa = 0
load_tensors: layer  52 assigned to device CPU, is_swa = 0
load_tensors: layer  53 assigned to device CPU, is_swa = 0
load_tensors: layer  54 assigned to device CPU, is_swa = 0
load_tensors: layer  55 assigned to device CPU, is_swa = 0
load_tensors: layer  56 assigned to device CPU, is_swa = 0
load_tensors: layer  57 assigned to device CPU, is_swa = 0
load_tensors: layer  58 assigned to device CPU, is_swa = 0
load_tensors: layer  59 assigned to device CPU, is_swa = 0
load_tensors: layer  60 assigned to device CPU, is_swa = 0
load_tensors: layer  61 assigned to device CPU, is_swa = 0
load_tensors: layer  62 assigned to device CPU, is_swa = 0
load_tensors: layer  63 assigned to device CPU, is_swa = 0
load_tensors: layer  64 assigned to device CPU, is_swa = 0
time=2025-07-02T07:21:27.766Z level=INFO source=server.go:632 msg="waiting for server to become available" status="llm server loading model"
load_tensors:   CPU_Mapped model buffer size = 19259.71 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 30720
llama_context: n_ctx_per_seq = 30720
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 1
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (30720) < n_ctx_train (40960) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.60 MiB
create_memory: n_ctx = 30720 (padded)
llama_kv_cache_unified: kv_size = 30720, type_k = 'q8_0', type_v = 'q8_0', n_layer = 64, can_shift = 1, padding = 256
llama_kv_cache_unified: layer   0: dev = CPU
llama_kv_cache_unified: layer   1: dev = CPU
llama_kv_cache_unified: layer   2: dev = CPU
llama_kv_cache_unified: layer   3: dev = CPU
llama_kv_cache_unified: layer   4: dev = CPU
llama_kv_cache_unified: layer   5: dev = CPU
llama_kv_cache_unified: layer   6: dev = CPU
llama_kv_cache_unified: layer   7: dev = CPU
llama_kv_cache_unified: layer   8: dev = CPU
llama_kv_cache_unified: layer   9: dev = CPU
llama_kv_cache_unified: layer  10: dev = CPU
llama_kv_cache_unified: layer  11: dev = CPU
llama_kv_cache_unified: layer  12: dev = CPU
llama_kv_cache_unified: layer  13: dev = CPU
llama_kv_cache_unified: layer  14: dev = CPU
llama_kv_cache_unified: layer  15: dev = CPU
llama_kv_cache_unified: layer  16: dev = CPU
llama_kv_cache_unified: layer  17: dev = CPU
llama_kv_cache_unified: layer  18: dev = CPU
llama_kv_cache_unified: layer  19: dev = CPU
llama_kv_cache_unified: layer  20: dev = CPU
llama_kv_cache_unified: layer  21: dev = CPU
llama_kv_cache_unified: layer  22: dev = CPU
llama_kv_cache_unified: layer  23: dev = CPU
llama_kv_cache_unified: layer  24: dev = CPU
llama_kv_cache_unified: layer  25: dev = CPU
llama_kv_cache_unified: layer  26: dev = CPU
llama_kv_cache_unified: layer  27: dev = CPU
llama_kv_cache_unified: layer  28: dev = CPU
llama_kv_cache_unified: layer  29: dev = CPU
llama_kv_cache_unified: layer  30: dev = CPU
llama_kv_cache_unified: layer  31: dev = CPU
llama_kv_cache_unified: layer  32: dev = CPU
llama_kv_cache_unified: layer  33: dev = CPU
llama_kv_cache_unified: layer  34: dev = CPU
llama_kv_cache_unified: layer  35: dev = CPU
llama_kv_cache_unified: layer  36: dev = CPU
llama_kv_cache_unified: layer  37: dev = CPU
llama_kv_cache_unified: layer  38: dev = CPU
llama_kv_cache_unified: layer  39: dev = CPU
llama_kv_cache_unified: layer  40: dev = CPU
llama_kv_cache_unified: layer  41: dev = CPU
llama_kv_cache_unified: layer  42: dev = CPU
llama_kv_cache_unified: layer  43: dev = CPU
llama_kv_cache_unified: layer  44: dev = CPU
llama_kv_cache_unified: layer  45: dev = CPU
llama_kv_cache_unified: layer  46: dev = CPU
llama_kv_cache_unified: layer  47: dev = CPU
llama_kv_cache_unified: layer  48: dev = CPU
llama_kv_cache_unified: layer  49: dev = CPU
llama_kv_cache_unified: layer  50: dev = CPU
llama_kv_cache_unified: layer  51: dev = CPU
llama_kv_cache_unified: layer  52: dev = CPU
llama_kv_cache_unified: layer  53: dev = CPU
llama_kv_cache_unified: layer  54: dev = CPU
llama_kv_cache_unified: layer  55: dev = CPU
llama_kv_cache_unified: layer  56: dev = CPU
llama_kv_cache_unified: layer  57: dev = CPU
llama_kv_cache_unified: layer  58: dev = CPU
llama_kv_cache_unified: layer  59: dev = CPU
llama_kv_cache_unified: layer  60: dev = CPU
llama_kv_cache_unified: layer  61: dev = CPU
llama_kv_cache_unified: layer  62: dev = CPU
llama_kv_cache_unified: layer  63: dev = CPU
time=2025-07-02T07:21:29.020Z level=DEBUG source=server.go:643 msg="model load progress 1.00"
time=2025-07-02T07:21:29.271Z level=DEBUG source=server.go:646 msg="model load completed, waiting for server to become available" status="llm server loading model"
llama_kv_cache_unified:        CPU KV buffer size =  4080.00 MiB
llama_kv_cache_unified: KV self size  = 4080.00 MiB, K (q8_0): 2040.00 MiB, V (q8_0): 2040.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 1
llama_context: max_nodes = 65536
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 0
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context: reserving graph for n_tokens = 1, n_seqs = 1
llama_context: reserving graph for n_tokens = 512, n_seqs = 1
llama_context:        CPU compute buffer size =   316.75 MiB
llama_context: graph nodes  = 2183
llama_context: graph splits = 1
time=2025-07-02T07:21:30.526Z level=INFO source=server.go:637 msg="llama runner started in 3.01 seconds"
time=2025-07-02T07:21:30.526Z level=DEBUG source=sched.go:495 msg="finished setting up" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720
time=2025-07-02T07:21:30.526Z level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=52 format=""
time=2025-07-02T07:21:30.526Z level=TRACE source=server.go:737 msg="completion request" prompt="<|im_start|>user\nhi<|im_end|>\n<|im_start|>assistant\n"
time=2025-07-02T07:21:30.528Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=0 prompt=9 used=0 remaining=9
[GIN] 2025/07/02 - 07:23:14 | 200 |         1m49s |  <redacted> | POST     "/api/chat"
time=2025-07-02T07:23:14.028Z level=DEBUG source=sched.go:503 msg="context for request finished"
time=2025-07-02T07:23:14.028Z level=DEBUG source=sched.go:343 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720 duration=1h0m0s
time=2025-07-02T07:23:14.028Z level=DEBUG source=sched.go:361 msg="after processing request finished event" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720 refCount=0
time=2025-07-02T07:23:14.139Z level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-07-02T07:23:14.140Z level=DEBUG source=server.go:736 msg="completion request" images=0 prompt=1353 format=""
time=2025-07-02T07:23:14.140Z level=TRACE source=server.go:737 msg="completion request" prompt="<|im_start|>user\n### Task:\nGenerate a concise, 3-5 word title with an emoji summarizing the chat history.\n### Guidelines:\n- The title should clearly represent the main theme or subject of the conversation.\n- Use emojis that enhance understanding of the topic, but avoid quotation marks or special formatting.\n- Write the title in the chat's primary language; default to English if multilingual.\n- Prioritize accuracy over excessive creativity; keep it clear and simple.\n- Your entire response must consist solely of the JSON object, without any introductory or concluding text.\n- The output must be a single, raw JSON object, without any markdown code fences or other encapsulating text.\n- Ensure no conversational text, affirmations, or explanations precede or follow the raw JSON output, as this will cause direct parsing failure.\n### Output:\nJSON format: { \"title\": \"your concise title here\" }\n### Examples:\n- { \"title\": \"📉 Stock Market Trends\" },\n- { \"title\": \"🍪 Perfect Chocolate Chip Recipe\" },\n- { \"title\": \"Evolution of Music Streaming\" },\n- { \"title\": \"Remote Work Productivity Tips\" },\n- { \"title\": \"Artificial Intelligence in Healthcare\" },\n- { \"title\": \"🎮 Video Game Development Insights\" }\n### Chat History:\n<chat_history>\nUSER: hi\nASSISTANT: Hello! 😊 How can I assist you today?\n</chat_history><|im_end|>\n<|im_start|>assistant\n"
time=2025-07-02T07:23:14.147Z level=DEBUG source=cache.go:104 msg="loading cache slot" id=0 cache=94 prompt=297 used=3 remaining=294
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:615 msg="evaluating already loaded" model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:151 msg=reloading runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:287 msg="resetting model to expire immediately to make room" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720 refCount=1
time=2025-07-02T07:23:29.810Z level=DEBUG source=sched.go:300 msg="waiting for pending requests to complete and unload to occur" runner.name=registry.ollama.ai/library/qwen3:32b runner.inference=cuda runner.devices=4 runner.size="45.3 GiB" runner.vram="45.3 GiB" runner.parallel=1 runner.pid=3973568 runner.model=/root/.ollama/models/blobs/sha256-3291abe70f16ee9682de7bfae08db5373ea9d6497e614aaad63340ad421d6312 runner.num_ctx=30720