[GH-ISSUE #7597] Tool calling fails when using the ollama API from the open-webui endpoint #30339
Originally created by @ma3oun on GitHub (Dec 4, 2024).
Original GitHub issue: https://github.com/open-webui/open-webui/issues/7597
Bug Report
Installation Method
Docker (for ollama and open-webui)
Environment
Open WebUI Version: 0.4.7
Ollama (if applicable): 0.4.7
Operating System: Ubuntu 24.04 LTS
Browser (if applicable): N/A
Confirmation:
Expected Behavior:
I am using the ollama python library to make API calls to an ollama server through the open-webui API endpoint:
http://myserver:8080/ollama
I expect llama3.2:3b to use tools given a certain prompt, but that does not happen.
Actual Behavior:
No tools are called and llama3.2 returns a generic message. However, llama3.2 does use tools when I use the typical ollama endpoint:
http://myserver:11434/

Description
Bug Summary:
Function calling fails when using the openwebui ollama endpoint but works when calling ollama directly.
Reproduction Details
Steps to Reproduce:
Running this code gives a generic reply with no tool calls.
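A minimal sketch of the kind of call in question, assuming the ollama-python 0.4 client; the tool function, prompt, and API key placeholder are illustrative, not the original snippet:

```python
from ollama import Client

# Open WebUI's ollama passthrough endpoint; Open WebUI requires an API key
# (placeholder below). Swapping BASE_URL for http://myserver:11434/ talks
# to ollama directly.
BASE_URL = 'http://myserver:8080/ollama'
client = Client(host=BASE_URL, headers={'Authorization': 'Bearer <open-webui-api-key>'})

def add_two_numbers(a: int, b: int) -> int:
    """Add two numbers."""
    return a + b

response = client.chat(
    model='llama3.2:3b',
    messages=[{'role': 'user', 'content': 'What is 3 + 7? Use the tool.'}],
    tools=[add_two_numbers],  # ollama-python derives the JSON schema from the signature
)

# Against ollama directly this prints the expected tool call; through the
# Open WebUI endpoint it comes back empty, per this report.
print(response.message.tool_calls)
```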
But when I change BASE_URL to BASE_URL='http://myserver:11434/', the model calls the tool as expected.

Logs and Screenshots
N/A
Browser Console Logs:
N/A
Docker Container Logs:
Screenshots/Screen Recordings (if applicable):
N/A
Additional Information
N/A
@ishumilin commented on GitHub (Dec 9, 2024):
This sounds like a purely ollama issue. Try setting OLLAMA_KEEP_ALIVE=-1 to prevent it from unloading models from memory, i.e. start the container with:

docker run -d --gpus=all --restart always -v ollama:/root/.ollama -e OLLAMA_KEEP_ALIVE=-1 -p 11434:11434 --name ollama ollama/ollama
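If ollama is run with docker compose instead, a sketch of the equivalent service definition (the service and volume names are assumptions, and the GPU reservation syntax requires a recent compose version):

```yaml
services:
  ollama:
    image: ollama/ollama
    restart: always
    ports:
      - "11434:11434"
    environment:
      # Keep models loaded in memory indefinitely instead of unloading them
      - OLLAMA_KEEP_ALIVE=-1
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama:
```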
@ma3oun commented on GitHub (Dec 10, 2024):
I am not sure you understand my issue. The Ollama server works fine. My issue concerns models that can handle tools, such as llama3.2. When I use the ollama API endpoint (e.g. http://myserver:11434/), everything works fine. But when I use the endpoint through open-webui (e.g. http://myserver:8080/ollama), the model does send a response but never calls tools. By tools, I mean a Python function such as one that adds two numbers.
@ishumilin commented on GitHub (Dec 10, 2024):
I am just judging by your log files:
ollama | ggml_cuda_host_malloc: failed to allocate 896.00 MiB of pinned memory: no CUDA-capable device is detected
and other rows mention that the ollama container is experiencing issues, so I assume the problem arises in the ollama container and not in the webui itself.
@ishumilin commented on GitHub (Dec 15, 2024):
After digging into this problem a little longer: a temporary solution may be to edit /etc/docker/daemon.json and add "exec-opts": ["native.cgroupdriver=cgroupfs"], then restart Docker together with all containers (or better, the entire machine).
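For reference, a minimal /etc/docker/daemon.json with only this option set would look like the following; if the file already exists, merge the key into it rather than replacing the file:

```json
{
  "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```

Afterwards, restart Docker (e.g. sudo systemctl restart docker) so the setting takes effect.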
See this comment for details
@tjbck commented on GitHub (Dec 19, 2024):
I believe this issue has been addressed with https://github.com/open-webui/open-webui/pull/7920, testing wanted here!