[GH-ISSUE #7919] Performance decline #67125

Closed
opened 2026-05-04 09:30:55 -05:00 by GiteaMirror · 18 comments
Owner

Originally created by @axil76 on GitHub (Dec 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7919

What is the issue?

I am testing a vGPU on a vSphere 8 cluster. The drivers work on the Red Hat 8 OS and in Docker. When the VM boots, the Ollama server responds well, but after several minutes it no longer responds. The last log output is:

```
Device 0: NVIDIA L40S-24C, compute capability 8.9, VMM: no
time=2024-12-03T14:30:07.963Z level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model"
```

After that the service no longer responds, even though the nvidia-persistenced service is running. I don't understand where the problem comes from; when the card was attached directly to the VM, it worked.
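For anyone reproducing this, a quick way to see whether the server has stopped responding (a generic check, not part of the original report; host and port are assumptions, adjust to your setup) is to poll the ollama HTTP API with a timeout so a hang is visible:

```shell
# Poll the ollama API (11434 is the default port).
# --max-time makes a hung server fail fast instead of blocking forever.
curl -s --max-time 10 http://localhost:11434/api/version \
  || echo "ollama server not responding"
```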

In Docker, `nvidia-smi` reports:

```
Tue Dec  3 14:37:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05      CUDA Version: 12.4    |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S-24C                Off |   00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |  12571MiB /  24576MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```

ollama version
ollama version is 0.4.7

Thanks for your answers.

OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.4.7

GiteaMirror added the bug, needs more info labels 2026-05-04 09:30:55 -05:00

@rick-github commented on GitHub (Dec 3, 2024):

Adding full [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@axil76 commented on GitHub (Dec 3, 2024):

[ollama.log](https://github.com/user-attachments/files/17995772/ollama.log)

Example of a request: after I get the message `msg="waiting for server to become available" status="llm server loading model"` there is no response, and I have to restart the container.


@rick-github commented on GitHub (Dec 3, 2024):

The server did become available after 36 seconds:

```
time=2024-12-03T16:04:59.050Z level=INFO source=server.go:598 msg="llama runner started in 36.59 seconds"
```

What client is accessing the model? If you add `OLLAMA_DEBUG=1` to the server environment there might be something in the logs to indicate what is happening.
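As a sketch of how that might be done with the Docker install (the image, volume, and container names here are assumptions; adapt them to your own `docker run` command or compose file):

```shell
# Recreate the container with debug logging enabled.
docker run -d --gpus=all \
  -e OLLAMA_DEBUG=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

# Then follow the (now more verbose) server logs.
docker logs -f ollama
```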


@axil76 commented on GitHub (Dec 3, 2024):

[ollama.log](https://github.com/user-attachments/files/17997228/ollama.log)

With `OLLAMA_DEBUG=1`. The log is very long... it writes a message every 10 seconds. Now it's Continue that makes the requests to ollama. What is strange is that just after boot it works very well for a few minutes, and then it doesn't work anymore; the performance deteriorates.


@rick-github commented on GitHub (Dec 3, 2024):

What model are you using? The `GET` every 10 seconds is something outside of the container doing (presumably) a health check. The log finishes right after the model was ready and the prompt was being processed, was it restarted at that point or did you leave off the end of the log because there was nothing interesting?


@AdminOfOz commented on GitHub (Dec 3, 2024):

I can somewhat anecdotally confirm that I have experienced a similar degradation of service: the initial boot of ollama works great, but after a few requests or after some time it does not work.
The only other log I'm getting is "gpu VRAM usage didn't recover within timeout".

SIDE NOTE: I recently went through a large infrastructure change that might invalidate my report.
My previous setup:
I was using ollama in a Docker container with a 4090, and I believe it was the previous version of ollama.
My current setup:
I took the same hardware and converted it so that I'm now using GPU passthrough in Proxmox. I also upgraded to the current version of ollama (no longer via Docker) during this period, so I cannot say whether the change in performance was due to the version upgrade or to passing through the GPU, and quite frankly it was too much work to get GPU passthrough working.

Current nvidia smi:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                    0 |
| 31%   41C    P0             98W /  480W |      36MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

The process runs in a terminal, so not great logging... I know... I know. I did see this error (captured from a split terminal pane):

```
gpu VRAM usage didn't recover within timeout" seconds=6.*
INFO: 192.168.1.30:55030 - "POST /ollama/ap
```


@rick-github commented on GitHub (Dec 4, 2024):

Either there's no model loaded, or ollama is not using the GPU.

Ollama doesn't have any endpoints that start with "/ollama".

I know it's in a terminal, but logs would be required for any debugging.
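For reference, the troubleshooting docs describe how to capture server output outside Docker; a minimal sketch for a terminal or systemd setup (the `ollama.log` filename is just an example):

```shell
# Running in a terminal: capture server output to a file while still seeing it.
ollama serve 2>&1 | tee ollama.log

# On a systemd install, pull the service logs instead:
journalctl -u ollama --no-pager > ollama.log
```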


@axil76 commented on GitHub (Dec 4, 2024):

> What model are you using? The `GET` every 10 seconds is something outside of the container doing (presumably) a health check. The log finishes right after the model was ready and the prompt was being processed, was it restarted at that point or did you leave off the end of the log because there was nothing interesting?

There was not much in the log. On the other hand, I use the vGRID drivers matching the version installed on the ESXi host; I don't know whether the problem comes from that.


@rick-github commented on GitHub (Dec 12, 2024):

From #8023, it's possible the performance decline from the original post is a licensing issue. What's the output of `nvidia-smi -q`?
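A minimal way to pull just the licensing section out of that output (assuming the vGPU driver exposes one; the exact section names can vary by driver version):

```shell
# Show license-related fields from the full GPU query;
# an unlicensed vGPU is throttled after its grace period expires.
nvidia-smi -q | grep -iA 2 license
```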


@clduab11 commented on GitHub (Dec 23, 2024):

For what it's worth...

I'm running CUDA version 12.7. (I think NVIDIA's site said my vGPU doesn't support the license system, though running `nvidia-smi -q` didn't even show any vGPU information for me.)

I wanted to throw my hat in the ring and say I'm having very wonky inference times, whereas in previous versions I did not, and wondered if this issue may be related. I'll do my best to provide full logs... I launch with docker-compose.yaml (but unfortunately don't have debug mode in my .yaml)...

[Screenshot 2024-12-22 193747]

This is one such example of time to first token (over 5 minutes). The llama runner took a long time to start... I will fit in all the logs I can, from first to last... I know my Pipelines throws an error, but it isn't related to the poor inference. This happens with or without pipelines in my configuration.

```
2024-12-22 19:32:21 open-webui | {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'messages': [{'role': 'user', 'content': '### Task:\nYou are an autocompletion system. Continue the text in based on the **completion type** inand the given language. \n\n### **Instructions**:\n1. Analyzefor context and meaning. \n2. Useto guide your output: \n - **General**: Provide a natural, concise continuation. \n - **Search Query**: Complete as if generating a realistic search query. \n3. Start as if you are directly continuing. Do **not** repeat, paraphrase, or respond as a model. Simply complete the text. \n4. Ensure the continuation:\n - Flows naturally from . \n - Avoids repetition, overexplaining, or unrelated ideas. \n5. If unsure, return: { "text": "" }. \n\n### **Output Rules**:\n- Respond only in JSON format: { "text": "<your_completion>" }.\n\n### **Examples**:\n#### Example 1: \nInput: \n<type>General</type> \n<text>The sun was setting over the horizon, painting the sky</text> \nOutput: \n{ "text": "with vibrant shades of orange and pink." }\n\n#### Example 2: \nInput: \n<type>Search Query</type> \n<text>Top-rated restaurants in</text> \nOutput: \n{ "text": "New York City for Italian cuisine." } \n\n---\n### Context:\n<chat_history>\n\n</chat_history>\n<type>search query</type> \n<text>Homer, talk to me about </text> \n#### Output:\n'}], 'stream': False, 'metadata': {'task': 'autocomplete_generation', 'task_body': {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'prompt': 'Homer, talk to me about ', 'type': 'search query', 'stream': False}, 'chat_id': None}}
2024-12-22 19:32:21 open-webui | INFO: 127.0.0.1:57710 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.161Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.083582445 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.411Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.333496096 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 33289"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.647Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.661Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.583968982 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:32:24 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:32:24 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:33289"
2024-12-22 19:32:24 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.878Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2024-12-22 19:32:24 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest))
2024-12-22 19:32:24 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 1: general.type str = model
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 6: general.size_label str = 7B
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"]
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:32:24 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:32:24 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:32:25 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:32:25 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:32:25 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:32:25 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:32:25 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:32:25 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:32:25 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:32:25 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:32:25 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:25 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: max token length = 256
2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/new HTTP/1.1" 200 OK
2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:32:25 open-webui | INFO [open_webui.apps.openai.main] get_all_models()
2024-12-22 19:32:25 pipelines | INFO: 172.18.0.1:56674 - "GET /models HTTP/1.1" 200 OK
2024-12-22 19:32:26 open-webui | INFO [open_webui.apps.ollama.main] get_all_models()
2024-12-22 19:32:26 ollama | [GIN] 2024/12/23 - 01:32:26 | 200 | 77.050597ms | 172.18.0.5 | GET "/api/tags"
2024-12-22 19:32:27 pipelines | pipe:blueprints.function_calling_blueprint
2024-12-22 19:32:27 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'}
2024-12-22 19:32:27 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions
2024-12-22 19:32:27 pipelines | INFO: 172.18.0.1:37230 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK
2024-12-22 19:32:28 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434
2024-12-22 19:32:51 open-webui | Fetching models from https://api.mistral.ai/v1/models
2024-12-22 19:32:51 open-webui | INFO: 127.0.0.1:42742 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:33:21 open-webui | INFO: 127.0.0.1:36904 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:33:51 open-webui | INFO: 127.0.0.1:58296 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:34:21 open-webui | INFO: 127.0.0.1:60204 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:34:51 open-webui | INFO: 127.0.0.1:43350 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:35:11 ollama | llm_load_tensors: offloading 28 repeating layers to GPU
2024-12-22 19:35:11 ollama | llm_load_tensors: offloaded 28/29 layers to GPU
2024-12-22 19:35:11 ollama | llm_load_tensors: CPU_Mapped model buffer size = 852.73 MiB
2024-12-22 19:35:11 ollama | llm_load_tensors: CUDA0 model buffer size = 5106.06 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_seq_max = 1
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx = 8192
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_batch = 512
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ubatch = 512
2024-12-22 19:35:12 ollama | llama_new_context_with_model: flash_attn = 0
2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_base = 1000000.0
2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_scale = 1
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
2024-12-22 19:35:12 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CPU output buffer size = 0.59 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph nodes = 986
2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph splits = 4 (with bs=512), 3 (with bs=1)
2024-12-22 19:35:13 ollama | time=2024-12-23T01:35:13.002Z level=INFO source=server.go:594 msg="llama runner started in 168.39 seconds"
2024-12-22 19:35:15 ollama | [GIN] 2024/12/23 - 01:35:15 | 200 | 2m56s | 172.18.0.5 | POST "/api/chat"
2024-12-22 19:35:15 open-webui | INFO: 172.18.0.1:57508 - "POST /api/task/auto/completions HTTP/1.1" 200 OK
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.743Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.744Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=145 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.747Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 39149"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.904Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:35:20 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:35:20 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=128
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z
```
level=INFO source=.:0 msg="Server listening on 127.0.0.1:39149" 2024-12-22 19:35:21 ollama | time=2024-12-23T01:35:21.000Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:35:21 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:35:21 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:35:21 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318 2024-12-22 19:35:21 ollama | llama_model_loader: - type f32: 141 tensors 2024-12-22 19:35:21 ollama | llama_model_loader: - type q6_K: 198 tensors 2024-12-22 19:35:21 open-webui | INFO: 127.0.0.1:55554 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:35:21 ollama | llm_load_vocab: special tokens cache size = 22 2024-12-22 19:35:21 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB 2024-12-22 19:35:21 ollama | llm_load_print_meta: format = GGUF V3 (latest) 2024-12-22 19:35:21 ollama | llm_load_print_meta: arch = qwen2 2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab type = BPE 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_vocab = 152064 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_merges = 151387 2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab_only = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_train = 32768 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd = 3584 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_layer = 28 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head = 28 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head_kv = 4 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_rot = 128 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_swa = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_k = 128 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_v = 128 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_gqa = 7 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_k_gqa = 512 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_v_gqa = 512 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00 2024-12-22 19:35:21 ollama | 
llm_load_print_meta: f_norm_rms_eps = 1.0e-06 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ff = 18944 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert_used = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: causal attn = 1 2024-12-22 19:35:21 ollama | llm_load_print_meta: pooling type = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: rope type = 2 2024-12-22 19:35:21 ollama | llm_load_print_meta: rope scaling = linear 2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_base_train = 1000000.0 2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_scale_train = 1 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768 2024-12-22 19:35:21 ollama | llm_load_print_meta: rope_finetuned = unknown 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_conv = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_inner = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_state = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_rank = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: model type = 7B 2024-12-22 19:35:21 ollama | llm_load_print_meta: model ftype = Q6_K 2024-12-22 19:35:21 ollama | llm_load_print_meta: model params = 7.62 B 2024-12-22 19:35:21 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW) 2024-12-22 19:35:21 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>' 
2024-12-22 19:35:21 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: max token length = 256 2024-12-22 19:35:51 open-webui | INFO: 127.0.0.1:36152 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:36:21 open-webui | INFO: 127.0.0.1:58318 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:36:51 open-webui | INFO: 127.0.0.1:46402 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:37:21 open-webui | INFO: 127.0.0.1:39014 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:37:22 ollama | llm_load_tensors: offloading 28 repeating layers to GPU 2024-12-22 19:37:22 ollama | llm_load_tensors: offloading output layer to GPU 2024-12-22 19:37:22 ollama | llm_load_tensors: offloaded 29/29 layers to GPU 2024-12-22 19:37:22 ollama | llm_load_tensors: CPU_Mapped model buffer size = 426.36 MiB 2024-12-22 19:37:22 ollama | llm_load_tensors: CUDA0 model buffer size = 5532.43 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_seq_max = 1 
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx = 8192 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_batch = 512 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ubatch = 512 2024-12-22 19:37:24 ollama | llama_new_context_with_model: flash_attn = 0 2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_base = 1000000.0 2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_scale = 1 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized 2024-12-22 19:37:24 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 492.00 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph nodes = 986 2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph splits = 2 2024-12-22 19:37:24 ollama | time=2024-12-23T01:37:24.223Z level=INFO source=server.go:594 msg="llama runner started in 123.48 seconds" 2024-12-22 19:37:24 open-webui | INFO: 172.18.0.1:57524 - "POST /ollama/api/chat HTTP/1.1" 200 OK 2024-12-22 19:37:31 ollama | [GIN] 2024/12/23 - 01:37:31 | 200 | 5m3s | 172.18.0.5 | POST "/api/chat" 2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK 2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK 2024-12-22 19:37:31 open-webui | INFO [open_webui.apps.openai.main] 
get_all_models() 2024-12-22 19:37:31 pipelines | INFO: 172.18.0.1:49478 - "GET /models HTTP/1.1" 200 OK 2024-12-22 19:37:32 open-webui | INFO [open_webui.apps.ollama.main] get_all_models() 2024-12-22 19:37:32 ollama | [GIN] 2024/12/23 - 01:37:32 | 200 | 63.627221ms | 172.18.0.5 | GET "/api/tags" 2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49494 - "POST /function_calling_scaffold/filter/outlet HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | Fetching models from https://api.mistral.ai/v1/models 2024-12-22 19:37:34 open-webui | <Encoding 'o200k_base'> 2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/chat/completed HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK 2024-12-22 19:37:34 pipelines | pipe:blueprints.function_calling_blueprint 2024-12-22 19:37:34 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'} 2024-12-22 19:37:34 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions 2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49500 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.076Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.190654251 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.326Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.440366167 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.576Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.690480729 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.584Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 46047" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=sched.go:449 msg="loaded runners" count=1 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.749Z level=INFO source=runner.go:945 msg="starting go runner" 2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2024-12-22 19:37:40 ollama | 
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 2024-12-22 19:37:40 ollama | ggml_cuda_init: found 1 CUDA devices: 2024-12-22 19:37:40 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:46047" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.837Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:37:40 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:37:41 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:37:41 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318 2024-12-22 19:37:41 ollama | llama_model_loader: - type f32: 141 tensors 2024-12-22 19:37:41 ollama | llama_model_loader: - type q6_K: 198 tensors 2024-12-22 19:37:41 ollama | llm_load_vocab: special tokens cache size = 22 2024-12-22 19:37:41 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB 2024-12-22 19:37:41 ollama | llm_load_print_meta: format = GGUF V3 (latest) 2024-12-22 19:37:41 ollama | llm_load_print_meta: arch = qwen2 2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab type = BPE 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_vocab = 152064 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_merges = 151387 2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab_only = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_train = 32768 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd = 3584 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_layer = 28 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head = 28 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head_kv = 4 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_rot = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_swa = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_k = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_v = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_gqa = 7 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_k_gqa = 512 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_v_gqa = 512 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06 2024-12-22 19:37:41 ollama | 
llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ff = 18944 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert_used = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: causal attn = 1 2024-12-22 19:37:41 ollama | llm_load_print_meta: pooling type = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope type = 2 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope scaling = linear 2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_base_train = 1000000.0 2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_scale_train = 1 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope_finetuned = unknown 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_conv = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_inner = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_state = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_rank = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: model type = 7B 2024-12-22 19:37:41 ollama | llm_load_print_meta: model ftype = Q6_K 2024-12-22 19:37:41 ollama | llm_load_print_meta: model params = 7.62 B 2024-12-22 19:37:41 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW) 2024-12-22 19:37:41 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:37:41 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: PAD token = 151643 
'<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: max token length = 256

Here are the parts of the logs that look relevant to my admittedly untrained eye:

2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
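If it helps anyone eyeball these warnings, here is a small sketch (my own helper, not part of Ollama) that pulls the reported wait duration out of each `sched.go` "VRAM usage didn't recover" line, so you can see whether the waits keep growing:

```python
import re

# Hypothetical helper: extract the wait duration (seconds=...) from each
# "gpu VRAM usage didn't recover within timeout" warning emitted by sched.go.
WARN_RE = re.compile(
    r'msg="gpu VRAM usage didn\'t recover within timeout" seconds=([0-9.]+) model=(\S+)'
)

def vram_recovery_waits(log_text: str) -> list[float]:
    """Return the reported wait in seconds for each warning line, in order."""
    return [float(m.group(1)) for m in WARN_RE.finditer(log_text)]

sample = (
    'time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 '
    'msg="gpu VRAM usage didn\'t recover within timeout" '
    'seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab9\n'
    'time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 '
    'msg="gpu VRAM usage didn\'t recover within timeout" '
    'seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab9\n'
)
print(vram_recovery_waits(sample))  # → [5.159822309, 5.659698822]
```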

I also noticed that the llama runner takes an inordinate amount of time to start ("llama runner started in 168.39 seconds" appears further down in the logs).
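For comparing the latencies the logs report (the `[GIN]` lines print Go-style durations like `2m56s`, `5m3s`, and `77.050597ms`), a quick throwaway converter to seconds, written by me for this thread:

```python
import re

# Hypothetical helper: convert Go-style duration strings from the [GIN]
# request log (e.g. "2m56s", "5m3s", "77.050597ms") into seconds.
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}

def go_duration_to_seconds(d: str) -> float:
    total = 0.0
    # "ms" must be tried before "m" and "s" in the alternation.
    for value, unit in re.findall(r'([0-9.]+)(ms|h|m|s)', d):
        total += float(value) * UNIT_SECONDS[unit]
    return total

print(go_duration_to_seconds("5m3s"))   # → 303.0
print(go_duration_to_seconds("2m56s"))  # → 176.0
```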

I'm not sure whether any of this is helpful, but I've been racking my brain trying to figure it out. As I said at the top, I run my configuration through a .yaml in Docker Compose; I can upload the .yaml if that would help.
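For reference, this is roughly the shape of Compose file involved (a minimal sketch, not my actual file; service names and volume names are placeholders). Setting `OLLAMA_DEBUG=1` turns on the verbose server logging that would have helped here:

```yaml
# Minimal sketch (assumed names/values, not the poster's actual file):
# an Ollama service with NVIDIA GPU access and debug logging enabled.
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_DEBUG=1          # verbose server logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```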

<!-- gh-comment-id:2558739336 --> @clduab11 commented on GitHub (Dec 23, 2024): For what it's worth, I'm running CUDA version 12.7 (I think NVIDIA's site said my vGPU doesn't support the licensing system, though running `nvidia-smi -q` didn't show any vGPU information for me). I wanted to throw my hat in the ring and say I'm seeing very wonky inference times that I did not see in previous versions, and wondered whether it might be related to this issue. I'll do my best to provide full logs. I launch with a docker-compose.yaml (which unfortunately doesn't have debug mode enabled). ![Screenshot 2024-12-22 193747](https://github.com/user-attachments/assets/6459d594-b2a7-4054-abff-3e9cf9220e92) This is one example of time to first token taking over 5 minutes. The llama runner also took a long time to start. I'll fit in all the logs I can, from first to last. I know my Pipelines setup throws an error, but it isn't related to the poor inference; this happens with or without Pipelines in my configuration. `2024-12-22 19:32:21 open-webui | {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'messages': [{'role': 'user', 'content': '### Task:\nYou are an autocompletion system. Continue the text in `<text>` based on the **completion type** in `<type>` and the given language. \n\n### **Instructions**:\n1. Analyze `<text>` for context and meaning. \n2. Use `<type>` to guide your output: \n - **General**: Provide a natural, concise continuation. \n - **Search Query**: Complete as if generating a realistic search query. \n3. Start as if you are directly continuing `<text>`. Do **not** repeat, paraphrase, or respond as a model. Simply complete the text. \n4. Ensure the continuation:\n - Flows naturally from `<text>`. \n - Avoids repetition, overexplaining, or unrelated ideas. \n5. If unsure, return: `{ "text": "" }`. 
\n\n### **Output Rules**:\n- Respond only in JSON format: `{ "text": "<your_completion>" }`.\n\n### **Examples**:\n#### Example 1: \nInput: \n<type>General</type> \n<text>The sun was setting over the horizon, painting the sky</text> \nOutput: \n{ "text": "with vibrant shades of orange and pink." }\n\n#### Example 2: \nInput: \n<type>Search Query</type> \n<text>Top-rated restaurants in</text> \nOutput: \n{ "text": "New York City for Italian cuisine." } \n\n---\n### Context:\n<chat_history>\n\n</chat_history>\n<type>search query</type> \n<text>Homer, talk to me about </text> \n#### Output:\n'}], 'stream': False, 'metadata': {'task': 'autocomplete_generation', 'task_body': {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'prompt': 'Homer, talk to me about ', 'type': 'search query', 'stream': False}, 'chat_id': None}} 2024-12-22 19:32:21 open-webui | INFO: 127.0.0.1:57710 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.161Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.083582445 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.411Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.333496096 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 
GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 33289" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=sched.go:449 msg="loaded runners" count=1 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.647Z level=INFO source=runner.go:945 msg="starting go runner" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.661Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.583968982 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6 2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 2024-12-22 19:32:24 ollama | ggml_cuda_init: found 1 CUDA devices: 2024-12-22 19:32:24 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8 2024-12-22 
19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:33289" 2024-12-22 19:32:24 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.878Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:32:24 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:32:24 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318 2024-12-22 19:32:24 ollama | llama_model_loader: - type f32: 141 tensors 2024-12-22 19:32:24 ollama | llama_model_loader: - type q6_K: 198 tensors 2024-12-22 19:32:25 ollama | llm_load_vocab: special tokens cache size = 22 2024-12-22 19:32:25 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB 2024-12-22 19:32:25 ollama | llm_load_print_meta: format = GGUF V3 (latest) 2024-12-22 19:32:25 ollama | llm_load_print_meta: arch = qwen2 2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab type = BPE 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_vocab = 152064 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_merges = 151387 2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab_only = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_train = 32768 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd = 3584 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_layer = 28 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head = 28 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head_kv = 4 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_rot = 128 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_swa = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_k = 128 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_v = 128 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_gqa = 7 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_k_gqa = 512 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_v_gqa = 512 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06 2024-12-22 19:32:25 ollama | 
llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ff = 18944 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert_used = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: causal attn = 1 2024-12-22 19:32:25 ollama | llm_load_print_meta: pooling type = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: rope type = 2 2024-12-22 19:32:25 ollama | llm_load_print_meta: rope scaling = linear 2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_base_train = 1000000.0 2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_scale_train = 1 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768 2024-12-22 19:32:25 ollama | llm_load_print_meta: rope_finetuned = unknown 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_conv = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_inner = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_state = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_rank = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: model type = 7B 2024-12-22 19:32:25 ollama | llm_load_print_meta: model ftype = Q6_K 2024-12-22 19:32:25 ollama | llm_load_print_meta: model params = 7.62 B 2024-12-22 19:32:25 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW) 2024-12-22 19:32:25 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:32:25 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: PAD token = 151643 
'<|endoftext|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: max token length = 256 2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/new HTTP/1.1" 200 OK 2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK 2024-12-22 19:32:25 open-webui | INFO [open_webui.apps.openai.main] get_all_models() 2024-12-22 19:32:25 pipelines | INFO: 172.18.0.1:56674 - "GET /models HTTP/1.1" 200 OK 2024-12-22 19:32:26 open-webui | INFO [open_webui.apps.ollama.main] get_all_models() 2024-12-22 19:32:26 ollama | [GIN] 2024/12/23 - 01:32:26 | 200 | 77.050597ms | 172.18.0.5 | GET "/api/tags" 2024-12-22 19:32:27 pipelines | pipe:blueprints.function_calling_blueprint 2024-12-22 19:32:27 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'} 2024-12-22 19:32:27 pipelines | Error: 400 Client Error: Bad Request for url: 
https://api.openai.com/v1/chat/completions 2024-12-22 19:32:27 pipelines | INFO: 172.18.0.1:37230 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK 2024-12-22 19:32:28 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434 2024-12-22 19:32:51 open-webui | Fetching models from https://api.mistral.ai/v1/models 2024-12-22 19:32:51 open-webui | INFO: 127.0.0.1:42742 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:33:21 open-webui | INFO: 127.0.0.1:36904 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:33:51 open-webui | INFO: 127.0.0.1:58296 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:34:21 open-webui | INFO: 127.0.0.1:60204 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:34:51 open-webui | INFO: 127.0.0.1:43350 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:35:11 ollama | llm_load_tensors: offloading 28 repeating layers to GPU 2024-12-22 19:35:11 ollama | llm_load_tensors: offloaded 28/29 layers to GPU 2024-12-22 19:35:11 ollama | llm_load_tensors: CPU_Mapped model buffer size = 852.73 MiB 2024-12-22 19:35:11 ollama | llm_load_tensors: CUDA0 model buffer size = 5106.06 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_seq_max = 1 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx = 8192 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_batch = 512 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ubatch = 512 2024-12-22 19:35:12 ollama | llama_new_context_with_model: flash_attn = 0 2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_base = 1000000.0 2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_scale = 1 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized 2024-12-22 19:35:12 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: KV self size = 
448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: CPU output buffer size = 0.59 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph nodes = 986 2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph splits = 4 (with bs=512), 3 (with bs=1) 2024-12-22 19:35:13 ollama | time=2024-12-23T01:35:13.002Z level=INFO source=server.go:594 msg="llama runner started in 168.39 seconds" 2024-12-22 19:35:15 ollama | [GIN] 2024/12/23 - 01:35:15 | 200 | 2m56s | 172.18.0.5 | POST "/api/chat" 2024-12-22 19:35:15 open-webui | INFO: 172.18.0.1:57508 - "POST /api/task/auto/completions HTTP/1.1" 200 OK 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.743Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.744Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=145 layers.model=29 
layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.747Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 39149" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=sched.go:449 msg="loaded runners" count=1 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.904Z level=INFO source=runner.go:945 msg="starting go runner" 2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 2024-12-22 19:35:20 ollama | ggml_cuda_init: found 1 CUDA devices: 2024-12-22 19:35:20 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=128 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z 
level=INFO source=.:0 msg="Server listening on 127.0.0.1:39149" 2024-12-22 19:35:21 ollama | time=2024-12-23T01:35:21.000Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:35:21 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:35:21 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:35:21 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
`

```
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:35:21 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:35:21 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:35:21 open-webui | INFO: 127.0.0.1:55554 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:35:21 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:35:21 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:35:21 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:35:21 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:35:21 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:35:21 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:35:21 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:35:21 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:35:21 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:35:21 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: max token length = 256
2024-12-22 19:35:51 open-webui | INFO: 127.0.0.1:36152 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:36:21 open-webui | INFO: 127.0.0.1:58318 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:36:51 open-webui | INFO: 127.0.0.1:46402 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:37:21 open-webui | INFO: 127.0.0.1:39014 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:37:22 ollama | llm_load_tensors: offloading 28 repeating layers to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: offloading output layer to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: offloaded 29/29 layers to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: CPU_Mapped model buffer size = 426.36 MiB
2024-12-22 19:37:22 ollama | llm_load_tensors: CUDA0 model buffer size = 5532.43 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_seq_max = 1
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx = 8192
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_batch = 512
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ubatch = 512
2024-12-22 19:37:24 ollama | llama_new_context_with_model: flash_attn = 0
2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_base = 1000000.0
2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_scale = 1
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
2024-12-22 19:37:24 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 492.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph nodes = 986
2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph splits = 2
2024-12-22 19:37:24 ollama | time=2024-12-23T01:37:24.223Z level=INFO source=server.go:594 msg="llama runner started in 123.48 seconds"
2024-12-22 19:37:24 open-webui | INFO: 172.18.0.1:57524 - "POST /ollama/api/chat HTTP/1.1" 200 OK
2024-12-22 19:37:31 ollama | [GIN] 2024/12/23 - 01:37:31 | 200 | 5m3s | 172.18.0.5 | POST "/api/chat"
2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK
2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:37:31 open-webui | INFO [open_webui.apps.openai.main] get_all_models()
2024-12-22 19:37:31 pipelines | INFO: 172.18.0.1:49478 - "GET /models HTTP/1.1" 200 OK
2024-12-22 19:37:32 open-webui | INFO [open_webui.apps.ollama.main] get_all_models()
2024-12-22 19:37:32 ollama | [GIN] 2024/12/23 - 01:37:32 | 200 | 63.627221ms | 172.18.0.5 | GET "/api/tags"
2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49494 - "POST /function_calling_scaffold/filter/outlet HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | Fetching models from https://api.mistral.ai/v1/models
2024-12-22 19:37:34 open-webui | <Encoding 'o200k_base'>
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/chat/completed HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:37:34 pipelines | pipe:blueprints.function_calling_blueprint
2024-12-22 19:37:34 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'}
2024-12-22 19:37:34 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions
2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49500 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.076Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.190654251 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.326Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.440366167 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.576Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.690480729 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.584Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 46047"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.749Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:37:40 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:37:40 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:46047"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.837Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2024-12-22 19:37:40 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free
2024-12-22 19:37:41 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest))
2024-12-22 19:37:41 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 1: general.type str = model
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 6: general.size_label str = 7B
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"]
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n    {{- '<|im_start|>...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:37:41 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:37:41 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:37:41 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:37:41 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:37:41 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:37:41 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:37:41 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:37:41 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:37:41 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:37:41 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:37:41 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:37:41 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:37:41 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: max token length = 256
```

Parts of logs I feel could be relevant to my noob eyes?
```
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
```

I also thought somewhere I saw the llama runner take an inordinate amount of time? I'm not sure if any of this is helpful or not, but I've been racking my brains trying to figure it out. As I said at the top, I run my configuration through a .yaml in Docker Compose. I can upload the .yaml if it's helpful.

@rick-github commented on GitHub (Dec 23, 2024):

Could you either add block markdown markers around the logs (```) or add the logs as an attachment? It's very difficult to parse the logs as they are.

<!-- gh-comment-id:2558747029 -->

@rick-github commented on GitHub (Dec 23, 2024):

Actually, I see that there is some sort of block; it seems that the text inside is badly formatted. Adding it as an attachment would help a lot.

<!-- gh-comment-id:2558747977 -->

@rick-github commented on GitHub (Dec 23, 2024):

From your screenshot, the model generated a very respectable 36 tokens per second, but the overall response took just over 5 minutes. Logs will show for sure, but it seems most of this time was spent loading the model. If you haven't set `OLLAMA_KEEP_ALIVE` in your docker compose file, then ollama will unload a model after 5 minutes of inactivity. This may lead to the "wonky inference times" you mention - the first inference takes 5 minutes because the model needs to load, a second inference takes 7 seconds, you leave it for 10 minutes and the third inference takes 5 minutes because the model has been evicted.
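For reference, a minimal compose fragment showing where `OLLAMA_KEEP_ALIVE` goes; the service name, image tag, and the 30-minute duration are illustrative, not from this thread:

```yaml
services:
  ollama:
    image: ollama/ollama
    environment:
      # Keep loaded models resident for 30 minutes of inactivity
      # instead of the 5-minute default; "-1" keeps them loaded indefinitely.
      - OLLAMA_KEEP_ALIVE=30m
```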

<!-- gh-comment-id:2558759033 -->

@clduab11 commented on GitHub (Dec 23, 2024):

Thanks so much for the response @rick-github! My apologies, I'm definitely still very new to GitHub so I'll try to make this easier...

First of all, I'm sure this doesn't have a lot to do with it...but my Watchtower is included in my .yaml, and this is the current Ollama version I have...

![image](https://github.com/user-attachments/assets/8ed479a6-604b-4dca-aa5e-0fde49e5de28)

Otherwise, you're absolutely correct; it's definitely the model loading that's taking the longest. I wish I had prior logs from older versions, but my initial model load used to never be very long in the 0.4.x version(s).

I've done some testing this AM and this is what I've noticed...

![Screenshot 2024-12-23 075747](https://github.com/user-attachments/assets/996100f1-6c37-4716-9be7-cb5ae17c6f9c)

This is the data for the first inference, indicative of model load. However, when going to prompt my model a second time (immediately after the first output fully generated)... these were my results...

![Screenshot 2024-12-23 080244](https://github.com/user-attachments/assets/a39087ab-d019-42d0-aa2c-3f7f303ba309)

I noticed while watching my logs in Docker that it almost appeared as if... for lack of a better description, re-inferencing? It went through similar mechanisms twice before generating the second follow-up output (the screenshot directly above).

Here's some .txt's for my logs...One shows the logs between the 2nd input in -> 2nd output out, and one shows the full logs from the moment the first output generated -> 2nd output out (if that makes sense, so sorry if that's poorly phrased!)

[second-inf-logs.txt](https://github.com/user-attachments/files/18230164/second-inf-logs.txt)
[first-inference-to-end-of-2nd.txt](https://github.com/user-attachments/files/18230165/first-inference-to-end-of-2nd.txt)

<!-- gh-comment-id:2559776963 -->

@rick-github commented on GitHub (Dec 23, 2024):

It's not re-inferencing, open-webui has a couple of features which result in multiple calls to an LLM API for a single inference. The first is the summary generation, where open-webui takes the first response in a session and asks the LLM to summarize it so that it can add it to the chat list on the left hand panel. The second is auto-complete, where open-webui takes the text you've typed in and asks the LLM to guess what you are going to type, to autocomplete the prompt.

I think these are playing into the delays you are seeing because the size of context window keeps changing:

```
time=2024-12-23T13:53:30.207Z msg="starting llama server" cmd=" ... --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 38201"
time=2024-12-23T13:57:45.534Z msg="starting llama server" cmd=" ... --ctx-size 2048 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 39467"
time=2024-12-23T14:00:02.048Z msg="starting llama server" cmd=" ... --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 41171"
time=2024-12-23T14:02:37.657Z msg="starting llama server" cmd=" ... --ctx-size 2048 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 42163"
```

I think what's happening is that you have a context window of 8192 configured somewhere (in the model with `PARAMETER num_ctx` or in open-webui somewhere) and open-webui uses that for a completion, and then when it does its secondary completion (summary or autocomplete or some other "helper" function) it uses the default context window (either explicitly with `"options":{"num_ctx":2048}` or implicitly by not setting `num_ctx`). Unfortunately a change in context window results in a model eviction and immediate reload, which could cause the delays you are seeing - the actual completion finishes in seconds, but all the model unloading/loading around it makes it seem slow. I think you will have to poke around in the open-webui settings and either turn off these functions or configure them to use the same context window as the primary completion. There is an [open PR](https://github.com/ollama/ollama/pull/8029) which would alleviate this problem but it's not ready for integration yet.
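As a hedged sketch of the workaround for API clients: pin `num_ctx` on every request, helper calls included, so the runner's `--ctx-size` never changes between calls. The payload shape follows Ollama's `/api/chat` options; the host, model name, and messages below are placeholders:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/chat"  # placeholder host
NUM_CTX = 8192  # match the primary completion's context window

def chat_payload(model, messages):
    # Pin num_ctx explicitly; omitting it falls back to the server default
    # (2048 at the time of this thread), which forces a model reload.
    return {
        "model": model,
        "messages": messages,
        "options": {"num_ctx": NUM_CTX},
    }

primary = chat_payload("qwen2.5:7b", [{"role": "user", "content": "Hello"}])
helper = chat_payload("qwen2.5:7b", [{"role": "user", "content": "Summarize: Hello"}])

# Both requests carry the same num_ctx, so the scheduler sees a stable
# context size and does not evict the model between calls.
assert primary["options"] == helper["options"]
print(json.dumps(primary["options"]))
```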

<!-- gh-comment-id:2560395085 -->

@clduab11 commented on GitHub (Dec 24, 2024):

Oh wow, and to think I had seen that earlier this morning and was like "hmm that's odd, I wonder why my num_ctx is set at 2048 for that..." and figured the OWUI interface just "overrode" it somehow. This makes perfect sense, and I super appreciate you going out of your way to help me with this! I will reach out to the folks on OWUI's end and see where I should be configuring this to help alleviate some of the delay.

Thank you so so much! Ollama rocks!! :)

EDIT: Set the num_ctx at the model level (instead of at the system level) and disabled the Autogeneration feature in OWUI; this brought my initial load down by a substantial margin, and further conversation-style prompts to the model load as they should; woo! :)

Will eagerly await the next awesome update and the PR to be able to use the AutoComplete feature again without it evicting the model!
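For anyone landing here later, the model-level fix described above can be done with a Modelfile; a minimal sketch, where the base model name and the new model tag are illustrative:

```
FROM qwen2.5:7b
PARAMETER num_ctx 8192
```

Built with something like `ollama create qwen2.5-8k -f Modelfile`, every load of that model then requests the same context size, so requests that don't override `num_ctx` no longer trigger an eviction/reload cycle.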

<!-- gh-comment-id:2560465774 -->

@tne-ops commented on GitHub (Jan 31, 2025):

Just a little confused: the issue in ollama is closed but the ollama PR is still open after a month? It doesn't seem fixed? And causes very slow performance with open webui :-)

<!-- gh-comment-id:2628560206 -->

@rick-github commented on GitHub (Jan 31, 2025):

The performance decline was likely due to licensing issues, but the OP didn't respond to a request for more information, so this issue was closed as stale. The unrelated issue from a different poster would be resolved by the PR, but the ollama team are busy with other things. Feel free to open a new issue to highlight the need for the PR to be merged.

<!-- gh-comment-id:2628569361 -->

Reference: github-starred/ollama#67125