[GH-ISSUE #7919] Performance decline #67125

Closed
opened 2026-05-04 09:30:55 -05:00 by GiteaMirror · 18 comments
Owner

Originally created by @axil76 on GitHub (Dec 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/7919

What is the issue?

I am testing a vGPU on a vSphere 8 cluster. The drivers work on the Red Hat 8 OS and in Docker. When the VM boots, the Ollama server responds well, but after several minutes it no longer responds. The last log output is:

```
Device 0: NVIDIA L40S-24C, compute capability 8.9, VMM: no
time=2024-12-03T14:30:07.963Z level=INFO source=server.go:593 msg="waiting for server to become available" status="llm server loading model"
```

After that the service no longer responds, even though the nvidia-persistenced service is running. I don't understand where the problem comes from; when the card was attached directly to the VM, it worked.
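For anyone reproducing this, a quick way to see whether the server has stopped responding (a generic check, not part of the original report; host and port are assumptions, adjust to your setup) is to poll the ollama HTTP API with a timeout so a hang is visible:

```shell
# Poll the ollama API (11434 is the default port).
# --max-time makes a hung server fail fast instead of blocking forever.
curl -s --max-time 10 http://localhost:11434/api/version \
  || echo "ollama server not responding"
```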

In Docker, `nvidia-smi` reports:

```
Tue Dec  3 14:37:24 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.05             Driver Version: 550.127.05      CUDA Version: 12.4    |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L40S-24C                Off |   00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |  12571MiB /  24576MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
```

ollama version
ollama version is 0.4.7

Thanks for your answers.

OS

Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.4.7

GiteaMirror added the bug, needs more info labels 2026-05-04 09:30:55 -05:00

@rick-github commented on GitHub (Dec 3, 2024):

Adding full [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will aid in debugging.


@axil76 commented on GitHub (Dec 3, 2024):

[ollama.log](https://github.com/user-attachments/files/17995772/ollama.log)

Example of a request: after I get the message `msg="waiting for server to become available" status="llm server loading model"` there is no response, and I have to restart the container.


@rick-github commented on GitHub (Dec 3, 2024):

The server did become available after 36 seconds:

```
time=2024-12-03T16:04:59.050Z level=INFO source=server.go:598 msg="llama runner started in 36.59 seconds"
```

What client is accessing the model? If you add `OLLAMA_DEBUG=1` to the server environment there might be something in the logs to indicate what is happening.
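As a sketch of how that might be done with the Docker install (the image, volume, and container names here are assumptions; adapt them to your own `docker run` command or compose file):

```shell
# Recreate the container with debug logging enabled.
docker run -d --gpus=all \
  -e OLLAMA_DEBUG=1 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama

# Then follow the (now more verbose) server logs.
docker logs -f ollama
```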


@axil76 commented on GitHub (Dec 3, 2024):

[ollama.log](https://github.com/user-attachments/files/17997228/ollama.log)

With `OLLAMA_DEBUG=1`. The log is very long... it writes a message every 10 seconds. Now it's Continue that makes the requests to ollama. What is strange is that just after boot it works very well for a few minutes, and then it doesn't work anymore; the performance deteriorates.


@rick-github commented on GitHub (Dec 3, 2024):

What model are you using? The `GET` every 10 seconds is something outside of the container doing (presumably) a health check. The log finishes right after the model was ready and the prompt was being processed, was it restarted at that point or did you leave off the end of the log because there was nothing interesting?


@AdminOfOz commented on GitHub (Dec 3, 2024):

I can somewhat anecdotally confirm that I have experienced a similar degradation of service: the initial boot of ollama works great, but after a few requests or after some time it does not work.
The only other log I'm getting is "gpu VRAM usage didn't recover within timeout".

SIDE NOTE: I recently went through a large infrastructure change that might invalidate my report.
My previous setup:
I was using ollama in a Docker container with a 4090, and I believe it was the previous version of ollama.
My current setup:
I took the same hardware and converted it so that I'm now using GPU passthrough in Proxmox. I also upgraded to the current version of ollama (no longer via Docker) during this period, so I cannot say whether the change in performance was due to the version upgrade or to passing through the GPU, and quite frankly it was too much work to get GPU passthrough working.

Current nvidia smi:

```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01              Driver Version: 565.57.01      CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                    0 |
| 31%   41C    P0             98W /  480W |      36MiB /  23028MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```

The process runs in a terminal, so not great logging... I know... I know. I did see this error (captured from a split terminal pane):

```
gpu VRAM usage didn't recover within timeout" seconds=6.*
INFO: 192.168.1.30:55030 - "POST /ollama/ap
```


@rick-github commented on GitHub (Dec 4, 2024):

Either there's no model loaded, or ollama is not using the GPU.

Ollama doesn't have any endpoints that start with "/ollama".

I know it's in a terminal, but logs would be required for any debugging.
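For reference, the troubleshooting docs describe how to capture server output outside Docker; a minimal sketch for a terminal or systemd setup (the `ollama.log` filename is just an example):

```shell
# Running in a terminal: capture server output to a file while still seeing it.
ollama serve 2>&1 | tee ollama.log

# On a systemd install, pull the service logs instead:
journalctl -u ollama --no-pager > ollama.log
```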


@axil76 commented on GitHub (Dec 4, 2024):

> What model are you using? The `GET` every 10 seconds is something outside of the container doing (presumably) a health check. The log finishes right after the model was ready and the prompt was being processed, was it restarted at that point or did you leave off the end of the log because there was nothing interesting?

There was not much in the log. On the other hand, I use the vGRID drivers matching the version installed on the ESXi host; I don't know whether the problem comes from that.


@rick-github commented on GitHub (Dec 12, 2024):

From #8023, it's possible the performance decline from the original post is a licensing issue. What's the output of `nvidia-smi -q`?
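A minimal way to pull just the licensing section out of that output (assuming the vGPU driver exposes one; the exact section names can vary by driver version):

```shell
# Show license-related fields from the full GPU query;
# an unlicensed vGPU is throttled after its grace period expires.
nvidia-smi -q | grep -iA 2 license
```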


@clduab11 commented on GitHub (Dec 23, 2024):

For what it's worth...

I'm running CUDA version 12.7. (I think NVIDIA's site said my vGPU doesn't support the license system, though running `nvidia-smi -q` didn't even show any vGPU information for me.)

I wanted to throw my hat in the ring and say I'm having very wonky inference times, whereas in previous versions I did not, and wondered if this issue may be related. I'll do my best to provide full logs... I launch with docker-compose.yaml (but unfortunately don't have debug mode in my .yaml)...

[Screenshot 2024-12-22 193747]

This is one such example of time to first token (over 5 minutes). The llama runner took a long time to start... I will fit in all the logs I can, from first to last... I know my Pipelines throws an error, but it isn't related to the poor inference. This happens with or without pipelines in my configuration.

```
2024-12-22 19:32:21 open-webui | {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'messages': [{'role': 'user', 'content': '### Task:\nYou are an autocompletion system. Continue the text in based on the **completion type** inand the given language. \n\n### **Instructions**:\n1. Analyzefor context and meaning. \n2. Useto guide your output: \n - **General**: Provide a natural, concise continuation. \n - **Search Query**: Complete as if generating a realistic search query. \n3. Start as if you are directly continuing. Do **not** repeat, paraphrase, or respond as a model. Simply complete the text. \n4. Ensure the continuation:\n - Flows naturally from . \n - Avoids repetition, overexplaining, or unrelated ideas. \n5. If unsure, return: { "text": "" }. \n\n### **Output Rules**:\n- Respond only in JSON format: { "text": "<your_completion>" }.\n\n### **Examples**:\n#### Example 1: \nInput: \n<type>General</type> \n<text>The sun was setting over the horizon, painting the sky</text> \nOutput: \n{ "text": "with vibrant shades of orange and pink." }\n\n#### Example 2: \nInput: \n<type>Search Query</type> \n<text>Top-rated restaurants in</text> \nOutput: \n{ "text": "New York City for Italian cuisine." } \n\n---\n### Context:\n<chat_history>\n\n</chat_history>\n<type>search query</type> \n<text>Homer, talk to me about </text> \n#### Output:\n'}], 'stream': False, 'metadata': {'task': 'autocomplete_generation', 'task_body': {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'prompt': 'Homer, talk to me about ', 'type': 'search query', 'stream': False}, 'chat_id': None}}
2024-12-22 19:32:21 open-webui | INFO: 127.0.0.1:57710 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.161Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.083582445 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.411Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.333496096 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 33289"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.647Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.661Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.583968982 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6
2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:32:24 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:32:24 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:33289"
2024-12-22 19:32:24 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free
2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.878Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2024-12-22 19:32:24 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest))
2024-12-22 19:32:24 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 1: general.type str = model
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 6: general.size_label str = 7B
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"]
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/...
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:32:24 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:32:24 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:32:25 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:32:25 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:32:25 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:32:25 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:32:25 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:32:25 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:32:25 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:32:25 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:32:25 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:32:25 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:32:25 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:32:25 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:32:25 ollama | llm_load_print_meta: max token length = 256
2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/new HTTP/1.1" 200 OK
2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:32:25 open-webui | INFO [open_webui.apps.openai.main] get_all_models()
2024-12-22 19:32:25 pipelines | INFO: 172.18.0.1:56674 - "GET /models HTTP/1.1" 200 OK
2024-12-22 19:32:26 open-webui | INFO [open_webui.apps.ollama.main] get_all_models()
2024-12-22 19:32:26 ollama | [GIN] 2024/12/23 - 01:32:26 | 200 | 77.050597ms | 172.18.0.5 | GET "/api/tags"
2024-12-22 19:32:27 pipelines | pipe:blueprints.function_calling_blueprint
2024-12-22 19:32:27 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'}
2024-12-22 19:32:27 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions
2024-12-22 19:32:27 pipelines | INFO: 172.18.0.1:37230 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK
2024-12-22 19:32:28 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434
2024-12-22 19:32:51 open-webui | Fetching models from https://api.mistral.ai/v1/models
2024-12-22 19:32:51 open-webui | INFO: 127.0.0.1:42742 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:33:21 open-webui | INFO: 127.0.0.1:36904 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:33:51 open-webui | INFO: 127.0.0.1:58296 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:34:21 open-webui | INFO: 127.0.0.1:60204 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:34:51 open-webui | INFO: 127.0.0.1:43350 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:35:11 ollama | llm_load_tensors: offloading 28 repeating layers to GPU
2024-12-22 19:35:11 ollama | llm_load_tensors: offloaded 28/29 layers to GPU
2024-12-22 19:35:11 ollama | llm_load_tensors: CPU_Mapped model buffer size = 852.73 MiB
2024-12-22 19:35:11 ollama | llm_load_tensors: CUDA0 model buffer size = 5106.06 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_seq_max = 1
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx = 8192
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_batch = 512
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ubatch = 512
2024-12-22 19:35:12 ollama | llama_new_context_with_model: flash_attn = 0
2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_base = 1000000.0
2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_scale = 1
2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
2024-12-22 19:35:12 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CPU output buffer size = 0.59 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB
2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph nodes = 986
2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph splits = 4 (with bs=512), 3 (with bs=1)
2024-12-22 19:35:13 ollama | time=2024-12-23T01:35:13.002Z level=INFO source=server.go:594 msg="llama runner started in 168.39 seconds"
2024-12-22 19:35:15 ollama | [GIN] 2024/12/23 - 01:35:15 | 200 | 2m56s | 172.18.0.5 | POST "/api/chat"
2024-12-22 19:35:15 open-webui | INFO: 172.18.0.1:57508 - "POST /api/task/auto/completions HTTP/1.1" 200 OK
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.743Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.744Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=145 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.747Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 39149"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.904Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:35:20 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:35:20 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=128
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z
```
level=INFO source=.:0 msg="Server listening on 127.0.0.1:39149" 2024-12-22 19:35:21 ollama | time=2024-12-23T01:35:21.000Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:35:21 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:35:21 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:35:21 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318 2024-12-22 19:35:21 ollama | llama_model_loader: - type f32: 141 tensors 2024-12-22 19:35:21 ollama | llama_model_loader: - type q6_K: 198 tensors 2024-12-22 19:35:21 open-webui | INFO: 127.0.0.1:55554 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:35:21 ollama | llm_load_vocab: special tokens cache size = 22 2024-12-22 19:35:21 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB 2024-12-22 19:35:21 ollama | llm_load_print_meta: format = GGUF V3 (latest) 2024-12-22 19:35:21 ollama | llm_load_print_meta: arch = qwen2 2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab type = BPE 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_vocab = 152064 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_merges = 151387 2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab_only = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_train = 32768 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd = 3584 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_layer = 28 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head = 28 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head_kv = 4 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_rot = 128 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_swa = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_k = 128 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_v = 128 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_gqa = 7 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_k_gqa = 512 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_v_gqa = 512 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00 2024-12-22 19:35:21 ollama | 
llm_load_print_meta: f_norm_rms_eps = 1.0e-06 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-12-22 19:35:21 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ff = 18944 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert_used = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: causal attn = 1 2024-12-22 19:35:21 ollama | llm_load_print_meta: pooling type = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: rope type = 2 2024-12-22 19:35:21 ollama | llm_load_print_meta: rope scaling = linear 2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_base_train = 1000000.0 2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_scale_train = 1 2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768 2024-12-22 19:35:21 ollama | llm_load_print_meta: rope_finetuned = unknown 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_conv = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_inner = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_state = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_rank = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0 2024-12-22 19:35:21 ollama | llm_load_print_meta: model type = 7B 2024-12-22 19:35:21 ollama | llm_load_print_meta: model ftype = Q6_K 2024-12-22 19:35:21 ollama | llm_load_print_meta: model params = 7.62 B 2024-12-22 19:35:21 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW) 2024-12-22 19:35:21 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>' 
2024-12-22 19:35:21 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>' 2024-12-22 19:35:21 ollama | llm_load_print_meta: max token length = 256 2024-12-22 19:35:51 open-webui | INFO: 127.0.0.1:36152 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:36:21 open-webui | INFO: 127.0.0.1:58318 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:36:51 open-webui | INFO: 127.0.0.1:46402 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:37:21 open-webui | INFO: 127.0.0.1:39014 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:37:22 ollama | llm_load_tensors: offloading 28 repeating layers to GPU 2024-12-22 19:37:22 ollama | llm_load_tensors: offloading output layer to GPU 2024-12-22 19:37:22 ollama | llm_load_tensors: offloaded 29/29 layers to GPU 2024-12-22 19:37:22 ollama | llm_load_tensors: CPU_Mapped model buffer size = 426.36 MiB 2024-12-22 19:37:22 ollama | llm_load_tensors: CUDA0 model buffer size = 5532.43 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_seq_max = 1 
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx = 8192 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_batch = 512 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ubatch = 512 2024-12-22 19:37:24 ollama | llama_new_context_with_model: flash_attn = 0 2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_base = 1000000.0 2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_scale = 1 2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized 2024-12-22 19:37:24 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 492.00 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB 2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph nodes = 986 2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph splits = 2 2024-12-22 19:37:24 ollama | time=2024-12-23T01:37:24.223Z level=INFO source=server.go:594 msg="llama runner started in 123.48 seconds" 2024-12-22 19:37:24 open-webui | INFO: 172.18.0.1:57524 - "POST /ollama/api/chat HTTP/1.1" 200 OK 2024-12-22 19:37:31 ollama | [GIN] 2024/12/23 - 01:37:31 | 200 | 5m3s | 172.18.0.5 | POST "/api/chat" 2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK 2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK 2024-12-22 19:37:31 open-webui | INFO [open_webui.apps.openai.main] 
get_all_models() 2024-12-22 19:37:31 pipelines | INFO: 172.18.0.1:49478 - "GET /models HTTP/1.1" 200 OK 2024-12-22 19:37:32 open-webui | INFO [open_webui.apps.ollama.main] get_all_models() 2024-12-22 19:37:32 ollama | [GIN] 2024/12/23 - 01:37:32 | 200 | 63.627221ms | 172.18.0.5 | GET "/api/tags" 2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49494 - "POST /function_calling_scaffold/filter/outlet HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | Fetching models from https://api.mistral.ai/v1/models 2024-12-22 19:37:34 open-webui | <Encoding 'o200k_base'> 2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/chat/completed HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK 2024-12-22 19:37:34 pipelines | pipe:blueprints.function_calling_blueprint 2024-12-22 19:37:34 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'} 2024-12-22 19:37:34 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions 2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49500 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK 2024-12-22 19:37:34 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.076Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.190654251 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.326Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.440366167 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.576Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.690480729 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.584Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 46047" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=sched.go:449 msg="loaded runners" count=1 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.749Z level=INFO source=runner.go:945 msg="starting go runner" 2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2024-12-22 19:37:40 ollama | 
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 2024-12-22 19:37:40 ollama | ggml_cuda_init: found 1 CUDA devices: 2024-12-22 19:37:40 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:46047" 2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.837Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:37:40 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:37:41 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:37:41 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196 2024-12-22 19:37:41 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318 2024-12-22 19:37:41 ollama | llama_model_loader: - type f32: 141 tensors 2024-12-22 19:37:41 ollama | llama_model_loader: - type q6_K: 198 tensors 2024-12-22 19:37:41 ollama | llm_load_vocab: special tokens cache size = 22 2024-12-22 19:37:41 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB 2024-12-22 19:37:41 ollama | llm_load_print_meta: format = GGUF V3 (latest) 2024-12-22 19:37:41 ollama | llm_load_print_meta: arch = qwen2 2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab type = BPE 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_vocab = 152064 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_merges = 151387 2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab_only = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_train = 32768 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd = 3584 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_layer = 28 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head = 28 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head_kv = 4 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_rot = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_swa = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_k = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_v = 128 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_gqa = 7 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_k_gqa = 512 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_v_gqa = 512 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06 2024-12-22 19:37:41 ollama | 
llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ff = 18944 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert_used = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: causal attn = 1 2024-12-22 19:37:41 ollama | llm_load_print_meta: pooling type = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope type = 2 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope scaling = linear 2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_base_train = 1000000.0 2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_scale_train = 1 2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768 2024-12-22 19:37:41 ollama | llm_load_print_meta: rope_finetuned = unknown 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_conv = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_inner = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_state = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_rank = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0 2024-12-22 19:37:41 ollama | llm_load_print_meta: model type = 7B 2024-12-22 19:37:41 ollama | llm_load_print_meta: model ftype = Q6_K 2024-12-22 19:37:41 ollama | llm_load_print_meta: model params = 7.62 B 2024-12-22 19:37:41 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW) 2024-12-22 19:37:41 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:37:41 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: PAD token = 151643 
'<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>' 2024-12-22 19:37:41 ollama | llm_load_print_meta: max token length = 256

Here are the parts of the logs that look relevant to my admittedly untrained eye:

2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
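If it helps anyone eyeball these warnings, here is a small sketch (my own helper, not part of Ollama) that pulls the reported wait duration out of each `sched.go` "VRAM usage didn't recover" line, so you can see whether the waits keep growing:

```python
import re

# Hypothetical helper: extract the wait duration (seconds=...) from each
# "gpu VRAM usage didn't recover within timeout" warning emitted by sched.go.
WARN_RE = re.compile(
    r'msg="gpu VRAM usage didn\'t recover within timeout" seconds=([0-9.]+) model=(\S+)'
)

def vram_recovery_waits(log_text: str) -> list[float]:
    """Return the reported wait in seconds for each warning line, in order."""
    return [float(m.group(1)) for m in WARN_RE.finditer(log_text)]

sample = (
    'time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 '
    'msg="gpu VRAM usage didn\'t recover within timeout" '
    'seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab9\n'
    'time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 '
    'msg="gpu VRAM usage didn\'t recover within timeout" '
    'seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab9\n'
)
print(vram_recovery_waits(sample))  # → [5.159822309, 5.659698822]
```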

I also noticed that the llama runner takes an inordinate amount of time to start ("llama runner started in 168.39 seconds" appears further down in the logs).
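For comparing the latencies the logs report (the `[GIN]` lines print Go-style durations like `2m56s`, `5m3s`, and `77.050597ms`), a quick throwaway converter to seconds, written by me for this thread:

```python
import re

# Hypothetical helper: convert Go-style duration strings from the [GIN]
# request log (e.g. "2m56s", "5m3s", "77.050597ms") into seconds.
UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}

def go_duration_to_seconds(d: str) -> float:
    total = 0.0
    # "ms" must be tried before "m" and "s" in the alternation.
    for value, unit in re.findall(r'([0-9.]+)(ms|h|m|s)', d):
        total += float(value) * UNIT_SECONDS[unit]
    return total

print(go_duration_to_seconds("5m3s"))   # → 303.0
print(go_duration_to_seconds("2m56s"))  # → 176.0
```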

I'm not sure whether any of this is helpful, but I've been racking my brain trying to figure it out. As I said at the top, I run my configuration through a .yaml in Docker Compose; I can upload the .yaml if that would help.
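For reference, this is roughly the shape of Compose file involved (a minimal sketch, not my actual file; service names and volume names are placeholders). Setting `OLLAMA_DEBUG=1` turns on the verbose server logging that would have helped here:

```yaml
# Minimal sketch (assumed names/values, not the poster's actual file):
# an Ollama service with NVIDIA GPU access and debug logging enabled.
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    environment:
      - OLLAMA_DEBUG=1          # verbose server logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama:
```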

<!-- gh-comment-id:2558739336 --> @clduab11 commented on GitHub (Dec 23, 2024): For what it's worth, I'm running CUDA version 12.7 (I think NVIDIA's site said my vGPU doesn't support the licensing system, though running `nvidia-smi -q` didn't show any vGPU information for me). I wanted to throw my hat in the ring and say I'm seeing very wonky inference times that I did not see in previous versions, and wondered whether it might be related to this issue. I'll do my best to provide full logs. I launch with a docker-compose.yaml (which unfortunately doesn't have debug mode enabled). ![Screenshot 2024-12-22 193747](https://github.com/user-attachments/assets/6459d594-b2a7-4054-abff-3e9cf9220e92) This is one example of time to first token taking over 5 minutes. The llama runner also took a long time to start. I'll fit in all the logs I can, from first to last. I know my Pipelines setup throws an error, but it isn't related to the poor inference; this happens with or without Pipelines in my configuration. `2024-12-22 19:32:21 open-webui | {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'messages': [{'role': 'user', 'content': '### Task:\nYou are an autocompletion system. Continue the text in `<text>` based on the **completion type** in `<type>` and the given language. \n\n### **Instructions**:\n1. Analyze `<text>` for context and meaning. \n2. Use `<type>` to guide your output: \n - **General**: Provide a natural, concise continuation. \n - **Search Query**: Complete as if generating a realistic search query. \n3. Start as if you are directly continuing `<text>`. Do **not** repeat, paraphrase, or respond as a model. Simply complete the text. \n4. Ensure the continuation:\n - Flows naturally from `<text>`. \n - Avoids repetition, overexplaining, or unrelated ideas. \n5. If unsure, return: `{ "text": "" }`. 
\n\n### **Output Rules**:\n- Respond only in JSON format: `{ "text": "<your_completion>" }`.\n\n### **Examples**:\n#### Example 1: \nInput: \n<type>General</type> \n<text>The sun was setting over the horizon, painting the sky</text> \nOutput: \n{ "text": "with vibrant shades of orange and pink." }\n\n#### Example 2: \nInput: \n<type>Search Query</type> \n<text>Top-rated restaurants in</text> \nOutput: \n{ "text": "New York City for Italian cuisine." } \n\n---\n### Context:\n<chat_history>\n\n</chat_history>\n<type>search query</type> \n<text>Homer, talk to me about </text> \n#### Output:\n'}], 'stream': False, 'metadata': {'task': 'autocomplete_generation', 'task_body': {'model': 'hf.co/mradermacher/HomerCreativeAnvita-Mix-Qw7B-i1-GGUF:Q6_K', 'prompt': 'Homer, talk to me about ', 'type': 'search query', 'stream': False}, 'chat_id': None}} 2024-12-22 19:32:21 open-webui | INFO: 127.0.0.1:57710 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.161Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.083582445 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.411Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.333496096 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 
GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.626Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 33289" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=sched.go:449 msg="loaded runners" count=1 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.627Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.647Z level=INFO source=runner.go:945 msg="starting go runner" 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.661Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.583968982 model=/root/.ollama/models/blobs/sha256-3b70c65c6448a92a2419fee421689daf69dc85e3df83e54aef73de319c1f4ff6 2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2024-12-22 19:32:24 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 2024-12-22 19:32:24 ollama | ggml_cuda_init: found 1 CUDA devices: 2024-12-22 19:32:24 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8 2024-12-22 
19:32:24 ollama | time=2024-12-23T01:32:24.680Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:33289" 2024-12-22 19:32:24 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:32:24 ollama | time=2024-12-23T01:32:24.878Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:32:24 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:32:24 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
2024-12-22 19:32:24 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196 2024-12-22 19:32:24 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318 2024-12-22 19:32:24 ollama | llama_model_loader: - type f32: 141 tensors 2024-12-22 19:32:24 ollama | llama_model_loader: - type q6_K: 198 tensors 2024-12-22 19:32:25 ollama | llm_load_vocab: special tokens cache size = 22 2024-12-22 19:32:25 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB 2024-12-22 19:32:25 ollama | llm_load_print_meta: format = GGUF V3 (latest) 2024-12-22 19:32:25 ollama | llm_load_print_meta: arch = qwen2 2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab type = BPE 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_vocab = 152064 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_merges = 151387 2024-12-22 19:32:25 ollama | llm_load_print_meta: vocab_only = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_train = 32768 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd = 3584 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_layer = 28 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head = 28 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_head_kv = 4 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_rot = 128 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_swa = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_k = 128 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_head_v = 128 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_gqa = 7 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_k_gqa = 512 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_embd_v_gqa = 512 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06 2024-12-22 19:32:25 ollama | 
llm_load_print_meta: f_clamp_kqv = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ff = 18944 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_expert_used = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: causal attn = 1 2024-12-22 19:32:25 ollama | llm_load_print_meta: pooling type = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: rope type = 2 2024-12-22 19:32:25 ollama | llm_load_print_meta: rope scaling = linear 2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_base_train = 1000000.0 2024-12-22 19:32:25 ollama | llm_load_print_meta: freq_scale_train = 1 2024-12-22 19:32:25 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768 2024-12-22 19:32:25 ollama | llm_load_print_meta: rope_finetuned = unknown 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_conv = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_inner = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_d_state = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_rank = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0 2024-12-22 19:32:25 ollama | llm_load_print_meta: model type = 7B 2024-12-22 19:32:25 ollama | llm_load_print_meta: model ftype = Q6_K 2024-12-22 19:32:25 ollama | llm_load_print_meta: model params = 7.62 B 2024-12-22 19:32:25 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW) 2024-12-22 19:32:25 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:32:25 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: PAD token = 151643 
'<|endoftext|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>' 2024-12-22 19:32:25 ollama | llm_load_print_meta: max token length = 256 2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/new HTTP/1.1" 200 OK 2024-12-22 19:32:25 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK 2024-12-22 19:32:25 open-webui | INFO [open_webui.apps.openai.main] get_all_models() 2024-12-22 19:32:25 pipelines | INFO: 172.18.0.1:56674 - "GET /models HTTP/1.1" 200 OK 2024-12-22 19:32:26 open-webui | INFO [open_webui.apps.ollama.main] get_all_models() 2024-12-22 19:32:26 ollama | [GIN] 2024/12/23 - 01:32:26 | 200 | 77.050597ms | 172.18.0.5 | GET "/api/tags" 2024-12-22 19:32:27 pipelines | pipe:blueprints.function_calling_blueprint 2024-12-22 19:32:27 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'} 2024-12-22 19:32:27 pipelines | Error: 400 Client Error: Bad Request for url: 
https://api.openai.com/v1/chat/completions 2024-12-22 19:32:27 pipelines | INFO: 172.18.0.1:37230 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK 2024-12-22 19:32:28 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434 2024-12-22 19:32:51 open-webui | Fetching models from https://api.mistral.ai/v1/models 2024-12-22 19:32:51 open-webui | INFO: 127.0.0.1:42742 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:33:21 open-webui | INFO: 127.0.0.1:36904 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:33:51 open-webui | INFO: 127.0.0.1:58296 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:34:21 open-webui | INFO: 127.0.0.1:60204 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:34:51 open-webui | INFO: 127.0.0.1:43350 - "GET / HTTP/1.1" 200 OK 2024-12-22 19:35:11 ollama | llm_load_tensors: offloading 28 repeating layers to GPU 2024-12-22 19:35:11 ollama | llm_load_tensors: offloaded 28/29 layers to GPU 2024-12-22 19:35:11 ollama | llm_load_tensors: CPU_Mapped model buffer size = 852.73 MiB 2024-12-22 19:35:11 ollama | llm_load_tensors: CUDA0 model buffer size = 5106.06 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_seq_max = 1 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx = 8192 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_batch = 512 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ubatch = 512 2024-12-22 19:35:12 ollama | llama_new_context_with_model: flash_attn = 0 2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_base = 1000000.0 2024-12-22 19:35:12 ollama | llama_new_context_with_model: freq_scale = 1 2024-12-22 19:35:12 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized 2024-12-22 19:35:12 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: KV self size = 
448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: CPU output buffer size = 0.59 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 730.36 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB 2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph nodes = 986 2024-12-22 19:35:12 ollama | llama_new_context_with_model: graph splits = 4 (with bs=512), 3 (with bs=1) 2024-12-22 19:35:13 ollama | time=2024-12-23T01:35:13.002Z level=INFO source=server.go:594 msg="llama runner started in 168.39 seconds" 2024-12-22 19:35:15 ollama | [GIN] 2024/12/23 - 01:35:15 | 200 | 2m56s | 172.18.0.5 | POST "/api/chat" 2024-12-22 19:35:15 open-webui | INFO: 172.18.0.1:57508 - "POST /api/task/auto/completions HTTP/1.1" 200 OK 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.743Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.744Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=145 layers.model=29 
layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.747Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 39149" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=sched.go:449 msg="loaded runners" count=1 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.748Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error" 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.904Z level=INFO source=runner.go:945 msg="starting go runner" 2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no 2024-12-22 19:35:20 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no 2024-12-22 19:35:20 ollama | ggml_cuda_init: found 1 CUDA devices: 2024-12-22 19:35:20 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=128 2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.976Z 
level=INFO source=.:0 msg="Server listening on 127.0.0.1:39149" 2024-12-22 19:35:21 ollama | time=2024-12-23T01:35:21.000Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model" 2024-12-22 19:35:21 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free 2024-12-22 19:35:21 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest)) 2024-12-22 19:35:21 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 1: general.type str = model 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 6: general.size_label str = 7B 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"] 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",... 
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom... 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf 2024-12-22 19:35:21 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/... 
`

```
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:35:21 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:35:21 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:35:21 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:35:21 open-webui | INFO: 127.0.0.1:55554 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:35:21 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:35:21 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:35:21 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:35:21 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:35:21 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:35:21 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:35:21 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:35:21 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:35:21 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:35:21 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:35:21 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:35:21 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:35:21 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:35:21 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:35:21 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:35:21 ollama | llm_load_print_meta: max token length = 256
2024-12-22 19:35:51 open-webui | INFO: 127.0.0.1:36152 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:36:21 open-webui | INFO: 127.0.0.1:58318 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:36:51 open-webui | INFO: 127.0.0.1:46402 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:37:21 open-webui | INFO: 127.0.0.1:39014 - "GET / HTTP/1.1" 200 OK
2024-12-22 19:37:22 ollama | llm_load_tensors: offloading 28 repeating layers to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: offloading output layer to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: offloaded 29/29 layers to GPU
2024-12-22 19:37:22 ollama | llm_load_tensors: CPU_Mapped model buffer size = 426.36 MiB
2024-12-22 19:37:22 ollama | llm_load_tensors: CUDA0 model buffer size = 5532.43 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_seq_max = 1
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx = 8192
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq = 8192
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_batch = 512
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ubatch = 512
2024-12-22 19:37:24 ollama | llama_new_context_with_model: flash_attn = 0
2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_base = 1000000.0
2024-12-22 19:37:24 ollama | llama_new_context_with_model: freq_scale = 1
2024-12-22 19:37:24 ollama | llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
2024-12-22 19:37:24 ollama | llama_kv_cache_init: CUDA0 KV buffer size = 448.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: KV self size = 448.00 MiB, K (f16): 224.00 MiB, V (f16): 224.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host output buffer size = 0.59 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA0 compute buffer size = 492.00 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: CUDA_Host compute buffer size = 23.01 MiB
2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph nodes = 986
2024-12-22 19:37:24 ollama | llama_new_context_with_model: graph splits = 2
2024-12-22 19:37:24 ollama | time=2024-12-23T01:37:24.223Z level=INFO source=server.go:594 msg="llama runner started in 123.48 seconds"
2024-12-22 19:37:24 open-webui | INFO: 172.18.0.1:57524 - "POST /ollama/api/chat HTTP/1.1" 200 OK
2024-12-22 19:37:31 ollama | [GIN] 2024/12/23 - 01:37:31 | 200 | 5m3s | 172.18.0.5 | POST "/api/chat"
2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK
2024-12-22 19:37:31 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:37:31 open-webui | INFO [open_webui.apps.openai.main] get_all_models()
2024-12-22 19:37:31 pipelines | INFO: 172.18.0.1:49478 - "GET /models HTTP/1.1" 200 OK
2024-12-22 19:37:32 open-webui | INFO [open_webui.apps.ollama.main] get_all_models()
2024-12-22 19:37:32 ollama | [GIN] 2024/12/23 - 01:37:32 | 200 | 63.627221ms | 172.18.0.5 | GET "/api/tags"
2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49494 - "POST /function_calling_scaffold/filter/outlet HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | Fetching models from https://api.mistral.ai/v1/models
2024-12-22 19:37:34 open-webui | <Encoding 'o200k_base'>
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/chat/completed HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "POST /api/v1/chats/bfc9c299-7266-4e71-b434-ed75b9ee3c5a HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO: 172.18.0.1:57524 - "GET /api/v1/chats/?page=1 HTTP/1.1" 200 OK
2024-12-22 19:37:34 pipelines | pipe:blueprints.function_calling_blueprint
2024-12-22 19:37:34 pipelines | {'id': '9b978aa8-4155-47c3-9e9a-4dac714d9078', 'email': 'chrisldukes@gmail.com', 'name': 'Chris Dukes', 'role': 'admin'}
2024-12-22 19:37:34 pipelines | Error: 400 Client Error: Bad Request for url: https://api.openai.com/v1/chat/completions
2024-12-22 19:37:34 pipelines | INFO: 172.18.0.1:49500 - "POST /function_calling_scaffold/filter/inlet HTTP/1.1" 200 OK
2024-12-22 19:37:34 open-webui | INFO [open_webui.apps.ollama.main] url: http://ollama:11434
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.076Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.190654251 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.326Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.440366167 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.576Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.690480729 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=server.go:104 msg="system memory" total="23.4 GiB" free="16.2 GiB" free_swap="6.0 GiB"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.580Z level=INFO source=memory.go:356 msg="offload to cuda" layers.requested=-1 layers.model=29 layers.offload=28 layers.split="" memory.available="[6.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="7.2 GiB" memory.required.partial="6.8 GiB" memory.required.kv="448.0 MiB" memory.required.allocations="[6.8 GiB]" memory.weights.total="5.4 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="426.4 MiB" memory.graph.full="478.0 MiB" memory.graph.partial="730.4 MiB"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.584Z level=INFO source=server.go:376 msg="starting llama server" cmd="/usr/lib/ollama/runners/cuda_v12_avx/ollama_llama_server runner --model /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d --ctx-size 8192 --batch-size 512 --n-gpu-layers 28 --threads 8 --parallel 1 --port 46047"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=sched.go:449 msg="loaded runners" count=1
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:555 msg="waiting for llama runner to start responding"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.585Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server error"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.749Z level=INFO source=runner.go:945 msg="starting go runner"
2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
2024-12-22 19:37:40 ollama | ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2024-12-22 19:37:40 ollama | ggml_cuda_init: found 1 CUDA devices:
2024-12-22 19:37:40 ollama | Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=runner.go:946 msg=system info="CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | LLAMAFILE = 1 | AARCH64_REPACK = 1 | cgo(gcc)" threads=8
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.823Z level=INFO source=.:0 msg="Server listening on 127.0.0.1:46047"
2024-12-22 19:37:40 ollama | time=2024-12-23T01:37:40.837Z level=INFO source=server.go:589 msg="waiting for server to become available" status="llm server loading model"
2024-12-22 19:37:40 ollama | llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4060 Ti) - 7065 MiB free
2024-12-22 19:37:41 ollama | llama_model_loader: loaded meta data with 46 key-value pairs and 339 tensors from /root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d (version GGUF V3 (latest))
2024-12-22 19:37:41 ollama | llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 0: general.architecture str = qwen2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 1: general.type str = model
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 2: general.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 3: general.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 4: general.finetune str = HomerCreative-Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 5: general.basename str = Qwen2.5
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 6: general.size_label str = 7B
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 7: general.base_model.count u32 = 2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 7B HomerAnvita NerdMix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 9: general.base_model.0.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 11: general.base_model.1.name str = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 12: general.base_model.1.organization str = ZeroXClem
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 13: general.base_model.1.repo_url str = https://huggingface.co/ZeroXClem/Qwen...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 14: general.tags arr[str,2] = ["mergekit", "merge"]
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 15: qwen2.block_count u32 = 28
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 16: qwen2.context_length u32 = 32768
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 17: qwen2.embedding_length u32 = 3584
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 18: qwen2.feed_forward_length u32 = 18944
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 19: qwen2.attention.head_count u32 = 28
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 20: qwen2.attention.head_count_kv u32 = 4
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 21: qwen2.rope.freq_base f32 = 1000000.000000
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 22: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 23: general.file_type u32 = 18
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 25: tokenizer.ggml.pre str = qwen2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 29: tokenizer.ggml.eos_token_id u32 = 151645
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 151643
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 31: tokenizer.ggml.bos_token_id u32 = 151643
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 32: tokenizer.ggml.add_bos_token bool = false
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 33: tokenizer.chat_template str = {%- if tools %}\n    {{- '<|im_start|>...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 34: general.quantization_version u32 = 2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 35: general.url str = https://huggingface.co/mradermacher/H...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 36: mradermacher.quantize_version str = 2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 37: mradermacher.quantized_by str = mradermacher
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 38: mradermacher.quantized_at str = 2024-11-23T07:32:30+01:00
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 39: mradermacher.quantized_on str = db2
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 40: general.source.url str = https://huggingface.co/suayptalha/Hom...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 41: mradermacher.convert_type str = hf
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 42: quantize.imatrix.file str = HomerCreativeAnvita-Mix-Qw7B-i1-GGUF/...
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 43: quantize.imatrix.dataset str = imatrix-training-full-3
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 44: quantize.imatrix.entries_count i32 = 196
2024-12-22 19:37:41 ollama | llama_model_loader: - kv 45: quantize.imatrix.chunks_count i32 = 318
2024-12-22 19:37:41 ollama | llama_model_loader: - type f32: 141 tensors
2024-12-22 19:37:41 ollama | llama_model_loader: - type q6_K: 198 tensors
2024-12-22 19:37:41 ollama | llm_load_vocab: special tokens cache size = 22
2024-12-22 19:37:41 ollama | llm_load_vocab: token to piece cache size = 0.9310 MB
2024-12-22 19:37:41 ollama | llm_load_print_meta: format = GGUF V3 (latest)
2024-12-22 19:37:41 ollama | llm_load_print_meta: arch = qwen2
2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab type = BPE
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_vocab = 152064
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_merges = 151387
2024-12-22 19:37:41 ollama | llm_load_print_meta: vocab_only = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_train = 32768
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd = 3584
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_layer = 28
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head = 28
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_head_kv = 4
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_rot = 128
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_swa = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_k = 128
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_head_v = 128
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_gqa = 7
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_k_gqa = 512
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_embd_v_gqa = 512
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_eps = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_norm_rms_eps = 1.0e-06
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_clamp_kqv = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: f_logit_scale = 0.0e+00
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ff = 18944
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_expert_used = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: causal attn = 1
2024-12-22 19:37:41 ollama | llm_load_print_meta: pooling type = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: rope type = 2
2024-12-22 19:37:41 ollama | llm_load_print_meta: rope scaling = linear
2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_base_train = 1000000.0
2024-12-22 19:37:41 ollama | llm_load_print_meta: freq_scale_train = 1
2024-12-22 19:37:41 ollama | llm_load_print_meta: n_ctx_orig_yarn = 32768
2024-12-22 19:37:41 ollama | llm_load_print_meta: rope_finetuned = unknown
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_conv = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_inner = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_d_state = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_rank = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: ssm_dt_b_c_rms = 0
2024-12-22 19:37:41 ollama | llm_load_print_meta: model type = 7B
2024-12-22 19:37:41 ollama | llm_load_print_meta: model ftype = Q6_K
2024-12-22 19:37:41 ollama | llm_load_print_meta: model params = 7.62 B
2024-12-22 19:37:41 ollama | llm_load_print_meta: model size = 5.82 GiB (6.56 BPW)
2024-12-22 19:37:41 ollama | llm_load_print_meta: general.name = Qwen2.5 7B HomerCreative Mix
2024-12-22 19:37:41 ollama | llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOS token = 151645 '<|im_end|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOT token = 151645 '<|im_end|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: LF token = 148848 'ÄĬ'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151645 '<|im_end|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
2024-12-22 19:37:41 ollama | llm_load_print_meta: max token length = 256
```

Parts of logs I feel could be relevant to my noob eyes?
```
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.219Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.159822309 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.469Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.4100184989999995 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
2024-12-22 19:35:20 ollama | time=2024-12-23T01:35:20.719Z level=WARN source=sched.go:646 msg="gpu VRAM usage didn't recover within timeout" seconds=5.659698822 model=/root/.ollama/models/blobs/sha256-8ab964a02bc84c985039e98ff965ab87fbb4c139d8c88901aab62d2e4f20eb2d
```

I also thought somewhere I saw the llama runner take an inordinate amount of time? I'm not sure if any of this is helpful or not, but I've been racking my brains trying to figure it out. As I said at the top, I run my configuration through a .yaml in Docker Compose. I can upload the .yaml if it's helpful.

@rick-github commented on GitHub (Dec 23, 2024):

Could you either add block markdown markers around the logs (```) or add the logs as an attachment? It's very difficult to parse the logs as they are.

<!-- gh-comment-id:2558747029 -->

@rick-github commented on GitHub (Dec 23, 2024):

Actually, I see that there is some sort of block; it seems that the text inside is badly formatted. Adding it as an attachment would help a lot.

<!-- gh-comment-id:2558747977 -->

@rick-github commented on GitHub (Dec 23, 2024):

From your screenshot, the model generated a very respectable 36 tokens per second, but the overall response took just over 5 minutes. Logs will show for sure, but it seems most of this time was spent loading the model. If you haven't set `OLLAMA_KEEP_ALIVE` in your docker compose file, then ollama will unload a model after 5 minutes of inactivity. This may lead to the "wonky inference times" you mention - the first inference takes 5 minutes because the model needs to load, a second inference takes 7 seconds, you leave it for 10 minutes and the third inference takes 5 minutes because the model has been evicted.
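For reference, a minimal compose fragment showing where `OLLAMA_KEEP_ALIVE` goes; the service name, image tag, and the 30-minute duration are illustrative, not from this thread:

```yaml
services:
  ollama:
    image: ollama/ollama
    environment:
      # Keep loaded models resident for 30 minutes of inactivity
      # instead of the 5-minute default; "-1" keeps them loaded indefinitely.
      - OLLAMA_KEEP_ALIVE=30m
```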

<!-- gh-comment-id:2558759033 -->

@clduab11 commented on GitHub (Dec 23, 2024):

Thanks so much for the response @rick-github! My apologies, I'm definitely still very new to GitHub so I'll try to make this easier...

First of all, I'm sure this doesn't have a lot to do with it...but my Watchtower is included in my .yaml, and this is the current Ollama version I have...

![image](https://github.com/user-attachments/assets/8ed479a6-604b-4dca-aa5e-0fde49e5de28)

Otherwise, you're absolutely correct; it's definitely the model loading that's taking the longest. I wish I had prior logs from older versions, but my initial model load used to never be very long in the 0.4.x version(s).

I've done some testing this AM and this is what I've noticed...

![Screenshot 2024-12-23 075747](https://github.com/user-attachments/assets/996100f1-6c37-4716-9be7-cb5ae17c6f9c)

This is the data for the first inference, indicative of model load. However, when going to prompt my model a second time (immediately after the first output fully generated)... these were my results...

![Screenshot 2024-12-23 080244](https://github.com/user-attachments/assets/a39087ab-d019-42d0-aa2c-3f7f303ba309)

I noticed while watching my logs in Docker that it almost appeared as if... for lack of a better description, re-inferencing? It went through similar mechanisms twice before generating the second follow-up output (the screenshot directly above).

Here's some .txt's for my logs...One shows the logs between the 2nd input in -> 2nd output out, and one shows the full logs from the moment the first output generated -> 2nd output out (if that makes sense, so sorry if that's poorly phrased!)

[second-inf-logs.txt](https://github.com/user-attachments/files/18230164/second-inf-logs.txt)
[first-inference-to-end-of-2nd.txt](https://github.com/user-attachments/files/18230165/first-inference-to-end-of-2nd.txt)

<!-- gh-comment-id:2559776963 -->

@rick-github commented on GitHub (Dec 23, 2024):

It's not re-inferencing, open-webui has a couple of features which result in multiple calls to an LLM API for a single inference. The first is the summary generation, where open-webui takes the first response in a session and asks the LLM to summarize it so that it can add it to the chat list on the left hand panel. The second is auto-complete, where open-webui takes the text you've typed in and asks the LLM to guess what you are going to type, to autocomplete the prompt.

I think these are playing into the delays you are seeing because the size of context window keeps changing:

```
time=2024-12-23T13:53:30.207Z msg="starting llama server" cmd=" ... --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 38201"
time=2024-12-23T13:57:45.534Z msg="starting llama server" cmd=" ... --ctx-size 2048 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 39467"
time=2024-12-23T14:00:02.048Z msg="starting llama server" cmd=" ... --ctx-size 8192 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 41171"
time=2024-12-23T14:02:37.657Z msg="starting llama server" cmd=" ... --ctx-size 2048 --batch-size 512 --n-gpu-layers 145 --threads 128 --parallel 1 --port 42163"
```

I think what's happening is that you have a context window of 8192 configured somewhere (in the model with `PARAMETER num_ctx` or in open-webui somewhere) and open-webui uses that for a completion, and then when it does its secondary completion (summary or autocomplete or some other "helper" function) it uses the default context window (either explicitly with `"options":{"num_ctx":2048}` or implicitly by not setting `num_ctx`). Unfortunately a change in context window results in a model eviction and immediate reload, which could cause the delays you are seeing - the actual completion finishes in seconds, but all the model unloading/loading around it makes it seem slow. I think you will have to poke around in the open-webui settings and either turn off these functions or configure them to use the same context window as the primary completion. There is an [open PR](https://github.com/ollama/ollama/pull/8029) which would alleviate this problem but it's not ready for integration yet.
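As a hedged sketch of the workaround for API clients: pin `num_ctx` on every request, helper calls included, so the runner's `--ctx-size` never changes between calls. The payload shape follows Ollama's `/api/chat` options; the host, model name, and messages below are placeholders:

```python
import json

OLLAMA_URL = "http://localhost:11434/api/chat"  # placeholder host
NUM_CTX = 8192  # match the primary completion's context window

def chat_payload(model, messages):
    # Pin num_ctx explicitly; omitting it falls back to the server default
    # (2048 at the time of this thread), which forces a model reload.
    return {
        "model": model,
        "messages": messages,
        "options": {"num_ctx": NUM_CTX},
    }

primary = chat_payload("qwen2.5:7b", [{"role": "user", "content": "Hello"}])
helper = chat_payload("qwen2.5:7b", [{"role": "user", "content": "Summarize: Hello"}])

# Both requests carry the same num_ctx, so the scheduler sees a stable
# context size and does not evict the model between calls.
assert primary["options"] == helper["options"]
print(json.dumps(primary["options"]))
```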

<!-- gh-comment-id:2560395085 -->

@clduab11 commented on GitHub (Dec 24, 2024):

Oh wow, and to think I had seen that earlier this morning and was like "hmm that's odd, I wonder why my num_ctx is set at 2048 for that..." and figured the OWUI interface just "overrode" it somehow. This makes perfect sense, and I super appreciate you going out of your way to help me with this! I will reach out to the folks on OWUI's end and see where I should be configuring this to help alleviate some of the delay.

Thank you so so much! Ollama rocks!! :)

EDIT: Set the num_ctx at the model level (instead of at the system level) and disabled the Autogeneration feature in OWUI; this brought my initial load down by a substantial margin, and further conversation-style prompts to the model load as they should; woo! :)

Will eagerly await the next awesome update and the PR to be able to use the AutoComplete feature again without it evicting the model!
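For anyone landing here later, the model-level fix described above can be done with a Modelfile; a minimal sketch, where the base model name and the new model tag are illustrative:

```
FROM qwen2.5:7b
PARAMETER num_ctx 8192
```

Built with something like `ollama create qwen2.5-8k -f Modelfile`, every load of that model then requests the same context size, so requests that don't override `num_ctx` no longer trigger an eviction/reload cycle.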

<!-- gh-comment-id:2560465774 -->

@tne-ops commented on GitHub (Jan 31, 2025):

Just a little confused: the issue in ollama is closed but the ollama PR is still open after a month? It doesn't seem fixed? And causes very slow performance with open webui :-)

<!-- gh-comment-id:2628560206 -->

@rick-github commented on GitHub (Jan 31, 2025):

The performance decline was likely due to licensing issues, but the OP didn't respond to a request for more information, so this issue was closed as stale. The unrelated issue from a different poster would be resolved by the PR, but the ollama team are busy with other things. Feel free to open a new issue to highlight the need for the PR to be merged.

<!-- gh-comment-id:2628569361 -->

Reference: github-starred/ollama#67125