[GH-ISSUE #6148] Model unloaded each request if OLLAMA_NUM_PARALLEL > 1 #3839

Closed
opened 2026-04-12 14:40:28 -05:00 by GiteaMirror · 30 comments

Originally created by @abes200 on GitHub (Aug 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6148

What is the issue?

I did see an issue where this was mentioned, but it was closed and marked as fixed in version 0.2.1.
I wasn't having this issue when I was using 0.3.0. I missed a few updates, but after updating to the most recent version, if I have OLLAMA_NUM_PARALLEL in my system variables, or use it as an option in Python via Ollama, the model reloads for every request that is sent.
Just to clarify, using the Ollama CLI on Windows:
ollama run gemma2
<model is loaded>
send a message
<model is unloaded and reloaded then responds>
send another message
<model is unloaded and reloaded then responds>

Remove OLLAMA_NUM_PARALLEL from the system variables and the model loads and responds as normal.
Models seem to take a little more memory to load when I have OLLAMA_NUM_PARALLEL in my system variables than without it.
However, whether I have it set or not, I am no longer able to make parallel requests to models. Without it, requests are always queued; with it, they are still queued, but the models unload and reload before each response.

Have I missed something obvious somewhere? That does happen a lot.
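For reference, a minimal sketch of the kind of concurrent Python client usage described here, assuming the ollama Python package and a locally running server (the model name and thread count are illustrative):

```python
# Sketch only: two chat requests in flight at once, so that the server's
# OLLAMA_NUM_PARALLEL slots (if > 1) could serve them concurrently.
from concurrent.futures import ThreadPoolExecutor

import ollama  # pip install ollama

def ask(prompt: str) -> str:
    resp = ollama.chat(
        model="gemma2",  # model from this issue; any pulled model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp["message"]["content"]

if __name__ == "__main__":
    prompts = ["Hello", "Say hello again for me."]
    with ThreadPoolExecutor(max_workers=2) as pool:
        for answer in pool.map(ask, prompts):
            print(answer)
```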

OS

Windows

GPU

Nvidia

CPU

AMD

Ollama version

0.3.3

GiteaMirror added the bug label 2026-04-12 14:40:28 -05:00

@rick-github commented on GitHub (Aug 3, 2024):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may help in debugging. I just tried it and there was no model re-load between messages.

$ ollama run gemma2
>>> hello
Hello! 👋

How can I help you today? 😊

>>> hello
Hello!  😊

Is there anything I can do for you?  Perhaps you have a question, want to write something creative, or just want
to chat?
>>>
$ docker compose logs ollama | egrep " POST|llama_model_loader:.*.ollama/models/blobs|OLLAMA_NUM_PARALLEL"
ollama  | 2024/08/03 17:44:47 routes.go:1108: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:4 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:2 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
ollama  | [GIN] 2024/08/03 - 17:44:58 | 200 |   36.207579ms |   192.168.192.1 | POST     "/api/show"
ollama  | llama_model_loader: loaded meta data with 29 key-value pairs and 464 tensors from /root/.ollama/models/blobs/sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 (version GGUF V3 (latest))
ollama  | [GIN] 2024/08/03 - 17:45:00 | 200 |   1.86565098s |   192.168.192.1 | POST     "/api/chat"
ollama  | [GIN] 2024/08/03 - 17:45:03 | 200 |  322.760164ms |   192.168.192.1 | POST     "/api/chat"
ollama  | [GIN] 2024/08/03 - 17:45:08 | 200 |  742.126561ms |   192.168.192.1 | POST     "/api/chat"
$ curl localhost:11434/api/version
{"version":"0.3.3"}

Parallel requests were also handled correctly (OLLAMA_NUM_PARALLEL:2):

ollama  | [GIN] 2024/08/03 - 17:57:27 | 200 |  3.982415041s |   192.168.192.1 | POST     "/api/generate"
ollama  | [GIN] 2024/08/03 - 17:57:28 | 200 |  3.906120315s |   192.168.192.1 | POST     "/api/generate"

@abes200 commented on GitHub (Aug 3, 2024):

I've run it with debug enabled and compared a log between parallel=1 and 2. Everything is basically the same up to this point:
parallel = 2

time=2024-08-04T06:20:26.030 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:26.030 level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0

parallel = 1

time=2024-08-04T06:26:43.939 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="12488" timestamp=1722718603
time=2024-08-04T06:26:43.942 level=DEBUG source=routes.go:1346 msg="chat request" images=0 prompt="<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\n"

For some reason, whenever parallel is set to anything higher than 1, it now immediately clears the model to make room. I would include the entire log, but as it contains a lot of information about my personal computer I'd rather not share it publicly. I had to remove my username and timezone offset from the small bits above.
So this was working until a few days ago when I updated to the latest version. Now it just immediately clears the model to make room for a new one on every request. Very frustrating, as a Python program I was working on requires multithreading to work correctly, and now I have the model reloading 3 times on every message instead.
If however I use a tiny model that I can load multiple copies of, it works just fine. It's as if it now checks whether there is enough memory to load a second model, and if there is, it just goes ahead with the request; it doesn't load a second model or make any other changes to memory. But if there isn't enough memory left to load a second model, it clears the first and loads a new one before it responds.
Just to be clear, the model has finished loading and I still have a few GB of memory left to spare, but the model unloads and reloads every time before it responds.
I was hoping there was something obvious I have missed or something that changed how I should be using OLLAMA_NUM_PARALLEL in some way.
Also, I did notice someone else mentioned they had this issue on the closed issue about 3 weeks ago, which would be a little after I last updated: https://github.com/ollama/ollama/issues/5557
If anyone figures out what stupid thing I have done this time please let me know. Thanks!


@rick-github commented on GitHub (Aug 3, 2024):

Server logs really would make this easier. At the very least, egrep " POST|ctx" from the logs.


@abes200 commented on GitHub (Aug 3, 2024):

egrep " POST|ctx"
I do not know what that is sorry.


@rick-github commented on GitHub (Aug 3, 2024):

findstr /c:" POST|ctx" logfile


@abes200 commented on GitHub (Aug 3, 2024):

Right... so am I correct in assuming you are looking for the string " POST|ctx" in the log file?
It does not exist.
findstr /c:" POST|ctx" server.log
is what I am assuming you mean... searching the log file for that string? That command results in literally nothing happening.

C:\Users\<username>\AppData\Local\Ollama> findstr /c:" POST|ctx" server.log
C:\Users\<username>\AppData\Local\Ollama>

I also searched for "POST" in a server log that reloaded, here are the lines where the word POST are mentioned:

[GIN] 2024/08/04 - 06:18:50 | 200 |            0s |       127.0.0.1 | HEAD     "/"
[GIN] 2024/08/04 - 06:18:50 | 200 |     56.7317ms |       127.0.0.1 | POST     "/api/show"
time=2024-08-04T06:18:50.386 level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="16.0 GiB" before.free="9.9 GiB" before.free_swap="14.0 GiB" now.total="16.0 GiB" now.free="9.8 GiB" now.free_swap="14.0 GiB"
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=7771 status=200 tid="32" timestamp=1722718248
[GIN] 2024/08/04 - 06:20:48 | 200 |   22.6509886s |       127.0.0.1 | POST     "/api/chat"
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=7786 status=200 tid="14772" timestamp=1722718293
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2 tid="9192" timestamp=1722718293
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=7787 status=200 tid="14356" timestamp=1722718293
time=2024-08-04T06:21:33.963 level=DEBUG source=routes.go:1346 msg="chat request" images=0 prompt="<start_of_turn>user\nHello<end_of_turn>\n<start_of_turn>model\nHello! 👋  How can I help you today? 😊<end_of_turn>\n<start_of_turn>user\nJust running a test. Say hello again for me.<end_of_turn>\n<start_of_turn>model\n"

It really would be easier if I could just share the whole log. Perhaps it could be changed so that when it creates the log it doesn't include things like my Windows username, GPU id, complete path lists, and my timezone. I'm not sure if there are other things specific to my computer/data/location included in the logs which could be considered personal information.
There are these lines which show an error, although I am not sure what that means. There do not seem to be any other errors occurring in the logs.

time=2024-08-04T06:21:25.386 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-04T06:21:25.386 level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-08-04T06:21:25.390 level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"

Although those lines are also in the logs where the models do not reload when I have OLLAMA_NUM_PARALLEL = 1.


@rick-github commented on GitHub (Aug 3, 2024):

"POST|ctx" is an OR operation. I'm looking for lines that contain " POST" and "ctx" and the order in which they occur. Looking for POST with leading space will limit the potential PII. I haven't used windows for a long time so I'm sorry the command wasn't quite right, does findstr /c:" POST" /c:ctx generate output?


@abes200 commented on GitHub (Aug 4, 2024):

Ah, I see what you're after. God I wish I could say I haven't used Windows in a long time...
findstr /c:" POST" /c:"ctx" <logfile> works in getting what you're after... I hope.
To help ensure consistency, I performed 2 new tests identical to each other except that I changed OLLAMA_NUM_PARALLEL from 2 to 1 between the first and second test.
Started with the Ollama app closed, loaded it from the start menu, ran "ollama run gemma2", waited for the model to load, said "Hello" and waited for the response, then "/bye" and closed the Ollama app. Then I ran findstr /c:" POST" /c:"ctx" server.log. Changed parallel from 2 to 1 and repeated.

These are the results from the log file which is the one that has parallel at 2 and reloads the model:

[GIN] 2024/08/04 - 09:21:55 | 200 |     56.7349ms |       127.0.0.1 | POST     "/api/show"
time=2024-08-04T09:21:55.565 level=INFO source=server.go:384 msg="starting llama server" cmd="C:\\Users\\<username>\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\<username>\\.ollama\\models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 4096 --batch-size 512 --embedding --log-disable --n-gpu-layers 23 --verbose --no-mmap --parallel 1 --port 8839"
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_tensors: ggml ctx size =    0.41 MiB
llama_new_context_with_model: n_ctx      = 4096
DEBUG [initialize] new slot | n_ctx_slot=4096 slot_id=0 tid="14792" timestamp=1722729127
[GIN] 2024/08/04 - 09:22:08 | 200 |   12.5907395s |       127.0.0.1 | POST     "/api/chat"
time=2024-08-04T09:22:16.499 level=INFO source=server.go:384 msg="starting llama server" cmd="C:\\Users\\<username>\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\<username>\\.ollama\\models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 4096 --batch-size 512 --embedding --log-disable --n-gpu-layers 24 --verbose --no-mmap --parallel 1 --port 8853"
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_tensors: ggml ctx size =    0.41 MiB
llama_new_context_with_model: n_ctx      = 4096
DEBUG [initialize] new slot | n_ctx_slot=4096 slot_id=0 tid="10496" timestamp=1722729148
DEBUG [update_slots] slot released | n_cache_tokens=22 n_ctx=4096 n_past=21 n_system_tokens=0 slot_id=0 task_id=2 tid="10496" timestamp=1722729159 truncated=false
[GIN] 2024/08/04 - 09:22:39 | 200 |   24.1314859s |       127.0.0.1 | POST     "/api/chat"

These are the results from the log file with parallel set to 1:

[GIN] 2024/08/04 - 09:27:03 | 200 |     58.0161ms |       127.0.0.1 | POST     "/api/show"
time=2024-08-04T09:27:03.807 level=INFO source=server.go:384 msg="starting llama server" cmd="C:\\Users\\<username>\\AppData\\Local\\Programs\\Ollama\\ollama_runners\\cuda_v11.3\\ollama_llama_server.exe --model C:\\Users\\<username>\\.ollama\\models\\blobs\\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 27 --verbose --no-mmap --parallel 1 --port 8902"
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_tensors: ggml ctx size =    0.41 MiB
llama_new_context_with_model: n_ctx      = 2048
DEBUG [initialize] new slot | n_ctx_slot=2048 slot_id=0 tid="6180" timestamp=1722729433
[GIN] 2024/08/04 - 09:27:13 | 200 |    9.8516782s |       127.0.0.1 | POST     "/api/chat"
DEBUG [update_slots] slot released | n_cache_tokens=23 n_ctx=2048 n_past=22 n_system_tokens=0 slot_id=0 task_id=3 tid="6180" timestamp=1722729453 truncated=false
[GIN] 2024/08/04 - 09:27:33 | 200 |    9.9535589s |       127.0.0.1 | POST     "/api/chat"

I did not make any changes to context sizes or settings of any kind except the parallel variable in the system variables, and I definitely only said "Hello" to the model once on both occasions and nothing else.


@rick-github commented on GitHub (Aug 4, 2024):

--n-gpu-layers changed between runs for OLLAMA_NUM_PARALLEL=2. What does findstr /c:layers.model show? And what's the output of nvidia-smi?


@rick-github commented on GitHub (Aug 4, 2024):

In the meantime, you may be able to stop the model reloading behaviour by explicitly setting the number of layers to be offloaded, see https://github.com/ollama/ollama/issues/5913#issuecomment-2248262520.
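If it is easier to test from the Python client, the same idea can be tried per request by pinning num_gpu in the request options; this is only a sketch with an illustrative layer count, not a confirmed fix:

```python
# Sketch: pin the GPU offload count for one request instead of letting the
# scheduler re-estimate it (the value 24 is illustrative, taken from the logs above).
import ollama

resp = ollama.chat(
    model="gemma2",
    messages=[{"role": "user", "content": "Hello"}],
    options={"num_gpu": 24},  # fixed number of offloaded layers; tune for your VRAM
)
print(resp["message"]["content"])
```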


@abes200 commented on GitHub (Aug 4, 2024):

layers.model for the log with the parallel at 2 is:

time=2024-08-04T09:21:55.543+09:30 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=23 layers.split="" memory.available="[5.0 GiB]" memory.required.full="8.2 GiB" memory.required.partial="4.9 GiB" memory.required.kv="1.3 GiB" memory.required.allocations="[4.9 GiB]" memory.weights.total="5.7 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"
time=2024-08-04T09:22:16.487+09:30 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=24 layers.split="" memory.available="[5.0 GiB]" memory.required.full="8.2 GiB" memory.required.partial="5.0 GiB" memory.required.kv="1.3 GiB" memory.required.allocations="[5.0 GiB]" memory.weights.total="5.7 GiB" memory.weights.repeating="5.0 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"

and for parallel 1 it is:

time=2024-08-04T09:27:03.784+09:30 level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=43 layers.offload=27 layers.split="" memory.available="[5.0 GiB]" memory.required.full="7.5 GiB" memory.required.partial="5.0 GiB" memory.required.kv="672.0 MiB" memory.required.allocations="[5.0 GiB]" memory.weights.total="5.0 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="717.8 MiB" memory.graph.full="507.0 MiB" memory.graph.partial="1.2 GiB"

The only reason it should be making any changes to how many layers get loaded is its own internal settings and auto-detection.
Actually, most of the outputs in nvidia-smi show N/A. I do just want to restate that this was working about a week ago, before I updated the Ollama app, and this started happening immediately after the update. As in: I was working with a model, decided perhaps I should get that new update, installed the update, and then the model started reloading for every response.
As a note, I have noticed that it does occasionally change how many layers get sent to the GPU from one load to the next. It's possible there are some background tasks in Windows which might alter it a little. I have not noticed a large difference in performance for the most part.

I attempted the suggested workaround of setting the GPU layers, but that doesn't seem to work. I have tried 24, 25, and 27 so far. I tried setting it using /set parameter num_gpu while the model was running, and also setting it in a Modelfile and creating a new model from gemma2.
The models still keep reloading if parallel is present in the variables and set to anything other than 1 or 0.


@rick-github commented on GitHub (Aug 4, 2024):

Do you have OLLAMA_KEEP_ALIVE set in the server environment (findstr /c:OLLAMA_KEEP_ALIVE server.log), or do you have a keep_alive option in your API call from your python script?
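For reference, keep_alive can be set as a server environment variable or passed per request from the Python client; a sketch of the client-side form, with an illustrative value:

```python
# Sketch: a per-request keep_alive overrides the server default for that model.
import ollama

resp = ollama.chat(
    model="gemma2",
    messages=[{"role": "user", "content": "Hello"}],
    keep_alive="10m",  # keep the model loaded for 10 minutes after this call
)
print(resp["message"]["content"])
```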


@abes200 commented on GitHub (Aug 4, 2024):

I pulled OLLAMA_KEEP_ALIVE:10m0s from the log file.
Also I do have that set in the system variables, yes.


@rick-github commented on GitHub (Aug 4, 2024):

What does findstr /c:sched.go server.log show?


@abes200 commented on GitHub (Aug 4, 2024):

Using that on the original log file that reloads the models:

time=2024-08-04T06:18:37.397 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-04T06:18:50.451 level=DEBUG source=sched.go:219 msg="loading first model" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:18:50.503 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-04T06:19:00.101 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:19:00.101 level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-04T06:19:00.104 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=10m0s
time=2024-08-04T06:19:00.104 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
time=2024-08-04T06:20:26.030 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:26.030 level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
time=2024-08-04T06:20:26.030 level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:26.030 level=DEBUG source=sched.go:355 msg="runner expired event received" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:26.030 level=DEBUG source=sched.go:371 msg="got lock to unload" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:26.396 level=DEBUG source=sched.go:655 msg="gpu VRAM free memory converged after 0.37 seconds" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:27.001 level=DEBUG source=sched.go:376 msg="runner released" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:27.001 level=DEBUG source=sched.go:380 msg="sending an unloaded event" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:27.001 level=DEBUG source=sched.go:297 msg="unload completed" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:27.089 level=DEBUG source=sched.go:219 msg="loading first model" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:27.138 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-04T06:20:36.792 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:20:48.627 level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-04T06:20:48.627 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=10m0s
time=2024-08-04T06:20:48.627 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
time=2024-08-04T06:21:24.350 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:24.350 level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
time=2024-08-04T06:21:24.350 level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:24.350 level=DEBUG source=sched.go:355 msg="runner expired event received" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:24.350 level=DEBUG source=sched.go:371 msg="got lock to unload" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:24.717 level=DEBUG source=sched.go:655 msg="gpu VRAM free memory converged after 0.37 seconds" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:25.289 level=DEBUG source=sched.go:376 msg="runner released" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:25.289 level=DEBUG source=sched.go:380 msg="sending an unloaded event" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:25.289 level=DEBUG source=sched.go:297 msg="unload completed" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:25.356 level=DEBUG source=sched.go:219 msg="loading first model" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:25.386 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-04T06:21:33.957 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:21:36.953 level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-04T06:21:36.954 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=10m0s
time=2024-08-04T06:21:36.954 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
time=2024-08-04T06:21:46.780 level=DEBUG source=sched.go:313 msg="shutting down scheduler completed loop"
time=2024-08-04T06:21:46.780 level=DEBUG source=sched.go:119 msg="shutting down scheduler pending loop"
time=2024-08-04T06:21:46.780 level=DEBUG source=sched.go:791 msg="shutting down runner" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373

And for the log that did not reload the model:

time=2024-08-04T06:26:21.776 level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-04T06:26:29.480 level=DEBUG source=sched.go:219 msg="loading first model" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:26:29.575 level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-04T06:26:37.120 level=DEBUG source=sched.go:458 msg="finished setting up runner" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:26:37.120 level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-04T06:26:37.120 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=10m0s
time=2024-08-04T06:26:37.120 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
time=2024-08-04T06:26:43.939 level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373
time=2024-08-04T06:26:54.155 level=DEBUG source=sched.go:403 msg="context for request finished"
time=2024-08-04T06:26:54.156 level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 duration=10m0s
time=2024-08-04T06:26:54.156 level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373 refCount=0
time=2024-08-04T06:27:03.291 level=DEBUG source=sched.go:313 msg="shutting down scheduler completed loop"
time=2024-08-04T06:27:03.291 level=DEBUG source=sched.go:119 msg="shutting down scheduler pending loop"
time=2024-08-04T06:27:03.291 level=DEBUG source=sched.go:791 msg="shutting down runner" model=C:\Users\<username>\.ollama\models\blobs\sha256-ff1d1fc78170d787ee1201778e2dd65ea211654ca5fb7d69b5a2e7b123a50373

@abes200 commented on GitHub (Aug 4, 2024):

After looking through the code on GitHub, I think I can say with some confidence that I have no idea what I am looking at. My Python skills are nowhere near good enough to work it out.
I thought maybe it was something to do with the `needsReload` function, but then it should be doing it for everyone, so it must be an edge case of some kind.
Could it have something to do with how it calculates the free vram on my card specifically when parallelism is occurring?
Graphics: NVIDIA GeForce GTX 1660 SUPER, compute capability 7.5, VMM: yes
CPU: AMD FX-8350 4.0 GHz
Mem: 16 GB

I question this because if I set num_gpu to 0, I can run models in parallel with no problem and get 2 responses at once, and the model does not reload except when it is supposed to, although it's quite a bit slower.

I'll just add that currently the only solution is to disable any parallel calls; although this means everything is now queued, it does at least mean the model is not constantly reloading (and the requests were being queued anyway).
I've tried changing and adding multiple different options; `OLLAMA_NUM_PARALLEL` is the only one that causes the model to reload unless it is set to 0 or 1. If set to 2 or higher, the model will reload for every message and only ever respond to 1 message at a time.
Is there anyone else having this issue or am I completely alone? If I work out what stupid thing I have done to break it, I'll post it.
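A quick way to check whether a setup is affected is to fire two requests at once and watch the server log for an unload/"loaded runners" cycle in between. This is only a minimal sketch; the model tag, the default port, and the `requests` dependency are assumptions, not from this thread.

```python
import threading
import requests

URL = "http://localhost:11434/api/generate"  # assumed default Ollama port

def ask(prompt: str) -> None:
    # stream=False so the whole response comes back as a single JSON object
    r = requests.post(URL, json={"model": "gemma2", "prompt": prompt, "stream": False}, timeout=600)
    r.raise_for_status()
    print(prompt, "->", r.json().get("eval_count"), "tokens generated")

# Two concurrent requests: with OLLAMA_NUM_PARALLEL > 1 both should be served
# by the runner that is already loaded. A reload in the server log between
# them reproduces the behaviour described above.
threads = [threading.Thread(target=ask, args=(p,)) for p in ("hello", "hello again")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```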

Author
Owner

@rick-github commented on GitHub (Aug 5, 2024):

You are right that `needsReload` seems implicated here. I added some debugging over the weekend but was unable to see anything unusual; somebody who's actually experiencing the problem will have to dig around in the code. It is very curious that only a small subset of people who trigger that code are affected; so far I have no explanation.

Author
Owner

@abes200 commented on GitHub (Aug 6, 2024):

It's a bit of a pickle. I really appreciate your attempts to find a solution. Thank you for the help.
I tried to keep looking myself, but adding debug code and testing it is a bit beyond me.

I tried a few things over the weekend as well, but nothing as interesting as yours: different combinations of settings, different models, even updated my drivers.
All I can really confirm is that if OLLAMA_NUM_PARALLEL is 0 or 1 it works fine with no reloading, and that higher numbers make it use more memory when loading the model and make it unload/reload the model on every input. Nothing else seems to have any impact on this.
Except for one thing. If I load a very small model like Qwen2 0.5b, of which I could fit more than one in memory, the model works and handles parallel requests just fine: no reloading, and it doesn't load a second copy. If I try Qwen2 7b, of which only one fits, it reloads on every request.
So it is specific to my GPU memory: if I can fit more than one copy of the model in GPU memory, it does not reload and works fine; if only one fits, it reloads.
Why that only happens with parallel enabled, I have no clue.

I did just run another test: if I expand Qwen2 0.5b's num_ctx to 65536, it takes up most of my GPU memory, so now only one of them fits. This caused the model to start reloading on every request, like the 7b model does, but again only with parallel > 1. Not sure if that's helpful, but I thought of it just before posting, so I figured I'd share the result.
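For reference, that num_ctx test can be reproduced with a single call that passes the option per request. This is only a sketch under assumptions: the `qwen2:0.5b` tag, the default port, and the `requests` package are not from this thread.

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2:0.5b",
        "prompt": "hello",
        "stream": False,
        # a huge context makes the KV cache fill most of the VRAM,
        # so only one copy of the model fits on the GPU
        "options": {"num_ctx": 65536},
    },
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
```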

Author
Owner

@MHugonKaliop commented on GitHub (Aug 6, 2024):

Hello
I have the same issue with a different configuration: Linux / Ubuntu / CUDA / NVIDIA L4 with 24 GB VRAM

ollama 0.3.3 with these models:

NAME                   	ID          	SIZE  	MODIFIED     
codestral:latest       	fcc0019dcee9	12 GB 	23 hours ago	
nomic-embed-text:latest	0a109f422b47	274 MB	8 weeks ago 	

I can't say for sure which ollama version introduced this issue, but I don't think it was before 0.3.0, because the issue is new for me.

My initial env variables were:

Environment="OLLAMA_HOST=0.0.0.0:9000"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"

I first had an issue with Codestral not offloading all layers to the GPU, resulting in very slow responses.

Now my env variables are:

Environment="OLLAMA_HOST=0.0.0.0:9000"
#Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"

Codestral correctly offloads all layers to the GPU, using around 18 GB of VRAM, and the throughput is back to normal.

But calling an embedding unloads Codestral every time (it's then reloaded on the next call).
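The kind of embedding request that triggers this is just the following; a sketch only, assuming the host/port from the config above and the nomic-embed-text model, and using the `requests` package.

```python
import requests

r = requests.post(
    "http://localhost:9000/api/embed",  # OLLAMA_HOST=0.0.0.0:9000 as configured above
    json={"model": "nomic-embed-text", "input": "some text to embed"},
    timeout=120,
)
r.raise_for_status()
# the 0.3.x embed endpoint returns a list of vectors under "embeddings"
print(len(r.json()["embeddings"][0]), "dimensions")
```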

The log shows this:

Aug 06 07:02:12 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:02:12.587Z level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb library=cuda total="22.0 GiB" available="499.0 MiB"
Aug 06 07:02:13 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:02:13.618Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb parallel=4 available=23368695808 required="1.0 GiB"
Aug 06 07:02:13 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:02:13.619Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[21.8 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB" memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
Aug 06 07:02:13 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:02:13.619Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama3672705594/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --flash-attn --parallel 4 --port 46853"
Aug 06 07:02:13 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:02:13.619Z level=INFO source=sched.go:445 msg="loaded runners" count=1
Aug 06 07:02:13 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:02:13.619Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
Aug 06 07:02:13 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:02:13.620Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
Aug 06 07:02:13 copilot-20240729-134717 ollama[66449]: INFO [main] build info | build=1 commit="6eeaeba" tid="123697634357248" timestamp=1722927733
Aug 06 07:02:13 copilot-20240729-134717 ollama[66449]: INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123697634357248" timestamp=1722927733 total_threads=12
Aug 06 07:02:13 copilot-20240729-134717 ollama[66449]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="46853" tid="123697634357248" timestamp=1722927733

And I've just witnessed something else.
Codestral is correctly loaded, and /api/generate is called regularly.
But if I send a simple embed query from the command line using the same codestral model, it gets unloaded!

Aug 06 07:19:39 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:39 | 200 | 10.561522924s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:40 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:40 | 200 | 11.207007378s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:40 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:40 | 200 | 11.197939962s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:41 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:41 | 200 | 11.955991229s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:41 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:41 | 200 | 11.875127314s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:42 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:42 | 200 | 11.720965167s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:42 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:42 | 200 | 11.685984404s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:42 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:42 | 200 | 11.233989232s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:43 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:43 | 200 | 11.264099337s |   145.239.103.2 | POST     "/api/generate"
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:44.915Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb parallel=4 available=23368695808 required="14.8 GiB"
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:44.916Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=57 layers.offload=57 layers.split="" memory.available="[21.8 GiB]" memory.required.full="14.8 GiB" memory.required.partial="14.8 GiB" memory.required.kv="1.8 GiB" memory.required.allocations="[14.8 GiB]" memory.weights.total="13.2 GiB" memory.weights.repeating="13.0 GiB" memory.weights.nonrepeating="157.5 MiB" memory.graph.full="832.0 MiB" memory.graph.partial="860.3 MiB"
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:44.916Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama3672705594/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e --ctx-size 8192 --batch-size 512 --embedding --log-disable --n-gpu-layers 57 --flash-attn --parallel 4 --port 44267"
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:44.916Z level=INFO source=sched.go:445 msg="loaded runners" count=1
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:44.916Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:44.917Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
Aug 06 07:19:44 copilot-20240729-134717 ollama[97394]: INFO [main] build info | build=1 commit="6eeaeba" tid="123224827387904" timestamp=1722928784
Aug 06 07:19:44 copilot-20240729-134717 ollama[97394]: INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123224827387904" timestamp=1722928784 total_threads=12
Aug 06 07:19:44 copilot-20240729-134717 ollama[97394]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="44267" tid="123224827387904" timestamp=1722928784
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: loaded meta data with 25 key-value pairs and 507 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e (version GGUF V3 (latest))
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   1:                               general.name str              = Codestral-22B-v0.1
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   2:                          llama.block_count u32              = 56
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   4:                     llama.embedding_length u32              = 6144
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 16384
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 48
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  10:                          general.file_type u32              = 2
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  13:            tokenizer.ggml.add_space_prefix bool             = true
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - kv  24:               general.quantization_version u32              = 2
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - type  f32:  113 tensors
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - type q4_0:  393 tensors
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llama_model_loader: - type q6_K:    1 tensors
Aug 06 07:19:44 copilot-20240729-134717 ollama[588]: llm_load_vocab: special tokens cache size = 771
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_vocab: token to piece cache size = 0.1731 MB
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: format           = GGUF V3 (latest)
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: arch             = llama
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: vocab type       = SPM
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_vocab          = 32768
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_merges         = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: vocab_only       = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_ctx_train      = 32768
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_embd           = 6144
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_layer          = 56
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_head           = 48
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_head_kv        = 8
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_rot            = 128
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_swa            = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_embd_head_k    = 128
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_embd_head_v    = 128
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_gqa            = 6
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_embd_k_gqa     = 1024
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_embd_v_gqa     = 1024
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_ff             = 16384
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_expert         = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_expert_used    = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: causal attn      = 1
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: pooling type     = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: rope type        = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: rope scaling     = linear
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: freq_base_train  = 1000000.0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: freq_scale_train = 1
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: n_ctx_orig_yarn  = 32768
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: rope_finetuned   = unknown
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: ssm_d_conv       = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: ssm_d_inner      = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: ssm_d_state      = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: ssm_dt_rank      = 0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: model type       = ?B
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: model ftype      = Q4_0
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: model params     = 22.25 B
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: model size       = 11.71 GiB (4.52 BPW)
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: general.name     = Codestral-22B-v0.1
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: BOS token        = 1 '<s>'
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: EOS token        = 2 '</s>'
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: UNK token        = 0 '<unk>'
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: LF token         = 781 '<0x0A>'
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_print_meta: max token length = 48
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: ggml_cuda_init: found 1 CUDA devices:
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]:   Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:45.167Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model"
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_tensors: ggml ctx size =    0.47 MiB
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_tensors: offloading 56 repeating layers to GPU
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_tensors: offloading non-repeating layers to GPU
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_tensors: offloaded 57/57 layers to GPU
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_tensors:        CPU buffer size =   108.00 MiB
Aug 06 07:19:45 copilot-20240729-134717 ollama[588]: llm_load_tensors:      CUDA0 buffer size = 11878.15 MiB
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: n_ctx      = 8192
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: n_batch    = 512
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: n_ubatch   = 512
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: flash_attn = 1
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: freq_base  = 1000000.0
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: freq_scale = 1
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_kv_cache_init:      CUDA0 KV buffer size =  1792.00 MiB
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.59 MiB
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model:      CUDA0 compute buffer size =   104.00 MiB
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model:  CUDA_Host compute buffer size =    28.01 MiB
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: graph nodes  = 1575
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: llama_new_context_with_model: graph splits = 2
Aug 06 07:19:48 copilot-20240729-134717 ollama[97394]: INFO [main] model loaded | tid="123224827387904" timestamp=1722928788
Aug 06 07:19:48 copilot-20240729-134717 ollama[588]: time=2024-08-06T07:19:48.929Z level=INFO source=server.go:623 msg="llama runner started in 4.01 seconds"
Aug 06 07:19:49 copilot-20240729-134717 ollama[588]: [GIN] 2024/08/06 - 07:19:49 | 200 | 15.332279562s |   145.239.103.1 | POST     "/api/embed"

Author
Owner

@rick-github commented on GitHub (Aug 6, 2024):

@abes200 It really feels like a memory pressure problem. I know you tried adjusting `num_gpu`, but in the example you gave you are increasing the layers. To decrease the memory pressure you need to decrease the layers; have you tried `/set parameter num_gpu 20`?

@MHugonKaliop Your context window is changing between calls. You can add `"options":{"num_ctx":x}` to the calls to maintain a consistent window size, or, if you can't modify the API calls, you can create a new model with a default context window by adding `PARAMETER num_ctx x` to the Modelfile.
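For example, pinning the context size per request could look like this; a minimal sketch, assuming the host/port used above and an 8192-token window (pick whatever size your generate calls already use):

```python
import requests

payload = {
    "model": "codestral",
    "prompt": "write a hello world program",
    "stream": False,
    # keep this identical across every caller (generate and embed alike),
    # so the scheduler never sees a different context size and restarts the runner
    "options": {"num_ctx": 8192},
}
r = requests.post("http://localhost:9000/api/generate", json=payload, timeout=600)
r.raise_for_status()
print(r.json()["response"])
```

Alternatively, the same default can be baked into a derived model with `PARAMETER num_ctx 8192` in its Modelfile so callers don't need to pass options at all.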

Author
Owner

@MHugonKaliop commented on GitHub (Aug 6, 2024):

@rick-github
You were right; that explains the unloading when using /embed with the same model.
But I still have the issue when a call goes to the other model: the first one is removed from VRAM.

Author
Owner

@rick-github commented on GitHub (Aug 6, 2024):

There's not enough context in your log snippet to see what's going on; can you post the full server log?

Author
Owner

@abes200 commented on GitHub (Aug 6, 2024):

@rick-github
That is a good idea... I did notice it was adjusting the layers automatically as I changed the settings, lowering the layer count as I increased num_ctx and vice versa. I had just assumed it was working that out correctly.
After some testing it seems it's miscalculating the layers by 1. I changed num_ctx up and down to vary how many layers it was loading, and every time, if I reduced the layer count by 1 from what was in the log file, the reloading stopped.

So I'm still puzzled why it's not happening for everyone, or why it suddenly started happening recently.
But it does allow me to make a fix for myself! Thank you! Ollama is awesome, so I am happy again.
I'll leave this open for now until Ollama works automatically again (unless requested to close it), in case I discover it is something I did or you want me to try anything else.
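For anyone else hitting this, the workaround amounts to explicitly requesting one fewer GPU layer than the scheduler's own estimate. A sketch only: the model tag, port, and the layer count are placeholders; read the real count from the `offload to cuda ... layers.offload=N` line in your own server log.

```python
import requests

ESTIMATED_LAYERS = 29  # hypothetical value; take it from the server log's layers.offload field

payload = {
    "model": "gemma2",
    "prompt": "hello",
    "stream": False,
    # one layer fewer than the scheduler's estimate, so the runner is not evicted
    "options": {"num_gpu": ESTIMATED_LAYERS - 1},
}
r = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
r.raise_for_status()
print(r.json()["response"])
```

The interactive equivalent is `/set parameter num_gpu <N-1>` inside `ollama run`.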

Author
Owner

@MHugonKaliop commented on GitHub (Aug 6, 2024):

Here we go with the logs, with OLLAMA_DEBUG=1.

Initialisation phase:

2024/08/06 12:29:23 routes.go:1108: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_HOST:http://0.0.0.0:9000 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
time=2024-08-06T12:29:23.809Z level=INFO source=images.go:781 msg="total blobs: 9"
time=2024-08-06T12:29:23.809Z level=INFO source=images.go:788 msg="total unused blobs removed: 0"
time=2024-08-06T12:29:23.809Z level=INFO source=routes.go:1155 msg="Listening on [::]:9000 (version 0.3.3)"
time=2024-08-06T12:29:23.809Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2871822748/runners
time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz
time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz
time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz
time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz
time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz
time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz
time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz
time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/deps.txt.gz
time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/ollama_llama_server.gz
time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server
time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server
time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server
time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server
time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server
time=2024-08-06T12:29:28.626Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60102 cpu]"
time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
time=2024-08-06T12:29:28.626Z level=DEBUG source=sched.go:105 msg="starting llm scheduler"
time=2024-08-06T12:29:28.626Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
time=2024-08-06T12:29:28.627Z level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
time=2024-08-06T12:29:28.627Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
time=2024-08-06T12:29:28.627Z level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[/home/mhugon_kaliop_com/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
time=2024-08-06T12:29:28.628Z level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07]
CUDA driver version: 12.4
time=2024-08-06T12:29:28.738Z level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07
[GPU-635e6271-5712-7287-dd2d-29daacc179fb] CUDA totalMem 22478 mb
[GPU-635e6271-5712-7287-dd2d-29daacc179fb] CUDA freeMem 22286 mb
[GPU-635e6271-5712-7287-dd2d-29daacc179fb] Compute Capability 8.9
time=2024-08-06T12:29:28.913Z level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu"
releasing cuda driver library
time=2024-08-06T12:29:28.913Z level=INFO source=types.go:105 msg="inference compute" id=GPU-635e6271-5712-7287-dd2d-29daacc179fb library=cuda compute=8.9 driver=12.4 name="NVIDIA L4" total="22.0 GiB" available="21.8 GiB"
time=2024-08-06T12:29:47.990Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="47.0 GiB" before.free="46.1 GiB" before.free_swap="0 B" now.total="47.0 GiB" now.free="46.0 GiB" now.free_swap="0 B"
CUDA driver version: 12.4
time=2024-08-06T12:29:48.161Z level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb name="NVIDIA L4" overhead="0 B" before.total="22.0 GiB" before.free="21.8 GiB" now.total="22.0 GiB" now.free="21.8 GiB" now.used="192.2 MiB"
releasing cuda driver library
time=2024-08-06T12:29:48.167Z level=DEBUG source=sched.go:219 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:29:48.167Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2024-08-06T12:29:48.168Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2024-08-06T12:29:48.169Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2024-08-06T12:29:48.169Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2024-08-06T12:29:48.170Z level=DEBUG source=server.go:100 msg="system memory" total="47.0 GiB" free="46.0 GiB" free_swap="0 B"
time=2024-08-06T12:29:48.170Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2024-08-06T12:29:48.171Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=57 layers.offload=53 layers.split="" memory.available="[21.8 GiB]" memory.required.full="22.6 GiB" memory.required.partial="21.5 GiB" memory.required.kv="7.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="18.3 GiB" memory.weights.nonrepeating="157.5 MiB" memory.graph.full="3.1 GiB" memory.graph.partial="3.2 GiB"
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server
time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server
time=2024-08-06T12:29:48.172Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server
time=2024-08-06T12:29:48.172Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server
time=2024-08-06T12:29:48.172Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 53 --verbose --flash-attn --parallel 1 --port 39197"
time=2024-08-06T12:29:48.172Z level=DEBUG source=server.go:401 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin LD_LIBRARY_PATH=/tmp/ollama2871822748/runners/cuda_v11:/tmp/ollama2871822748/runners CUDA_VISIBLE_DEVICES=GPU-635e6271-5712-7287-dd2d-29daacc179fb]"
time=2024-08-06T12:29:48.173Z level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-06T12:29:48.173Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-08-06T12:29:48.173Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
INFO [main] build info | build=1 commit="6eeaeba" tid="127585882058752" timestamp=1722947388
INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="127585882058752" timestamp=1722947388 total_threads=12
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="39197" tid="127585882058752" timestamp=1722947388

Then comes the first call (an embedding), and Codestral loads:

INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="39197" tid="127585882058752" timestamp=1722947388
llama_model_loader: loaded meta data with 25 key-value pairs and 507 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Codestral-22B-v0.1
llama_model_loader: - kv   2:                          llama.block_count u32              = 56
llama_model_loader: - kv   3:                       llama.context_length u32              = 32768
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 6144
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 16384
llama_model_loader: - kv   6:                 llama.attention.head_count u32              = 48
llama_model_loader: - kv   7:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                       llama.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                           llama.vocab_size u32              = 32768
llama_model_loader: - kv  12:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  13:            tokenizer.ggml.add_space_prefix bool             = true
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  15:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,32768]   = ["<unk>", "<s>", "</s>", "[INST]", "[...
llama_model_loader: - kv  17:                      tokenizer.ggml.scores arr[f32,32768]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  18:                  tokenizer.ggml.token_type arr[i32,32768]   = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  22:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  23:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  24:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  113 tensors
llama_model_loader: - type q4_0:  393 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens cache size = 771
llm_load_vocab: token to piece cache size = 0.1731 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32768
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_layer          = 56
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 16384
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 22.25 B
llm_load_print_meta: model size       = 11.71 GiB (4.52 BPW) 
llm_load_print_meta: general.name     = Codestral-22B-v0.1
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 781 '<0x0A>'
llm_load_print_meta: max token length = 48
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
time=2024-08-06T12:29:48.425Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model"
llm_load_tensors: ggml ctx size =    0.47 MiB
llm_load_tensors: offloading 53 repeating layers to GPU
llm_load_tensors: offloaded 53/57 layers to GPU
llm_load_tensors:        CPU buffer size = 11986.15 MiB
llm_load_tensors:      CUDA0 buffer size = 11092.73 MiB
time=2024-08-06T12:29:49.178Z level=DEBUG source=server.go:629 msg="model load progress 0.10"
time=2024-08-06T12:29:49.429Z level=DEBUG source=server.go:629 msg="model load progress 0.20"
time=2024-08-06T12:29:49.680Z level=DEBUG source=server.go:629 msg="model load progress 0.30"
time=2024-08-06T12:29:49.931Z level=DEBUG source=server.go:629 msg="model load progress 0.39"
time=2024-08-06T12:29:50.181Z level=DEBUG source=server.go:629 msg="model load progress 0.48"
time=2024-08-06T12:29:50.432Z level=DEBUG source=server.go:629 msg="model load progress 0.58"
time=2024-08-06T12:29:50.683Z level=DEBUG source=server.go:629 msg="model load progress 0.68"
time=2024-08-06T12:29:50.934Z level=DEBUG source=server.go:629 msg="model load progress 0.77"
time=2024-08-06T12:29:51.185Z level=DEBUG source=server.go:629 msg="model load progress 0.86"
time=2024-08-06T12:29:51.436Z level=DEBUG source=server.go:629 msg="model load progress 0.96"
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
time=2024-08-06T12:29:51.688Z level=DEBUG source=server.go:629 msg="model load progress 1.00"
llama_kv_cache_init:  CUDA_Host KV buffer size =   384.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =  6784.00 MiB
llama_new_context_with_model: KV self size  = 7168.00 MiB, K (f16): 3584.00 MiB, V (f16): 3584.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.15 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   268.25 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    76.01 MiB
llama_new_context_with_model: graph nodes  = 1575
llama_new_context_with_model: graph splits = 37
time=2024-08-06T12:29:51.938Z level=DEBUG source=server.go:632 msg="model load completed, waiting for server to become available" status="llm server loading model"
DEBUG [initialize] initializing slots | n_slots=1 tid="127585882058752" timestamp=1722947392
DEBUG [initialize] new slot | n_ctx_slot=32768 slot_id=0 tid="127585882058752" timestamp=1722947392
INFO [main] model loaded | tid="127585882058752" timestamp=1722947392
DEBUG [update_slots] all slots are idle and system prompt is empty, clear the KV cache | tid="127585882058752" timestamp=1722947392
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=0 tid="127585882058752" timestamp=1722947392
time=2024-08-06T12:29:52.190Z level=INFO source=server.go:623 msg="llama runner started in 4.02 seconds"
time=2024-08-06T12:29:52.190Z level=DEBUG source=sched.go:458 msg="finished setting up runner" model=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=1 tid="127585882058752" timestamp=1722947392
DEBUG [log_server_request] request | method="POST" params={} path="/tokenize" remote_addr="127.0.0.1" remote_port=59190 status=200 tid="127585400893440" timestamp=1722947392
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=2 tid="127585882058752" timestamp=1722947392
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=3 tid="127585882058752" timestamp=1722947392
DEBUG [update_slots] kv cache rm [p0, end) | p0=0 slot_id=0 task_id=3 tid="127585882058752" timestamp=1722947392
DEBUG [update_slots] slot released | n_cache_tokens=7 n_ctx=32768 n_past=7 n_system_tokens=0 slot_id=0 task_id=3 tid="127585882058752" timestamp=1722947392 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/embedding" remote_addr="127.0.0.1" remote_port=59190 status=200 tid="127585400893440" timestamp=1722947392
[GIN] 2024/08/06 - 12:29:52 | 200 |  4.695654868s |   145.239.103.1 | POST     "/api/embed"
time=2024-08-06T12:29:52.685Z level=DEBUG source=sched.go:462 msg="context for request finished"
time=2024-08-06T12:29:52.685Z level=DEBUG source=sched.go:334 msg="runner with non-zero duration has gone idle, adding timer" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e duration=2562047h47m16.854775807s
time=2024-08-06T12:29:52.685Z level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e refCount=0
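
(For reference, the `POST /api/embed` seen above can be reproduced with a sketch like the one below. The host/port match the `OLLAMA_HOST` in the server config in this comment; the model tag `codestral` is an assumption, since the log only shows the blob path.)

```
# Hedged reproduction sketch of the embedding call seen in the log above.
# Assumptions: model tag "codestral", server listening on port 9000.
import requests

resp = requests.post(
    "http://localhost:9000/api/embed",
    json={"model": "codestral", "input": "hello world"},
)
resp.raise_for_status()
# /api/embed returns a list of embedding vectors under "embeddings"
print(len(resp.json()["embeddings"][0]))
```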

I've noticed that it offloads 53/57 layers to the GPU.

Then comes a call to /api/generate (same model):

time=2024-08-06T12:30:24.448Z level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=6 tid="127585882058752" timestamp=1722947424
time=2024-08-06T12:30:24.449Z level=DEBUG source=routes.go:208 msg="generate request" prompt="[SUFFIX]\n    print(\"... Loading reranker\")\n    global reranker\n    reranker = get_reranker_cohere()\n\n    # print(\"... Loading RAG chain\")\n\n    print(\"... Loading Classifier\")\n    checkpoint_path_hug = hf_hub_download(\n        repo_id=\"amaluuu/my-model_more_data\", filename=\"best-checkpoint.ckpt\"\n    )\n    global hug_model\n    hug_model = LightningModel.load_from_checkpoint(checkpoint_path_hug)\n[PREFIX])  # Default to Azure if not specified\n\n\ndef load_everything():\n    \"\"\"\n    Initialize and load all necessary components for Sofia.\n    \"\"\"\n    # Instanciate LLM, embedder and store:\n    print(\"... Langchain version:\", langchain.__version__)\n\n    print(\"... Loading LLMs\")\n    global llm\n    llm = get_mistral_llm()\n    global llm_solo\n    llm_solo = get_mistral_llm()\n    global llm_claude\n    llm_claude = get_llm_claude()\n\n    print(\"... Loading embedder\")\n    global embedder\n    embedder = get_embedder()\n\n    print(\"... Loading vector store\")\n    global vectorstore_chunks\n    global index_chunks\n    vectorstore_chunks, index_chunks = get_vectorstore_pinecone(embedder, \"chunks\")\n\n" images=[]
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=7 tid="127585882058752" timestamp=1722947424
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=8 tid="127585882058752" timestamp=1722947424
DEBUG [update_slots] slot progression | ga_i=0 n_past=1 n_past_se=0 n_prompt_tokens_processed=373 slot_id=0 task_id=8 tid="127585882058752" timestamp=1722947424
DEBUG [update_slots] kv cache rm [p0, end) | p0=1 slot_id=0 task_id=8 tid="127585882058752" timestamp=1722947424
time=2024-08-06T12:30:24.939Z level=DEBUG source=sched.go:571 msg="evaluating already loaded" model=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
DEBUG [process_single_task] slot data | n_idle_slots=0 n_processing_slots=1 task_id=10 tid="127585882058752" timestamp=1722947425
time=2024-08-06T12:30:25.118Z level=DEBUG source=routes.go:208 msg="generate request" prompt="[SUFFIX]\n    print(\"... Loading reranker\")\n    global reranker\n    reranker = get_reranker_cohere()\n\n    # print(\"... Loading RAG chain\")\n\n    print(\"... Loading Classifier\")\n    checkpoint_path_hug = hf_hub_download(\n        repo_id=\"amaluuu/my-model_more_data\", filename=\"best-checkpoint.ckpt\"\n    )\n    global hug_model\n    hug_model = LightningModel.load_from_checkpoint(checkpoint_path_hug)\n[PREFIX])  # Default to Azure if not specified\n\n\ndef load_everything():\n    \"\"\"\n    Initialize and load all necessary components for Sofia.\n    \"\"\"\n    # Instanciate LLM, embedder and store:\n    print(\"... Langchain version:\", langchain.__version__)\n\n    print(\"... Loading LLMs\")\n    global llm\n    llm = get_mistral_llm()\n    global llm_solo\n    llm_solo = get_mistral_llm()\n    global llm_claude\n    llm_claude = get_llm_claude()\n\n    print(\"... Loading embedder\")\n    global embedder\n    embedder = get_embedder()\n\n    print(\"... Loading vector store\")\n    global vectorstore_chunks\n    global index_chunks\n    vectorstore_chunks, index_chunks = get_vectorstore_pinecone(embedder, \"chunks\")\n\n\n" images=[]
time=2024-08-06T12:30:25.245Z level=DEBUG source=sched.go:403 msg="context for request finished"
time=2024-08-06T12:30:25.245Z level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e refCount=1
[GIN] 2024/08/06 - 12:30:25 | 200 |  803.513666ms |   145.239.103.1 | POST     "/api/generate"
DEBUG [process_single_task] slot data | n_idle_slots=0 n_processing_slots=1 task_id=13 tid="127585882058752" timestamp=1722947425
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=60170 status=200 tid="127585392500736" timestamp=1722947425
DEBUG [update_slots] slot released | n_cache_tokens=376 n_ctx=32768 n_past=375 n_system_tokens=0 slot_id=0 task_id=8 tid="127585882058752" timestamp=1722947425 truncated=false
DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=16 tid="127585882058752" timestamp=1722947425
DEBUG [prefix_slot] slot with common prefix found | 0=["slot_id",0,"characters",1092]
DEBUG [launch_slot_with_data] slot is processing task | slot_id=0 task_id=17 tid="127585882058752" timestamp=1722947425
DEBUG [update_slots] slot progression | ga_i=0 n_past=373 n_past_se=0 n_prompt_tokens_processed=374 slot_id=0 task_id=17 tid="127585882058752" timestamp=1722947425
DEBUG [update_slots] kv cache rm [p0, end) | p0=373 slot_id=0 task_id=17 tid="127585882058752" timestamp=1722947425
time=2024-08-06T12:30:26.022Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="47.0 GiB" before.free="46.0 GiB" before.free_swap="0 B" now.total="47.0 GiB" now.free="45.3 GiB" now.free_swap="0 B"
CUDA driver version: 12.4
time=2024-08-06T12:30:26.265Z level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb name="NVIDIA L4" overhead="0 B" before.total="22.0 GiB" before.free="21.8 GiB" now.total="22.0 GiB" now.free="3.8 GiB" now.used="18.2 GiB"
releasing cuda driver library
time=2024-08-06T12:30:26.269Z level=DEBUG source=sched.go:492 msg="gpu reported" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb library=cuda available="3.8 GiB"
time=2024-08-06T12:30:26.269Z level=INFO source=sched.go:503 msg="updated VRAM based on existing loaded models" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb library=cuda total="22.0 GiB" available="499.0 MiB"
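
(A note on the `available="499.0 MiB"` figure above: it does not seem to be the driver's free-memory report, which still shows 3.8 GiB free, but the scheduler's own bookkeeping after reserving space for the already-loaded Codestral runner. A rough illustration of that arithmetic, using the byte counts from the debug lines below; this is not Ollama's actual code:)

```
# Working backwards from the scheduler's numbers (illustration only).
GIB = 1024 ** 3

total_vram = 23570219008          # TotalMemory  (~22.0 GiB)
scheduler_available = 523270144   # FreeMemory   (~499.0 MiB)

reserved = total_vram - scheduler_available
print(f"{reserved / GIB:.2f} GiB")   # ~21.46 GiB, i.e. the "21.5 GiB" the
                                     # scheduler estimated for the Codestral
                                     # runner earlier in the log
```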

At this point, there will be another call to /api/generate, and the model will unload:

time=2024-08-06T12:30:26.269Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[499.0 MiB]"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:168 msg="gpu has too little memory to allocate any layers" gpu="{memInfo:{TotalMemory:23570219008 FreeMemory:523270144 FreeSwap:0} Library:cuda Variant:no vector extensions MinimumMemory:479199232 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:GPU-635e6271-5712-7287-dd2d-29daacc179fb Name:NVIDIA L4 Compute:8.9 DriverMajor:12 DriverMinor:4}"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:296 msg="insufficient VRAM to load any model layers"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[499.0 MiB]"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:168 msg="gpu has too little memory to allocate any layers" gpu="{memInfo:{TotalMemory:23570219008 FreeMemory:523270144 FreeSwap:0} Library:cuda Variant:no vector extensions MinimumMemory:479199232 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:GPU-635e6271-5712-7287-dd2d-29daacc179fb Name:NVIDIA L4 Compute:8.9 DriverMajor:12 DriverMinor:4}"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:296 msg="insufficient VRAM to load any model layers"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[499.0 MiB]"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:168 msg="gpu has too little memory to allocate any layers" gpu="{memInfo:{TotalMemory:23570219008 FreeMemory:523270144 FreeSwap:0} Library:cuda Variant:no vector extensions MinimumMemory:479199232 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:GPU-635e6271-5712-7287-dd2d-29daacc179fb Name:NVIDIA L4 Compute:8.9 DriverMajor:12 DriverMinor:4}"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:296 msg="insufficient VRAM to load any model layers"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[499.0 MiB]"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:168 msg="gpu has too little memory to allocate any layers" gpu="{memInfo:{TotalMemory:23570219008 FreeMemory:523270144 FreeSwap:0} Library:cuda Variant:no vector extensions MinimumMemory:479199232 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false ID:GPU-635e6271-5712-7287-dd2d-29daacc179fb Name:NVIDIA L4 Compute:8.9 DriverMajor:12 DriverMinor:4}"
time=2024-08-06T12:30:26.271Z level=DEBUG source=memory.go:296 msg="insufficient VRAM to load any model layers"
time=2024-08-06T12:30:26.271Z level=DEBUG source=sched.go:782 msg="no idle runners, picking the shortest duration" count=1
time=2024-08-06T12:30:26.271Z level=DEBUG source=sched.go:278 msg="resetting model to expire immediately to make room" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e refCount=1
time=2024-08-06T12:30:26.271Z level=DEBUG source=sched.go:291 msg="waiting for pending requests to complete and unload to occur" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
DEBUG [print_timings] prompt eval time     =      92.96 ms /   374 tokens (    0.25 ms per token,  4023.32 tokens per second) | n_prompt_tokens_processed=374 n_tokens_second=4023.322360635986 slot_id=0 t_prompt_processing=92.958 t_token=0.24855080213903744 task_id=17 tid="127585882058752" timestamp=1722947428
DEBUG [print_timings] generation eval time =    3180.56 ms /    35 runs   (   90.87 ms per token,    11.00 tokens per second) | n_decoded=35 n_tokens_second=11.00434105534146 slot_id=0 t_token=90.87322857142857 t_token_generation=3180.563 task_id=17 tid="127585882058752" timestamp=1722947428
DEBUG [print_timings]           total time =    3273.52 ms | slot_id=0 t_prompt_processing=92.958 t_token_generation=3180.563 t_total=3273.521 task_id=17 tid="127585882058752" timestamp=1722947428
DEBUG [update_slots] slot released | n_cache_tokens=409 n_ctx=32768 n_past=408 n_system_tokens=0 slot_id=0 task_id=17 tid="127585882058752" timestamp=1722947428 truncated=false
DEBUG [log_server_request] request | method="POST" params={} path="/completion" remote_addr="127.0.0.1" remote_port=60186 status=200 tid="127585384108032" timestamp=1722947428
[GIN] 2024/08/06 - 12:30:28 | 200 |  3.690068188s |   145.239.103.1 | POST     "/api/generate"
time=2024-08-06T12:30:28.623Z level=DEBUG source=sched.go:403 msg="context for request finished"
time=2024-08-06T12:30:28.623Z level=DEBUG source=sched.go:327 msg="runner with zero duration has gone idle, expiring to unload" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:30:28.623Z level=DEBUG source=sched.go:352 msg="after processing request finished event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e refCount=0
time=2024-08-06T12:30:28.623Z level=DEBUG source=sched.go:355 msg="runner expired event received" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:30:28.623Z level=DEBUG source=sched.go:371 msg="got lock to unload" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:30:28.623Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="47.0 GiB" before.free="45.3 GiB" before.free_swap="0 B" now.total="47.0 GiB" now.free="45.3 GiB" now.free_swap="0 B"
CUDA driver version: 12.4
time=2024-08-06T12:30:28.798Z level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb name="NVIDIA L4" overhead="0 B" before.total="22.0 GiB" before.free="3.8 GiB" now.total="22.0 GiB" now.free="3.8 GiB" now.used="18.2 GiB"
releasing cuda driver library
time=2024-08-06T12:30:28.798Z level=DEBUG source=server.go:1042 msg="stopping llama server"
time=2024-08-06T12:30:28.798Z level=DEBUG source=server.go:1048 msg="waiting for llama server to exit"
time=2024-08-06T12:30:29.049Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="47.0 GiB" before.free="45.3 GiB" before.free_swap="0 B" now.total="47.0 GiB" now.free="45.5 GiB" now.free_swap="0 B"
CUDA driver version: 12.4
time=2024-08-06T12:30:29.164Z level=DEBUG source=server.go:1052 msg="llama server stopped"
time=2024-08-06T12:30:29.164Z level=DEBUG source=sched.go:376 msg="runner released" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:30:29.425Z level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb name="NVIDIA L4" overhead="0 B" before.total="22.0 GiB" before.free="3.8 GiB" now.total="22.0 GiB" now.free="21.8 GiB" now.used="192.2 MiB"
releasing cuda driver library
time=2024-08-06T12:30:29.425Z level=DEBUG source=sched.go:655 msg="gpu VRAM free memory converged after 0.80 seconds" model=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:30:29.425Z level=DEBUG source=sched.go:380 msg="sending an unloaded event" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:30:29.425Z level=DEBUG source=sched.go:297 msg="unload completed" modelPath=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e
time=2024-08-06T12:30:29.425Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="47.0 GiB" before.free="45.5 GiB" before.free_swap="0 B" now.total="47.0 GiB" now.free="46.0 GiB" now.free_swap="0 B"
CUDA driver version: 12.4
time=2024-08-06T12:30:29.621Z level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb name="NVIDIA L4" overhead="0 B" before.total="22.0 GiB" before.free="21.8 GiB" now.total="22.0 GiB" now.free="21.8 GiB" now.used="192.2 MiB"
releasing cuda driver library
time=2024-08-06T12:30:29.626Z level=DEBUG source=sched.go:219 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6
time=2024-08-06T12:30:29.626Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2024-08-06T12:30:29.626Z level=INFO source=sched.go:710 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb parallel=4 available=23368695808 required="1.0 GiB"
time=2024-08-06T12:30:29.626Z level=DEBUG source=server.go:100 msg="system memory" total="47.0 GiB" free="46.0 GiB" free_swap="0 B"
time=2024-08-06T12:30:29.626Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]"
time=2024-08-06T12:30:29.627Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[21.8 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB" memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB" memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server
time=2024-08-06T12:30:29.627Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 13 --verbose --flash-attn --parallel 4 --port 43015"
time=2024-08-06T12:30:29.627Z level=DEBUG source=server.go:401 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin LD_LIBRARY_PATH=/tmp/ollama2871822748/runners/cuda_v11:/tmp/ollama2871822748/runners CUDA_VISIBLE_DEVICES=GPU-635e6271-5712-7287-dd2d-29daacc179fb]"
time=2024-08-06T12:30:29.628Z level=INFO source=sched.go:445 msg="loaded runners" count=1
time=2024-08-06T12:30:29.628Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding"
time=2024-08-06T12:30:29.628Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error"
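
(If it helps, the set of loaded runners after this unload/reload can be checked with `ollama ps` or the `/api/ps` endpoint; a minimal sketch, assuming the same host/port as above:)

```
# Minimal sketch: list the models the server currently has loaded.
# Host/port are assumptions matching the server config earlier in this comment.
import requests

resp = requests.get("http://localhost:9000/api/ps")
resp.raise_for_status()
for m in resp.json().get("models", []):
    print(m["name"], m.get("size_vram"))
```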

<!-- gh-comment-id:2271218678 --> @MHugonKaliop commented on GitHub (Aug 6, 2024): Here we go with the logs with ollama_debug=1 Initialisation phase : ``` 2024/08/06 12:29:23 routes.go:1108: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:true OLLAMA_HOST:http://0.0.0.0:9000 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:2562047h47m16.854775807s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/usr/share/ollama/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]" time=2024-08-06T12:29:23.809Z level=INFO source=images.go:781 msg="total blobs: 9" time=2024-08-06T12:29:23.809Z level=INFO source=images.go:788 msg="total unused blobs removed: 0" time=2024-08-06T12:29:23.809Z level=INFO source=routes.go:1155 msg="Listening on [::]:9000 (version 0.3.3)" time=2024-08-06T12:29:23.809Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama2871822748/runners time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/x86_64/cpu/bin/ollama_llama_server.gz time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx file=build/linux/x86_64/cpu_avx/bin/ollama_llama_server.gz time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu_avx2 file=build/linux/x86_64/cpu_avx2/bin/ollama_llama_server.gz time=2024-08-06T12:29:23.809Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublas.so.11.gz time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcublasLt.so.11.gz time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/libcudart.so.11.0.gz time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/x86_64/cuda_v11/bin/ollama_llama_server.gz time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/deps.txt.gz time=2024-08-06T12:29:23.810Z level=DEBUG source=payload.go:182 msg=extracting variant=rocm_v60102 file=build/linux/x86_64/rocm_v60102/bin/ollama_llama_server.gz time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:71 msg="availableServers : found" 
file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server time=2024-08-06T12:29:28.626Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 cuda_v11 rocm_v60102 cpu]" time=2024-08-06T12:29:28.626Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY" time=2024-08-06T12:29:28.626Z level=DEBUG source=sched.go:105 msg="starting llm scheduler" time=2024-08-06T12:29:28.626Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs" time=2024-08-06T12:29:28.627Z level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA" time=2024-08-06T12:29:28.627Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so* time=2024-08-06T12:29:28.627Z level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[/home/mhugon_kaliop_com/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]" time=2024-08-06T12:29:28.628Z level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07] CUDA driver version: 12.4 time=2024-08-06T12:29:28.738Z level=DEBUG source=gpu.go:124 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07 [GPU-635e6271-5712-7287-dd2d-29daacc179fb] CUDA totalMem 22478 mb [GPU-635e6271-5712-7287-dd2d-29daacc179fb] CUDA freeMem 22286 mb [GPU-635e6271-5712-7287-dd2d-29daacc179fb] Compute Capability 8.9 time=2024-08-06T12:29:28.913Z level=DEBUG source=amd_linux.go:371 msg="amdgpu driver not detected /sys/module/amdgpu" releasing cuda driver library time=2024-08-06T12:29:28.913Z level=INFO source=types.go:105 msg="inference compute" id=GPU-635e6271-5712-7287-dd2d-29daacc179fb library=cuda compute=8.9 driver=12.4 name="NVIDIA L4" total="22.0 GiB" available="21.8 GiB" time=2024-08-06T12:29:47.990Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="47.0 GiB" before.free="46.1 GiB" before.free_swap="0 B" now.total="47.0 GiB" now.free="46.0 GiB" now.free_swap="0 B" CUDA driver version: 12.4 time=2024-08-06T12:29:48.161Z level=DEBUG source=gpu.go:406 msg="updating cuda memory data" gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb name="NVIDIA L4" overhead="0 B" before.total="22.0 GiB" before.free="21.8 GiB" now.total="22.0 GiB" now.free="21.8 GiB" now.used="192.2 MiB" releasing cuda driver library time=2024-08-06T12:29:48.167Z level=DEBUG source=sched.go:219 msg="loading first model" model=/usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e time=2024-08-06T12:29:48.167Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]" time=2024-08-06T12:29:48.168Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]" time=2024-08-06T12:29:48.169Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]" time=2024-08-06T12:29:48.169Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 available="[21.8 GiB]" time=2024-08-06T12:29:48.170Z level=DEBUG source=server.go:100 msg="system memory" total="47.0 GiB" free="46.0 GiB" free_swap="0 B" time=2024-08-06T12:29:48.170Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1 
available="[21.8 GiB]" time=2024-08-06T12:29:48.171Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=57 layers.offload=53 layers.split="" memory.available="[21.8 GiB]" memory.required.full="22.6 GiB" memory.required.partial="21.5 GiB" memory.required.kv="7.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="18.4 GiB" memory.weights.repeating="18.3 GiB" memory.weights.nonrepeating="157.5 MiB" memory.graph.full="3.1 GiB" memory.graph.partial="3.2 GiB" time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu/ollama_llama_server time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx/ollama_llama_server time=2024-08-06T12:29:48.171Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cpu_avx2/ollama_llama_server time=2024-08-06T12:29:48.172Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server time=2024-08-06T12:29:48.172Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama2871822748/runners/rocm_v60102/ollama_llama_server time=2024-08-06T12:29:48.172Z level=INFO source=server.go:384 msg="starting llama server" cmd="/tmp/ollama2871822748/runners/cuda_v11/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e --ctx-size 32768 --batch-size 512 --embedding --log-disable --n-gpu-layers 53 --verbose --flash-attn --parallel 1 --port 39197" time=2024-08-06T12:29:48.172Z level=DEBUG source=server.go:401 msg=subprocess environment="[PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin LD_LIBRARY_PATH=/tmp/ollama2871822748/runners/cuda_v11:/tmp/ollama2871822748/runners CUDA_VISIBLE_DEVICES=GPU-635e6271-5712-7287-dd2d-29daacc179fb]" time=2024-08-06T12:29:48.173Z level=INFO source=sched.go:445 msg="loaded runners" count=1 time=2024-08-06T12:29:48.173Z level=INFO source=server.go:584 msg="waiting for llama runner to start responding" time=2024-08-06T12:29:48.173Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server error" INFO [main] build info | build=1 commit="6eeaeba" tid="127585882058752" timestamp=1722947388 INFO [main] system info | n_threads=6 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
LLAMAFILE = 1 | " tid="127585882058752" timestamp=1722947388 total_threads=12 INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="39197" tid="127585882058752" timestamp=1722947388 ``` Then comes a first call (an ambedding), Codestral loads ``` INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="11" port="39197" tid="127585882058752" timestamp=1722947388 llama_model_loader: loaded meta data with 25 key-value pairs and 507 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-22a849aafe3ded20e9b6551b02684d8fa911537c35895dd2a1bf9eb70da8f69e (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = Codestral-22B-v0.1 llama_model_loader: - kv 2: llama.block_count u32 = 56 llama_model_loader: - kv 3: llama.context_length u32 = 32768 llama_model_loader: - kv 4: llama.embedding_length u32 = 6144 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 16384 llama_model_loader: - kv 6: llama.attention.head_count u32 = 48 llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 8: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: general.file_type u32 = 2 llama_model_loader: - kv 11: llama.vocab_size u32 = 32768 llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 13: tokenizer.ggml.add_space_prefix bool = true llama_model_loader: - kv 14: tokenizer.ggml.model str = llama llama_model_loader: - kv 15: tokenizer.ggml.pre str = default llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,32768] = ["<unk>", "<s>", "</s>", "[INST]", "[... llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,32768] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,32768] = [2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ... 
llama_model_loader: - kv 19: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 21: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 22: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 23: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 24: general.quantization_version u32 = 2 llama_model_loader: - type f32: 113 tensors llama_model_loader: - type q4_0: 393 tensors llama_model_loader: - type q6_K: 1 tensors llm_load_vocab: special tokens cache size = 771 llm_load_vocab: token to piece cache size = 0.1731 MB llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32768 llm_load_print_meta: n_merges = 0 llm_load_print_meta: vocab_only = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 6144 llm_load_print_meta: n_layer = 56 llm_load_print_meta: n_head = 48 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_swa = 0 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 6 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 16384 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 1 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_ctx_orig_yarn = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = ?B llm_load_print_meta: model ftype = Q4_0 llm_load_print_meta: model params = 22.25 B llm_load_print_meta: model size = 11.71 GiB (4.52 BPW) llm_load_print_meta: general.name = Codestral-22B-v0.1 llm_load_print_meta: BOS token = 1 '<s>' llm_load_print_meta: EOS token = 2 '</s>' llm_load_print_meta: UNK token = 0 '<unk>' llm_load_print_meta: LF token = 781 '<0x0A>' llm_load_print_meta: max token length = 48 ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA L4, compute capability 8.9, VMM: yes time=2024-08-06T12:29:48.425Z level=INFO source=server.go:618 msg="waiting for server to become available" status="llm server loading model" llm_load_tensors: ggml ctx size = 0.47 MiB llm_load_tensors: offloading 53 repeating layers to GPU llm_load_tensors: offloaded 53/57 layers to GPU llm_load_tensors: CPU buffer size = 11986.15 MiB llm_load_tensors: CUDA0 buffer size = 11092.73 MiB time=2024-08-06T12:29:49.178Z level=DEBUG source=server.go:629 msg="model load progress 0.10" time=2024-08-06T12:29:49.429Z level=DEBUG source=server.go:629 msg="model load progress 0.20" time=2024-08-06T12:29:49.680Z level=DEBUG source=server.go:629 msg="model load progress 0.30" time=2024-08-06T12:29:49.931Z level=DEBUG source=server.go:629 msg="model load progress 0.39" time=2024-08-06T12:29:50.181Z level=DEBUG 
Author
Owner

@MHugonKaliop commented on GitHub (Aug 6, 2024):

I have another server with ollama v0.3.0, and there codestral:latest only asks for 15 GB.
On that server, I have room left for llama3:latest:

OLLAMA_HOST=0.0.0.0:9000 ollama ps
NAME            	ID          	SIZE  	PROCESSOR	UNTIL   
codestral:latest	fcc0019dcee9	15 GB 	100% GPU 	Forever	
llama3:latest   	a6990ed6be41	6.7 GB	100% GPU 	Forever	
Author
Owner

@rick-github commented on GitHub (Aug 6, 2024):

Your GPU has 21.8 GiB free:

time=2024-08-06T12:29:48.161Z level=DEBUG source=gpu.go:406 msg="updating cuda memory data"
  gpu=GPU-635e6271-5712-7287-dd2d-29daacc179fb name="NVIDIA L4" overhead="0 B"
  before.total="22.0 GiB" before.free="21.8 GiB"
  now.total="22.0 GiB" now.free="21.8 GiB" now.used="192.2 MiB"

The model requires 18.4 GiB for the weights, and because you have a large context window (32768), another 7 GiB is needed for the KV cache. Roughly 25 GiB won't fit into the available space, so ollama loads as much as it can into VRAM, 21.5 GiB. This leaves less than 1 GiB free.

time=2024-08-06T12:29:48.171Z level=INFO source=memory.go:309 msg="offload to cuda" 
  layers.requested=-1 layers.model=57 layers.offload=53 layers.split="" 
  memory.available="[21.8 GiB]" memory.required.full="22.6 GiB" memory.required.partial="21.5 GiB"
  memory.required.kv="7.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="18.4 GiB"
  memory.weights.repeating="18.3 GiB" memory.weights.nonrepeating="157.5 MiB"
  memory.graph.full="3.1 GiB" memory.graph.partial="3.2 GiB"

Now you make the embed call. That model requires about 1 GiB for weights, KV cache, and the compute graph:

time=2024-08-06T12:30:29.627Z level=INFO source=memory.go:309 msg="offload to cuda"
  layers.requested=-1 layers.model=13 layers.offload=13 layers.split=""
  memory.available="[21.8 GiB]" memory.required.full="1.0 GiB" memory.required.partial="1.0 GiB"
  memory.required.kv="96.0 MiB" memory.required.allocations="[1.0 GiB]" memory.weights.total="312.1 MiB"
  memory.weights.repeating="267.4 MiB" memory.weights.nonrepeating="44.7 MiB"
  memory.graph.full="192.0 MiB" memory.graph.partial="192.0 MiB"

ollama can't fit both models:

time=2024-08-06T12:30:26.269Z level=DEBUG source=memory.go:101 msg=evaluating library=cuda gpu_count=1
  available="[499.0 MiB]"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:168 msg="gpu has too little memory to allocate any layers"
  gpu="{memInfo:{TotalMemory:23570219008 FreeMemory:523270144 FreeSwap:0} Library:cuda Variant:no vector
  extensions MinimumMemory:479199232 DependencyPath: EnvWorkarounds:[] UnreliableFreeMemory:false
  ID:GPU-635e6271-5712-7287-dd2d-29daacc179fb Name:NVIDIA L4 Compute:8.9 DriverMajor:12 DriverMinor:4}"
time=2024-08-06T12:30:26.270Z level=DEBUG source=memory.go:296 msg="insufficient VRAM to load any model layers"

So it unloads one model in order to load another. When an API call is made to the previous model, it unloads the current one and reloads the old one.

There are several ways to mitigate this.

  1. More hardware
  2. Set the `num_gpu` parameter for the codestral model to some number less than the number of layers that ollama computes it can offload. This leaves room for loading the smaller embedding model, at the cost of slightly slower inference by codestral.
  3. Set the `num_gpu` parameter for the embedding model to zero. This will cause the model to run only in system RAM using the CPU, at the cost of slower embeddings.

The trade-off between 2 and 3 comes down to where you think most of the inference time is spent. If you are doing a lot of embeddings, then having codestral run a little slower may be acceptable. On the other hand, if embedding calls are infrequent, you want to let codestral have all the VRAM and let the embedding model make do with the CPU.

There are two ways to adjust the number of layers offloaded. You can either add `"options":{"num_gpu":50}` to the API calls, or you can set a default value in the model itself:

$ ollama show --modelfile codestral > Modelfile
# edit the file and add "PARAMETER num_gpu 50", and modify the FROM line to use the model name, not path
$ ollama create codestral-n50 -f Modelfile

The same holds for setting the `num_gpu` value for the embedding model.
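For the per-request route, here is a rough sketch (the port, prompts, layer count, and the embedding model name are illustrative placeholders, not taken from this thread):

```sh
# Keep codestral partially on GPU to leave VRAM headroom (50 layers is illustrative).
curl http://localhost:11434/api/generate -d '{
  "model": "codestral",
  "prompt": "def load_everything():",
  "options": { "num_gpu": 50 }
}'

# Keep the embedding model off the GPU entirely so it never competes for VRAM.
curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "hello world",
  "options": { "num_gpu": 0 }
}'
```

If the headroom works out, `ollama ps` should then show both models loaded side by side instead of swapping on every request.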

In the case where the other server only takes 15 GB for codestral, I suspect the context window there is smaller than on the original server.

It can be argued that the model scheduler could be smarter, and the ollama team have mentioned this. But because the scheduler can't know the workload in advance, it will sometimes be wrong, and manual overrides like the above are sometimes necessary.

Author
Owner

@MHugonKaliop commented on GitHub (Aug 6, 2024):

Very interesting, thank you so much for the time you spent on this detailed explanation!
I wasn't aware of how much memory the context window needs!

Makes perfect sense now

Author
Owner

@trevorboydsmith commented on GitHub (Jan 2, 2025):

Just an FYI: in my case I was seeing my ollama v0.4.1 server log show the model unloading and reloading a lot. I am doing CPU-only processing with llama3.2:3b. I saw that embed calls are involved here, and so am I; I also call /api/generate with a context size of 32*1000, but my /api/embed call did not set the context option. So I changed my /api/embed call to use the same 32*1000 context size as /api/generate, and now the log shows that the model no longer unloads and reloads --> all fixed.

I did see this line in the server log:

time=2025-01-02T21:18:32.366Z level=INFO source=memory.go:343 msg="offload to cpu" layers.requested=-1 layers.model=29 layers.offload=0 layers.split="" memory.available="[22.6 GiB]" memory.gpu_overhead="0 B" memory.required.full="22.7 GiB" memory.required.partial="0 B" memory.required.kv="13.7 GiB" memory.required.allocations="[22.4 GiB]" memory.weights.total="15.2 GiB" memory.weights.repeating="14.9 GiB" memory.weights.nonrepeating="308.2 MiB" memory.graph.full="6.1 GiB" memory.graph.partial="6.6 GiB"

So this log message looks like (I'm no expert at reading these logs) it is saying there is not enough memory, hence the constant unload/reload between /api/generate with num_ctx=32000 and /api/embed.
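For reference, a minimal sketch of the calls after the change (the port and prompt text are illustrative); the only point is that both endpoints pass the same num_ctx so the scheduler can keep the same runner:

```sh
# Both calls request the same context size, so the model is not
# unloaded and reloaded between the generate and embed requests.
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2:3b",
  "prompt": "Summarize the document.",
  "options": { "num_ctx": 32000 }
}'

curl http://localhost:11434/api/embed -d '{
  "model": "llama3.2:3b",
  "input": "Some text to embed",
  "options": { "num_ctx": 32000 }
}'
```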

Author
Owner

@rick-github commented on GitHub (Jan 3, 2025):

You are correct in your analysis that the missing num_ctx causes the unload/reload. This will be fixed if #8029 is merged.

Author
Owner

@ghost commented on GitHub (Feb 7, 2025):

https://github.com/ollama/ollama/blob/a400df48c06f6526ed74bfa3fd1af783ed0b4899/server/sched.go#L147
https://github.com/ollama/ollama/blob/a400df48c06f6526ed74bfa3fd1af783ed0b4899/server/sched.go#L603

It seems there is a deliberate design in the scheduling logic: if the parameters of two requests are inconsistent, it triggers a model reload. I am currently encountering frequent restart issues, and the scenario I can reproduce is: sending two requests with inconsistent parameters to the same model will always cause the model to reload. I don't know what the original intention of this design is; it's a bit difficult to understand.
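A rough repro sketch of what I mean (the model name, port, and values are illustrative): two back-to-back requests to the same model that differ only in an option such as num_ctx force a reload between them.

```sh
# First request loads the model with a 2048-token context.
curl http://localhost:11434/api/generate -d '{
  "model": "codestral",
  "prompt": "hello",
  "options": { "num_ctx": 2048 }
}'

# The second request asks for a different context size, so the scheduler
# tears down the running llama server and starts a new one before answering.
curl http://localhost:11434/api/generate -d '{
  "model": "codestral",
  "prompt": "hello again",
  "options": { "num_ctx": 8192 }
}'
```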

@rick-github @abes200


Reference: github-starred/ollama#3839