[GH-ISSUE #6616] A100 shared GPU - Server not responding (always after some time where it works) #66203

Closed
opened 2026-05-04 00:42:43 -05:00 by GiteaMirror · 21 comments
Owner

Originally created by @Ida-Ida on GitHub (Sep 3, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6616

Originally assigned to: @dhiltgen on GitHub.

Hi,
when running ollama, it hangs after a few calls to "generate".
It shows no error; it just hangs for hours until it is killed manually.
Stopping and then restarting ollama does not resolve the issue; only after a restart of the VM does it work again for a short time (about 20 generate calls).
Reinstalling ollama did not help either.

Tested with llava:7b model.

My setup:

  • Ubuntu 24.04.1, as a VM on a server with an A100 GPU
  • 32 GB RAM

(Also important to note: on my desktop ubuntu PC this issue does not appear, just on the VM)

![image](https://github.com/user-attachments/assets/17d6326e-a8f8-4cbf-93fc-c63e3181dd8f)

A section from the journalctl log:

```
Sep 03 20:59:07 ada-gym ollama[1623]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 03 20:59:07 ada-gym ollama[1623]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 03 20:59:07 ada-gym ollama[1623]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 03 20:59:07 ada-gym ollama[1623]: llm_load_tensors:        CPU buffer size =    70.31 MiB
Sep 03 20:59:07 ada-gym ollama[1623]: llm_load_tensors:      CUDA0 buffer size =  3847.55 MiB
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: n_ctx      = 2048
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: n_batch    = 512
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: n_ubatch   = 512
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: flash_attn = 0
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: freq_base  = 1000000.0
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: freq_scale = 1
Sep 03 20:59:43 ada-gym ollama[1623]: llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: graph nodes  = 1030
Sep 03 20:59:43 ada-gym ollama[1623]: llama_new_context_with_model: graph splits = 2
Sep 03 21:01:19 ada-gym ollama[2737]: INFO [main] model loaded | tid="126876040495104" timestamp=1725397279
Sep 03 21:01:19 ada-gym ollama[1623]: time=2024-09-03T21:01:19.371Z level=INFO source=server.go:630 msg="llama runner started in 178.41 seconds"
Sep 03 21:02:59 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:02:59 | 500 |         4m39s |       127.0.0.1 | POST     "/api/generate"
Sep 03 21:03:46 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:46 | 200 |  684.665122ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:47 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:47 | 200 |  616.762244ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:47 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:47 | 200 |  551.942963ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:48 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:48 | 200 |  561.448248ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:49 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:49 | 200 |  870.515355ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:50 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:50 | 200 |  1.383561874s |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:51 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:51 | 200 |   555.17719ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:51 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:51 | 200 |  590.355197ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:52 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:52 | 200 |  582.477293ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:53 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:53 | 200 |  641.736396ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:53 ada-gym ollama[1623]: [GIN] 2024/09/03 - 21:03:53 | 200 |  604.034325ms |       127.0.0.1 | POST     "/api/pull"
Sep 03 21:03:53 ada-gym ollama[1623]: time=2024-09-03T21:03:53.908Z level=WARN source=types.go:509 msg="invalid option provided" option=keep_alive
Sep 03 21:03:53 ada-gym ollama[1623]: time=2024-09-03T21:03:53.908Z level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
Sep 03 21:04:04 ada-gym ollama[1623]: time=2024-09-03T21:04:04.542Z level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 gpu=GPU-3dda594e-2d7a-11ef-8ccb-044f68c14296 parallel=1 available=19272630272 required="5.3 GiB"
Sep 03 21:04:04 ada-gym ollama[1623]: time=2024-09-03T21:04:04.543Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[17.9 GiB]" memory.required.full="5.3 GiB" memory.required.partial="5.3 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[5.3 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.8 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
Sep 03 21:04:04 ada-gym ollama[1623]: time=2024-09-03T21:04:04.544Z level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama1525758512/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --mmproj /usr/share/ollama/.ollama/models/blobs/sha256-72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539 --parallel 1 --port 34627"
Sep 03 21:04:04 ada-gym ollama[1623]: time=2024-09-03T21:04:04.544Z level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 03 21:04:04 ada-gym ollama[1623]: time=2024-09-03T21:04:04.544Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 03 21:04:04 ada-gym ollama[1623]: time=2024-09-03T21:04:04.544Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 03 21:04:04 ada-gym ollama[2828]: INFO [main] build info | build=1 commit="1e6f655" tid="135144095866880" timestamp=1725397444
Sep 03 21:04:04 ada-gym ollama[2828]: INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="135144095866880" timestamp=1725397444 total_threads=16
Sep 03 21:04:04 ada-gym ollama[2828]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="34627" tid="135144095866880" timestamp=1725397444
Sep 03 21:04:04 ada-gym ollama[1623]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 03 21:04:04 ada-gym ollama[1623]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 03 21:04:04 ada-gym ollama[1623]: ggml_cuda_init: found 1 CUDA devices:
Sep 03 21:04:04 ada-gym ollama[1623]:   Device 0: GRID A100-20C, compute capability 8.0, VMM: no
Sep 03 21:04:04 ada-gym ollama[1623]: time=2024-09-03T21:04:04.796Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 21:04:50 ada-gym ollama[1623]: llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 (version GGUF V3 (latest))
Sep 03 21:04:50 ada-gym ollama[1623]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 03 21:04:50 ada-gym ollama[1623]: llama_model_loader: - kv   0:                       general.architecture str              = llama
```

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.9

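For reference, a minimal loop of direct generate calls of the kind that triggers the slowdown after a few dozen iterations (a sketch only; model and prompt are placeholders):

```
# Call /api/generate repeatedly and print how long each call takes
for i in $(seq 1 50); do
  t=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:11434/api/generate \
        -d '{"model": "llava:7b", "prompt": "Say hello.", "stream": false}')
  echo "call $i: ${t}s"
done
```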
GiteaMirror added the performance, bug, nvidia labels 2026-05-04 00:42:45 -05:00
Author
Owner

@rick-github commented on GitHub (Sep 3, 2024):

Could you add OLLAMA_DEBUG=1 to the server environment, and then post logs that shows the calls to /api/generate? What client are you using to make the calls to /api/generate?

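For a systemd-managed install (the default from the install script), a minimal sketch of enabling debug logging; the drop-in file name debug.conf is arbitrary:

```
# Create a systemd drop-in that sets OLLAMA_DEBUG=1 for the ollama service
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ollama.service.d/debug.conf
[Service]
Environment="OLLAMA_DEBUG=1"
EOF

# Reload systemd, restart the server, then follow the logs while calling /api/generate
sudo systemctl daemon-reload
sudo systemctl restart ollama
journalctl -u ollama -f
```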
Author
Owner

@Ida-Ida commented on GitHub (Sep 3, 2024):

@rick-github Thanks for the fast reply!
I use the ollama python API.
Here is an example call I made just now. I noticed it did not hang completely yet, but it was already pretty slow. In the cases where it did seem to hang completely, the response was probably much longer, or the slowdown builds up over time.

![image](https://github.com/user-attachments/assets/42573d3e-998a-44db-9a2d-419453a9dd80)

The corresponding journalctl log to this call is:

```
Sep 03 22:04:39 ada-gym ollama[1623]: time=2024-09-03T22:04:39.836Z level=WARN source=sched.go:137 msg="multimodal models don't support parallel requests yet"
Sep 03 22:04:39 ada-gym ollama[1623]: time=2024-09-03T22:04:39.971Z level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=/usr/share/ollama/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 gpu=GPU-3dda594e-2d7a-11ef-8ccb-044f68c14296 parallel=1 available=19272630272 required="5.3 GiB"
Sep 03 22:04:39 ada-gym ollama[1623]: time=2024-09-03T22:04:39.973Z level=INFO source=memory.go:309 msg="offload to cuda" layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[17.9 GiB]" memory.required.full="5.3 GiB" memory.required.partial="5.3 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[5.3 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.8 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
Sep 03 22:04:39 ada-gym ollama[1623]: time=2024-09-03T22:04:39.974Z level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama1525758512/runners/cuda_v12/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 --ctx-size 2048 --batch-size 512 --embedding --log-disable --n-gpu-layers 33 --mmproj /usr/share/ollama/.ollama/models/blobs/sha256-72d6f08a42f656d36b356dbe0920675899a99ce21192fd66266fb7d82ed07539 --parallel 1 --port 34157"
Sep 03 22:04:39 ada-gym ollama[1623]: time=2024-09-03T22:04:39.974Z level=INFO source=sched.go:450 msg="loaded runners" count=1
Sep 03 22:04:39 ada-gym ollama[1623]: time=2024-09-03T22:04:39.974Z level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
Sep 03 22:04:39 ada-gym ollama[1623]: time=2024-09-03T22:04:39.974Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
Sep 03 22:04:40 ada-gym ollama[3613]: INFO [main] build info | build=1 commit="1e6f655" tid="123554969632768" timestamp=1725401080
Sep 03 22:04:40 ada-gym ollama[3613]: INFO [main] system info | n_threads=16 n_threads_batch=-1 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123554969632768" timestamp=1725401080 total_threads=16
Sep 03 22:04:40 ada-gym ollama[3613]: INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="34157" tid="123554969632768" timestamp=1725401080
Sep 03 22:04:40 ada-gym ollama[1623]: ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
Sep 03 22:04:40 ada-gym ollama[1623]: ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
Sep 03 22:04:40 ada-gym ollama[1623]: ggml_cuda_init: found 1 CUDA devices:
Sep 03 22:04:40 ada-gym ollama[1623]:   Device 0: GRID A100-20C, compute capability 8.0, VMM: no
Sep 03 22:04:40 ada-gym ollama[1623]: time=2024-09-03T22:04:40.226Z level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from /usr/share/ollama/.ollama/models/blobs/sha256-170370233dd5c5415250a2ecd5c71586352850729062ccef1496385647293868 (version GGUF V3 (latest))
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   0:                       general.architecture str              = llama
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   1:                               general.name str              = liuhaotian
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   4:                          llama.block_count u32              = 32
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 1000000.000000
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  11:                          general.file_type u32              = 2
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 2
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32              = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {{ bos_token }}{% for message in mess...
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - kv  23:               general.quantization_version u32              = 2
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - type  f32:   65 tensors
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - type q4_0:  225 tensors
Sep 03 22:05:26 ada-gym ollama[1623]: llama_model_loader: - type q6_K:    1 tensors
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_vocab: special tokens cache size = 3
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_vocab: token to piece cache size = 0.1637 MB
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: format           = GGUF V3 (latest)
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: arch             = llama
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: vocab type       = SPM
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_vocab          = 32000
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_merges         = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: vocab_only       = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_ctx_train      = 32768
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_embd           = 4096
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_layer          = 32
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_head           = 32
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_head_kv        = 8
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_rot            = 128
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_swa            = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_embd_head_k    = 128
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_embd_head_v    = 128
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_gqa            = 4
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_embd_k_gqa     = 1024
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_embd_v_gqa     = 1024
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: f_norm_eps       = 0.0e+00
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: f_clamp_kqv      = 0.0e+00
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: f_logit_scale    = 0.0e+00
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_ff             = 14336
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_expert         = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_expert_used    = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: causal attn      = 1
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: pooling type     = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: rope type        = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: rope scaling     = linear
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: freq_base_train  = 1000000.0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: freq_scale_train = 1
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: n_ctx_orig_yarn  = 32768
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: rope_finetuned   = unknown
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: ssm_d_conv       = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: ssm_d_inner      = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: ssm_d_state      = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: ssm_dt_rank      = 0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: model type       = 7B
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: model ftype      = Q4_0
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: model params     = 7.24 B
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: model size       = 3.83 GiB (4.54 BPW)
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: general.name     = liuhaotian
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: BOS token        = 1 '<s>'
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: EOS token        = 2 '</s>'
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: UNK token        = 0 '<unk>'
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: PAD token        = 0 '<unk>'
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: LF token         = 13 '<0x0A>'
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_print_meta: max token length = 48
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_tensors: ggml ctx size =    0.27 MiB
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_tensors: offloading 32 repeating layers to GPU
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_tensors: offloading non-repeating layers to GPU
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_tensors: offloaded 33/33 layers to GPU
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_tensors:        CPU buffer size =    70.31 MiB
Sep 03 22:05:26 ada-gym ollama[1623]: llm_load_tensors:      CUDA0 buffer size =  3847.55 MiB
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: n_ctx      = 2048
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: n_batch    = 512
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: n_ubatch   = 512
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: flash_attn = 0
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: freq_base  = 1000000.0
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: freq_scale = 1
Sep 03 22:06:02 ada-gym ollama[1623]: llama_kv_cache_init:      CUDA0 KV buffer size =   256.00 MiB
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model:  CUDA_Host  output buffer size =     0.14 MiB
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model:      CUDA0 compute buffer size =   164.00 MiB
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model:  CUDA_Host compute buffer size =    12.01 MiB
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: graph nodes  = 1030
Sep 03 22:06:02 ada-gym ollama[1623]: llama_new_context_with_model: graph splits = 2
Sep 03 22:07:38 ada-gym ollama[3613]: INFO [main] model loaded | tid="123554969632768" timestamp=1725401258
Sep 03 22:07:38 ada-gym ollama[1623]: time=2024-09-03T22:07:38.240Z level=INFO source=server.go:630 msg="llama runner started in 178.27 seconds"
Sep 03 22:12:31 ada-gym ollama[1623]: [GIN] 2024/09/03 - 22:12:31 | 200 |         7m51s |       127.0.0.1 | POST     "/api/generate"
```
Author
Owner

@rick-github commented on GitHub (Sep 3, 2024):

I'm not familiar with the GRID A100-20C, but 178 seconds to load a 4G model, 90 seconds to process the prompt and 202 seconds to generate 19 tokens is very slow. Another user of an A100 with slow loading (https://github.com/ollama/ollama/issues/6425#issuecomment-2316002395) found that the following command helped (not sure the same will be true for you, since you are loading a much smaller model):

```
echo 0 > /proc/sys/kernel/numa_balancing
```
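For reference, a sketch of the equivalent sysctl commands (assumes root on the machine that owns the setting; on a vGPU guest this may need to happen on the host):

```
# Show the current value (1 = automatic NUMA balancing enabled, 0 = disabled)
cat /proc/sys/kernel/numa_balancing

# Same effect as the echo above, via sysctl
sudo sysctl -w kernel.numa_balancing=0

# Make the change persistent across reboots
echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/99-numa-balancing.conf
sudo sysctl --system
```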
Author
Owner

@Ida-Ida commented on GitHub (Sep 3, 2024):

After restarting the VM, the first 20-30 generate calls each take less than 2 seconds. Only then does it suddenly become very slow (and it stays that slow even after stopping and starting ollama with sudo systemctl stop ollama, as described above).

Unfortunately, using the echo 0 > .. command is not possible on the VM due to missing permissions. Since the GPU is shared across VMs on the server, that is perhaps for the better; I'm not sure whether this setting would affect the other VMs as well.

Author
Owner

@Ida-Ida commented on GitHub (Sep 3, 2024):

I'll try installing a previous ollama version. The linked issue hints that it worked fine some time ago. Since my setup on the VM is freshly installed, I did not try an older version yet.

Author
Owner

@Ida-Ida commented on GitHub (Sep 4, 2024):

Update after testing on ollama 0.3.0, with the llava:7b and also the llama3-8b model:

  • the issue is still there: at first the model works as expected, but after a few generate calls it gets very slow
  • however, the ollama client now throws an error: ollama._types.ResponseError: timed out waiting for llama runner to start - progress 1.00 -

Latest part of the log (ollama serve):

```
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: GRID A100-20C, compute capability 8.0, VMM: no
llm_load_tensors: ggml ctx size =    0.27 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =   281.81 MiB
llm_load_tensors:      CUDA0 buffer size =  4155.99 MiB
llama_new_context_with_model: n_ctx      = 8192
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.02 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   560.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    24.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 2
time=2024-09-04T00:12:56.039Z level=ERROR source=sched.go:443 msg="error loading llama server" error="timed out waiting for llama runner to start - progress 1.00 - "
[GIN] 2024/09/04 - 00:12:56 | 500 |         6m26s |       127.0.0.1 | POST     "/api/generate"
```
Author
Owner

@dhiltgen commented on GitHub (Sep 5, 2024):

@Ida-Ida are you able to monitor performance of the GPU across the shared VM instances? Is it possible some other workload is saturating the compute on the GPU when the Ollama performance slows down?

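Within a single guest, a rough sketch for watching GPU load while reproducing the slowdown (on a GRID/vGPU guest some counters may show N/A):

```
# One utilization/memory sample per second, CSV output
nvidia-smi --query-gpu=timestamp,utilization.gpu,utilization.memory,memory.used --format=csv -l 1

# Or the per-device monitoring view (u = utilization, m = framebuffer memory)
nvidia-smi dmon -s um -d 1
```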
Author
Owner

@Ida-Ida commented on GitHub (Sep 5, 2024):

@Ida-Ida are you able to monitor performance of the GPU across the shared VM instances? Is it possible some other workload is saturating the compute on the GPU when the Ollama performance slows down?

I'm sorry, I am only able to monitor the GPU workload on my own instance. However, restarting the VM always solves the problem for a short while, precisely until I call ollama generate a few times. E.g. if I restart and then wait one day doing nothing, ollama still works. But after about 20 - 200 generate calls, the performance drastically slows down -- so it does not behave as if another workload were the issue. It's totally weird.

I also wondered whether generating answers in a loop introduces a memory leak or some other issue, but that would not explain why the same environment and code work fine "on my machine" (which has a 4060 Ti). I also tried a minimal setup and ran into the same issue.

Author
Owner

@rick-github commented on GitHub (Sep 6, 2024):

What client are you using to call the API?

Author
Owner

@Ida-Ida commented on GitHub (Sep 7, 2024):

What client are you using to call the API?

I use the ollama package, installed via pip (this one: https://pypi.org/project/ollama/). It should be a wrapper for the REST API.
So I first installed ollama via curl -fsSL https://ollama.com/install.sh | sh, and then additionally installed the pip package.

However, after it hung, I also tried running ollama directly in the bash console, i.e. without the pip package. It also did not respond, or took much too long.

I hope this is what you were referring to in your question. Or did you mean another kind of client?

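To take the Python package out of the picture, the same endpoint can be timed directly from the shell (a sketch; model and prompt are placeholders):

```
# Time one non-streaming generate call against the REST API
curl -s -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
  http://localhost:11434/api/generate \
  -d '{"model": "llava:7b", "prompt": "Say hello in one sentence.", "stream": false}'

# Or via the CLI, which talks to the same server
time ollama run llava:7b "Say hello in one sentence."
```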
Author
Owner

@rick-github commented on GitHub (Sep 10, 2024):

What's the CPU load like when things go slow? If you run top, what's near the top of the list?

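A quick sketch for capturing that at the moment of the slowdown:

```
# Top CPU consumers right now
ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 10

# Or interactively, sorted by CPU usage
top -o %CPU
```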
Author
Owner

@Ida-Ida commented on GitHub (Sep 13, 2024):

Good news: The problem is not reproducible anymore. After more than 1k calls to the ollama.generate function, everything still works flawlessly.

What's changed:

  • updated to the new ollama version 3.10
  • reinstalled nvidia drivers and cuda-toolkit

The driver reinstallation was necessary due to some driver hiccup (not sure if this was due to the ollama update or something else). However, after doing the update + reinstallation, everything seems to work now. Thank you for helping me out!

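For completeness, a heavily hedged sketch of refreshing the driver and toolkit on Ubuntu; on a GRID/vGPU guest the guest driver normally comes from the hypervisor vendor's vGPU software rather than the Ubuntu archive, so the exact steps depend on that setup:

```
# Let Ubuntu pick and (re)install the recommended NVIDIA driver packages
sudo ubuntu-drivers install

# Reinstall the distro CUDA toolkit (only if it was installed from apt in the first place)
sudo apt install --reinstall nvidia-cuda-toolkit

# Reboot so the new kernel modules are loaded
sudo reboot
```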
Author
Owner

@koasi commented on GitHub (Dec 20, 2024):

@Ida-Ida do you have details about the "new ollama version 3.10"? I cannot find any version 3.10.
Currently I'm using 0.54.
I am also facing this problem; could you provide the nvidia driver version and cuda-toolkit version for reference?

@rick-github commented on GitHub (Dec 20, 2024):

There is no version 3.10, Ida-Ida probably meant 0.3.10, in the same way you meant 0.5.4 instead of 0.54. What's the output of `nvidia-smi -q`?

@koasi commented on GitHub (Dec 21, 2024):

==============NVSMI LOG==============

Timestamp : Sat Dec 21 11:52:41 2024
Driver Version : 552.22
CUDA Version : 12.4

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 2060 with Max-Q Design
Product Brand : GeForce
Product Architecture : Turing
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : WDDM
Pending : WDDM
Serial Number : N/A
GPU UUID : GPU--------------
Minor Number : N/A
VBIOS Version : 90.06.58.40.03
MultiGPU Board : No
Board ID : 0x100
Board Part Number : N/A
GPU Part Number : 1F12-726-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.02.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : Invalid Argument
Drain and Reset Recommended : Invalid Argument
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1F1210DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x1F111043
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Device Current : 3
Device Max : 3
Host Max : 3
Link Width
Max : 16x
Current : 8x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 1 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 6144 MiB
Reserved : 189 MiB
Used : 338 MiB
Free : 5616 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Conf Compute Protected Memory Usage
Total : N/A
Used : N/A
Free : N/A
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 42 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : 102 C
GPU Target Temperature : 87 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 22.09 W
Current Power Limit : 65.00 W
Requested Power Limit : 65.00 W
Default Power Limit : 65.00 W
Min Power Limit : 1.00 W
Max Power Limit : 65.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 975 MHz
SM : 975 MHz
Memory : 5500 MHz
Video : 915 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 2100 MHz
SM : 2100 MHz
Memory : 5501 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes : None

@rick-github commented on GitHub (Dec 21, 2024):

You don't have a shared GPU so this is likely a different problem. Open a new ticket and add [server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues).

@Sven1403 commented on GitHub (Apr 30, 2025):

Hi @Ida-Ida, I am facing the exact same problem with an A16 NVIDIA GRID shared GPU. Here is my nvidia-smi:

![Image](https://github.com/user-attachments/assets/dc95f083-0705-4abf-82dd-d149a3dc5021)

I am using the CUDA files from the ollama lib, which works fine on a local machine.

Is the driver the problem? Which one solved it for you?

@Sven1403 commented on GitHub (May 5, 2025):

I think for me it's a problem with the licenses for the vGPU profile. We only have licenses for the A profile. With the Q profile it works, but only for 20 min without a license:

![Image](https://github.com/user-attachments/assets/4d7e31e0-d6e0-4f04-ba16-49657522ce7b)

Will test it again after we get a license.

@xor007 commented on GitHub (Aug 2, 2025):

I have the same problem. Within fewer than 200 generate calls, ollama starts timing out.

I don't think it is a license issue. It's not a vGPU-capable GPU. OS is ubuntu 22.04. Not running in a VM, running directly on the host OS. There are docker containers on the same machine but they are not using the GPU. I tried plotting metrics via the dcgm exporter; it shows nothing remarkable:

![Image](https://github.com/user-attachments/assets/1b95502f-278a-4aa8-962a-5f929b3d3eed)

I have a 1080 ti:

==============NVSMI LOG==============

Timestamp : Sat Aug 2 22:21:55 2025
Driver Version : 535.247.01
CUDA Version : 12.2

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce GTX 1080 Ti
Product Brand : GeForce
Product Architecture : Pascal
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-6c580fd6-cbf4-6d43-19b7-6d53db8ea7e0
Minor Number : 0
VBIOS Version : 86.02.39.00.71
MultiGPU Board : No
Board ID : 0x100
Board Part Number : N/A
GPU Part Number : 1B06-350-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G001.0000.01.04
OEM Object : 1.1
ECC Object : N/A
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x1B0610DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x120F10DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Device Current : 1
Device Max : 3
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : 0 %
Performance State : P8
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 11264 MiB
Reserved : 91 MiB
Used : 8 MiB
Free : 11163 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 5 MiB
Free : 251 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 25 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 96 C
GPU Slowdown Temp : 93 C
GPU Max Operating Temp : N/A
GPU Target Temperature : 84 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Power Draw : 8.40 W
Current Power Limit : 250.00 W
Requested Power Limit : 250.00 W
Default Power Limit : 250.00 W
Min Power Limit : 125.00 W
Max Power Limit : 300.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 139 MHz
SM : 139 MHz
Memory : 405 MHz
Video : 544 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1961 MHz
SM : 1961 MHz
Memory : 5505 MHz
Video : 1620 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 1522
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 4 MiB

@rick-github commented on GitHub (Aug 2, 2025):

Open a new issue, include full server log.

@xor007 commented on GitHub (Aug 17, 2025):

> Open a new issue, include full server log.

Mine is not ollama-specific. I noticed that I can trigger the issue with basic PyTorch use. I have decided to get a new GPU and stop troubleshooting.
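
A hedged sketch of the kind of minimal PyTorch workload alluded to above, for checking whether a slowdown reproduces outside Ollama; the matrix size and iteration count are arbitrary assumptions, not the exact workload from the comment.

```python
# Hypothetical minimal GPU workload in PyTorch, to check whether the
# slowdown reproduces outside Ollama. Sizes and counts are arbitrary.
import time

import torch

assert torch.cuda.is_available(), "no CUDA device visible"
device = torch.device("cuda")
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)

for i in range(500):
    start = time.time()
    c = a @ b
    torch.cuda.synchronize()  # wait for the kernel so the timing is real
    if i % 50 == 0:
        print(f"iter {i}: {time.time() - start:.3f}s")
```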

Reference: github-starred/ollama#66203