[GH-ISSUE #12225] Model not loaded into VRAM despite having two GPU spare #8134

Closed
opened 2026-04-12 20:31:02 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @Tianyu209 on GitHub (Sep 9, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12225

What is the issue?

The model is not loaded into VRAM when I use Ollama with two RTX 5060 Ti cards.
Problem:

When I set CUDA_VISIBLE_DEVICES=0,1 and use both GPUs, Ollama detects them and reports something like "offloaded 29/29 layers to GPU", yet even a 1.5B model generates extremely slowly, around 1 token/s. nvidia-smi shows only about 100-200 MB in VRAM; the rest of the model sits in system RAM. Everything works fine when I set CUDA_VISIBLE_DEVICES to 0 or 1 alone. Only with CUDA_VISIBLE_DEVICES=0,1 does the model load very slowly (30+ seconds) and mostly into RAM, even though 28 GB of VRAM is free.
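For context, a persistent user-level setting like the one described above is usually created on Windows with setx (a minimal sketch; whether it was set this way or through the System Properties dialog is an assumption):

```powershell
# Make both GPUs visible to the Ollama server (user-level; affects new processes only).
setx CUDA_VISIBLE_DEVICES "0,1"

# Restart the Ollama app/service so the server re-reads its environment, then
# confirm the value in the "server config" line of the log:
#   env="map[CUDA_VISIBLE_DEVICES:0,1 ...]"
```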
What I tried:

Updated Ollama to the latest version
Updated NVIDIA drivers
Updated CUDA toolkits
Notes:

Until last week I was using two RTX 4070s and models loaded into VRAM normally. This started after I swapped them for two 5060 Ti cards and updated Ollama to 0.11.10.

Would really appreciate guidance on how to resolve this.

nvidia-smi when a 1.5B model (1.1GB) is generating:
[Screenshot: nvidia-smi output while the 1.5B model is generating.]
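A quick way to confirm where the weights end up is to watch per-GPU memory while a prompt is running (standard nvidia-smi query flags; the 1-second interval is arbitrary):

```powershell
# Poll VRAM usage on both GPUs once per second while the model is generating.
nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv -l 1
```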

Relevant log output

time=2025-09-09T17:51:49.981+08:00 level=INFO source=routes.go:1331 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0,1 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:8192 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:true OLLAMA_GPU_OVERHEAD:104857600 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE:q8_0 OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:2 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:C:\\Users\\.ollama\\models OLLAMA_MULTIUSER_CACHE:true OLLAMA_NEW_ENGINE:true OLLAMA_NEW_ESTIMATES:true OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES:]"
time=2025-09-09T17:51:49.993+08:00 level=INFO source=images.go:477 msg="total blobs: 71"
time=2025-09-09T17:51:49.995+08:00 level=INFO source=images.go:484 msg="total unused blobs removed: 0"
time=2025-09-09T17:51:49.995+08:00 level=INFO source=routes.go:1384 msg="Listening on [::]:11434 (version 0.11.10)"
time=2025-09-09T17:51:49.995+08:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-09-09T17:51:49.995+08:00 level=INFO source=gpu_windows.go:167 msg=packages count=1
time=2025-09-09T17:51:49.995+08:00 level=INFO source=gpu_windows.go:183 msg="efficiency cores detected" maxEfficiencyClass=1
time=2025-09-09T17:51:49.995+08:00 level=INFO source=gpu_windows.go:214 msg="" package=0 cores=16 efficiency=8 threads=24
time=2025-09-09T17:51:50.158+08:00 level=INFO source=gpu.go:321 msg="detected OS VRAM overhead" id=GPU-fee6fb1c-00bf-c621-1289-0fbf6937dad0 library=cuda compute=12.0 driver=13.0 name="NVIDIA GeForce RTX 5060 Ti" overhead="797.5 MiB"
time=2025-09-09T17:51:50.282+08:00 level=INFO source=gpu.go:321 msg="detected OS VRAM overhead" id=GPU-ca44d757-2cc4-a597-149c-30b12b5bacb4 library=cuda compute=12.0 driver=13.0 name="NVIDIA GeForce RTX 5060 Ti" overhead="892.0 MiB"
time=2025-09-09T17:51:50.289+08:00 level=INFO source=amd_hip_windows.go:103 msg="AMD ROCm reports no devices found"
time=2025-09-09T17:51:50.289+08:00 level=INFO source=amd_windows.go:49 msg="no compatible amdgpu devices detected"
time=2025-09-09T17:51:50.291+08:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-fee6fb1c-00bf-c621-1289-0fbf6937dad0 library=cuda variant=v12 compute=12.0 driver=13.0 name="NVIDIA GeForce RTX 5060 Ti" total="15.9 GiB" available="14.7 GiB"
time=2025-09-09T17:51:50.291+08:00 level=INFO source=types.go:131 msg="inference compute" id=GPU-ca44d757-2cc4-a597-149c-30b12b5bacb4 library=cuda variant=v12 compute=12.0 driver=13.0 name="NVIDIA GeForce RTX 5060 Ti" total="15.9 GiB" available="14.7 GiB"
time=2025-09-09T17:51:56.414+08:00 level=INFO source=server.go:166 msg="enabling new memory estimates"
time=2025-09-09T17:51:56.454+08:00 level=INFO source=server.go:216 msg="enabling flash attention"
time=2025-09-09T17:51:56.463+08:00 level=INFO source=server.go:398 msg="starting runner" cmd="C:\\Users\\AppData\\Local\\Programs\\Ollama\\ollama.exe runner --ollama-engine --model C:\\Users\\.ollama\\models\\blobs\\sha256-aabd4debf0c8f08881923f2c25fc0fdeed24435271c2b3e92c4af36704040dbc --port 55606"
time=2025-09-09T17:51:56.466+08:00 level=INFO source=server.go:671 msg="loading model" "model layers"=29 requested=-1
time=2025-09-09T17:51:56.495+08:00 level=INFO source=runner.go:1251 msg="starting ollama engine"
time=2025-09-09T17:51:56.496+08:00 level=INFO source=runner.go:1286 msg="Server listening on 127.0.0.1:55606"
time=2025-09-09T17:51:56.502+08:00 level=INFO source=server.go:677 msg="system memory" total="95.7 GiB" free="58.6 GiB" free_swap="61.1 GiB"
time=2025-09-09T17:51:56.502+08:00 level=INFO source=server.go:685 msg="gpu memory" id=GPU-fee6fb1c-00bf-c621-1289-0fbf6937dad0 available="14.2 GiB" free="14.7 GiB" minimum="457.0 MiB" overhead="100.0 MiB"
time=2025-09-09T17:51:56.502+08:00 level=INFO source=server.go:685 msg="gpu memory" id=GPU-ca44d757-2cc4-a597-149c-30b12b5bacb4 available="14.2 GiB" free="14.7 GiB" minimum="457.0 MiB" overhead="100.0 MiB"
time=2025-09-09T17:51:56.503+08:00 level=INFO source=runner.go:1170 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType:q8_0 NumThreads:8 GPULayers:29[ID:GPU-fee6fb1c-00bf-c621-1289-0fbf6937dad0 Layers:29(0..28)] MultiUserCache:true ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-09T17:51:56.519+08:00 level=INFO source=ggml.go:131 msg="" architecture=qwen2 file_type=Q4_K_M name="DeepSeek R1 Distill Qwen 1.5B" description="" num_tensors=339 num_key_values=27
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, ID: GPU-fee6fb1c-00bf-c621-1289-0fbf6937dad0
  Device 1: NVIDIA GeForce RTX 5060 Ti, compute capability 12.0, VMM: yes, ID: GPU-ca44d757-2cc4-a597-149c-30b12b5bacb4
load_backend: loaded CUDA backend from C:\Users\AppData\Local\Programs\Ollama\lib\ollama\ggml-cuda.dll
load_backend: loaded CPU backend from C:\Users\AppData\Local\Programs\Ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-09-09T17:51:56.626+08:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX_VNNI=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 CUDA.1.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.1.USE_GRAPHS=1 CUDA.1.PEER_MAX_BATCH_SIZE=128 compiler=cgo(clang)
time=2025-09-09T17:51:56.718+08:00 level=INFO source=runner.go:1170 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType:q8_0 NumThreads:8 GPULayers:29[ID:GPU-fee6fb1c-00bf-c621-1289-0fbf6937dad0 Layers:29(0..28)] MultiUserCache:true ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-09T17:51:58.500+08:00 level=INFO source=runner.go:1170 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:true KvSize:8192 KvCacheType:q8_0 NumThreads:8 GPULayers:29[ID:GPU-fee6fb1c-00bf-c621-1289-0fbf6937dad0 Layers:29(0..28)] MultiUserCache:true ProjectorPath: MainGPU:0 UseMmap:false}"
time=2025-09-09T17:51:58.500+08:00 level=INFO source=ggml.go:487 msg="offloading 28 repeating layers to GPU"
time=2025-09-09T17:51:58.500+08:00 level=INFO source=ggml.go:493 msg="offloading output layer to GPU"
time=2025-09-09T17:51:58.500+08:00 level=INFO source=ggml.go:498 msg="offloaded 29/29 layers to GPU"
time=2025-09-09T17:51:58.500+08:00 level=INFO source=backend.go:310 msg="model weights" device=CUDA0 size="934.7 MiB"
time=2025-09-09T17:51:58.500+08:00 level=INFO source=backend.go:315 msg="model weights" device=CPU size="125.2 MiB"
time=2025-09-09T17:51:58.500+08:00 level=INFO source=backend.go:321 msg="kv cache" device=CUDA0 size="119.0 MiB"
time=2025-09-09T17:51:58.501+08:00 level=INFO source=backend.go:332 msg="compute graph" device=CUDA0 size="49.0 MiB"
time=2025-09-09T17:51:58.501+08:00 level=INFO source=backend.go:337 msg="compute graph" device=CPU size="3.0 MiB"
time=2025-09-09T17:51:58.501+08:00 level=INFO source=backend.go:342 msg="total memory" size="1.2 GiB"
time=2025-09-09T17:51:58.502+08:00 level=INFO source=sched.go:473 msg="loaded runners" count=1
time=2025-09-09T17:51:58.502+08:00 level=INFO source=server.go:1250 msg="waiting for llama runner to start responding"
time=2025-09-09T17:51:58.502+08:00 level=INFO source=server.go:1284 msg="waiting for server to become available" status="llm server loading model"
time=2025-09-09T17:52:12.278+08:00 level=INFO source=server.go:1288 msg="llama runner started in 15.81 seconds"
[GIN] 2025/09/09 - 17:52:17 | 200 |   20.7115438s |       127.0.0.1 | POST     "/api/chat"
[GIN] 2025/09/09 - 18:02:11 | 200 |         9m45s |       127.0.0.1 | POST     "/api/chat"

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

0.11.10

GiteaMirror added the bug label 2026-04-12 20:31:02 -05:00

@rick-github commented on GitHub (Sep 9, 2025):

How does it perform if you don't set CUDA_VISIBLE_DEVICES?


@Tianyu209 commented on GitHub (Sep 11, 2025):

> How does it perform if you don't set CUDA_VISIBLE_DEVICES?

It still loads into system RAM and generates slowly. But after that I tried resetting all the environment variables, and it works again; it generates normally now. Anyway, thanks for your suggestion.
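For reference, "resetting the env" on Windows would look roughly like this (a sketch; the thread does not say exactly which variables were cleared, CUDA_VISIBLE_DEVICES is the relevant one here):

```powershell
# Remove the persistent user-level override so Ollama falls back to its default
# of using all detected GPUs. (Which variables were actually cleared is an assumption.)
[Environment]::SetEnvironmentVariable("CUDA_VISIBLE_DEVICES", $null, "User")

# Restart the Ollama app/service afterwards so the server re-reads its environment.
```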

Reference: github-starred/ollama#8134