[GH-ISSUE #1204] GPU VRAM size calculations seem incorrect #26376

Closed
opened 2026-04-22 02:37:47 -05:00 by GiteaMirror · 3 comments

Originally created by @FairyTail2000 on GitHub (Nov 20, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/1204

[GIN] 2023/11/20 - 09:01:35 | 204 | 14.066µs | 127.0.0.1 | OPTIONS "/api/generate"
2023/11/20 09:01:36 llama.go:290: 3849 MB VRAM available, loading up to 25 GPU layers
2023/11/20 09:01:36 llama.go:415: starting llama runner
2023/11/20 09:01:36 llama.go:473: waiting for llama runner to start responding
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650, compute capability 7.5
{"timestamp":1700467297,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
{"timestamp":1700467297,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":8,"total_threads":16,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | "}
llama.cpp: loading model from /home/rafaels/.ollama/models/blobs/sha256:8daa9615cce30c259a9555b1cc250d461d1bc69980a274b44d7eda0be78076d8
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_head_kv = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required = 1298.89 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 25 repeating layers to GPU
llama_model_load_internal: offloaded 25/35 layers to GPU
llama_model_load_internal: total VRAM used: 3099 MB
llama_new_context_with_model: kv self size = 1024.00 MB
llama server listening at http://127.0.0.1:59278
{"timestamp":1700467302,"level":"INFO","function":"main","line":1443,"message":"HTTP server listening","hostname":"127.0.0.1","port":59278}
{"timestamp":1700467302,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":58042,"status":200,"method":"HEAD","path":"/","params":{}}
2023/11/20 09:01:42 llama.go:487: llama runner started in 6.203332 seconds
CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml/ggml-cuda.cu:4856: out of memory
2023/11/20 09:01:46 llama.go:430: 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml/ggml-cuda.cu:4856: out of memory
2023/11/20 09:01:46 llama.go:504: llama runner stopped successfully
[GIN] 2023/11/20 - 09:01:46 | 200 | 11.036819193s | 127.0.0.1 | POST "/api/generate"

As visible in the log, ollama correctly calculated that not all layers fit into the GPU. However, other factors also seem to play a role, especially the reported "mem required", since the run still ends in a CUDA out-of-memory error.

I'm running a GTX 1650 Laptop with 4 GB VRAM and 32 GB main memory. The VRAM load before and after running ollama is 54 MiB, allocated by the system. Ollama version: 0.1.10.
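For reference, here is a rough back-of-the-envelope check using the figures from the log above. This is only an illustration of the arithmetic, not ollama's actual layer-selection code, and it assumes (the log does not say) that the scratch buffer and the per-state memory compete for the same 3849 MB rather than already being counted in the "total VRAM used" line:

package main

import "fmt"

func main() {
	// Figures taken from the log above (all in MB).
	availableVRAM := 3849.0 // "3849 MB VRAM available, loading up to 25 GPU layers"
	totalVRAMUsed := 3099.0 // "total VRAM used: 3099 MB" after offloading 25 layers
	scratchVRAM := 384.0    // "384 MB VRAM for the scratch buffer"
	perState := 1024.0      // "mem required = 1298.89 MB (+ 1024.00 MB per state)"

	// Assumption (mine, not stated in the log): the scratch buffer and the
	// per-state buffer are additional allocations on top of the offloaded
	// layers; the log does not make clear whether "total VRAM used" already
	// includes them.
	headroom := availableVRAM - totalVRAMUsed - scratchVRAM
	fmt.Printf("headroom after layers + scratch: %.0f MB\n", headroom) // ~366 MB
	fmt.Printf("room left for the 1024 MB per-state buffer: %v\n", headroom >= perState)
}

Under those assumptions the remaining headroom is well below the 1024 MB per-state figure, which would be consistent with the out-of-memory error despite the layer count itself being conservative.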


@FairyTail2000 commented on GitHub (Nov 20, 2023):

Probably should have added, I'm using llama2:7b


@pdevine commented on GitHub (Jan 25, 2024):

@jmorganca has been making a bunch of changes to this, so it should be better now. @FairyTail2000 are you still seeing issues?


@FairyTail2000 commented on GitHub (Jan 26, 2024):

Looks good. Although I'm now using the chat API, which might handle the calculation differently, but since I don't know, I'll consider this solved.
