[GH-ISSUE #734] Need an option for GPUs with low memory #62380

Closed
opened 2026-05-03 08:32:43 -05:00 by GiteaMirror · 1 comment
Owner

Originally created by @tacsotai on GitHub (Oct 8, 2023).
Original GitHub issue: https://github.com/ollama/ollama/issues/734

I tried your great program "ollama".
It worked with the CPU, but unfortunately the GPU in my Linux machine does not have enough memory.
So, could you add an option for GPUs with low memory?

```
$ ollama serve
2023/10/08 06:05:12 images.go:996: total blobs: 17
2023/10/08 06:05:12 images.go:1003: total unused blobs removed: 0
2023/10/08 06:05:12 routes.go:572: Listening on 127.0.0.1:11434
2023/10/08 06:05:44 llama.go:239: 6144 MiB VRAM available, loading up to 54 GPU layers
2023/10/08 06:05:44 llama.go:313: starting llama runner
2023/10/08 06:05:44 llama.go:349: waiting for llama runner to start responding
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3060 Laptop GPU, compute capability 8.6
{"timestamp":1696745144,"level":"INFO","function":"main","line":1190,"message":"build info","build":1009,"commit":"9e232f0"}
{"timestamp":1696745144,"level":"INFO","function":"main","line":1192,"message":"system info","n_threads":6,"total_threads":12,"system_info":"AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | "}
llama.cpp: loading model from /home/tac/.ollama/models/blobs/sha256:b5749cc827d33b7cb4c8869cede7b296a0a28d9e5d1982705c2ba4c603258159
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: using CUDA for GPU acceleration
llama_model_load_internal: mem required  =  468.40 MB (+ 1024.00 MB per state)
llama_model_load_internal: allocating batch_size x (512 kB + n_ctx x 128 B) = 384 MB VRAM for the scratch buffer
llama_model_load_internal: offloading 32 repeating layers to GPU
llama_model_load_internal: offloading non-repeating layers to GPU
llama_model_load_internal: offloading v cache to GPU
llama_model_load_internal: offloading k cache to GPU
llama_model_load_internal: offloaded 35/35 layers to GPU
llama_model_load_internal: total VRAM used: 4954 MB
llama_new_context_with_model: kv self size  = 1024.00 MB

llama server listening at http://127.0.0.1:52159

{"timestamp":1696745144,"level":"INFO","function":"main","line":1443,"message":"HTTP server listening","hostname":"127.0.0.1","port":52159}
{"timestamp":1696745144,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":51346,"status":200,"method":"HEAD","path":"/","params":{}}
2023/10/08 06:05:44 llama.go:365: llama runner started in 0.802513 seconds
{"timestamp":1696745144,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":51346,"status":200,"method":"POST","path":"/tokenize","params":{}}
{"timestamp":1696745145,"level":"INFO","function":"log_server_request","line":1157,"message":"request","remote_addr":"127.0.0.1","remote_port":51346,"status":200,"method":"POST","path":"/tokenize","params":{}}
CUDA error 2 at /go/src/github.com/jmorganca/ollama/llm/llama.cpp/ggml/ggml-cuda.cu:4856: out of memory
[GIN] 2023/10/08 - 06:05:45 | 200 |  2.741464312s |       127.0.0.1 | POST     "/api/generate"
2023/10/08 06:05:45 llama.go:323: llama runner exited with error: exit status 1
```
```
$ nvidia-smi
Sun Oct  8 06:04:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   38C    P0              N/A /  80W |      2MiB /  6144MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```
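A possible workaround for the out-of-memory error above, rather than a dedicated low-memory mode, is to cap how many layers Ollama offloads to the GPU with the `num_gpu` parameter, so the remaining layers stay in system RAM. The sketch below is illustrative only: the model name `llama2-lowvram` and the value `20` are placeholders, and the layer count that actually fits in 6 GiB of VRAM has to be found by experiment.

```
# Sketch: create a model variant with a reduced GPU offload (values are guesses).
$ cat > Modelfile <<'EOF'
FROM llama2
# num_gpu: how many layers are sent to the GPU; the rest run on the CPU.
PARAMETER num_gpu 20
EOF
$ ollama create llama2-lowvram -f Modelfile
$ ollama run llama2-lowvram
```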
Author
Owner

@tacsotai commented on GitHub (Oct 8, 2023):

It seems at least 8 GB is required for each model:
https://ollama.ai/library/llama2

So my environment is not suitable for it.
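If creating a custom Modelfile is not convenient, the same `num_gpu` option can, in principle, also be passed per request through the generate API; the value `16` below is again only a placeholder, and `num_gpu: 0` should fall back to CPU-only inference.

```
# Sketch (assumed option shape): override num_gpu for a single request.
$ curl http://127.0.0.1:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Why is the sky blue?",
  "options": { "num_gpu": 16 }
}'
```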
