[GH-ISSUE #5767] Ollama v0.2.+ with phi3:mini increased RAM consumption compared to 0.1.48 #3591

Closed
opened 2026-04-12 14:19:50 -05:00 by GiteaMirror · 5 comments
Owner

Originally created by @TomMalow on GitHub (Jul 18, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/5767

What is the issue?

I am currently working on a project where we are integrating with LLMs and using Ollama with the phi3:mini model in a container as a local testing environment. The project was initially using version 0.1.48, which can run on a fairly small VM, perfect for local testing, only taking 2.8 GB of RAM. However, after upgrading to v0.2, Ollama now requires at least 5.6 GB of RAM to run the same model. That is an increase of 2.8 GB to run the same model between 0.1.48 and 0.2.6. The same issue occurs for all 0.2.x versions, but only from 0.2.4 is the issue properly reported. It almost looks like the model is loaded twice into memory.

The container running without the model only draws 28 MB of RAM.

I have a coworker who runs the same project on Windows and does not see the same increase in RAM usage between the different versions.

OS

macOS

GPU

Apple

CPU

Apple silicon M3

Ollama version

0.2.6

GiteaMirror added the bug label 2026-04-12 14:19:50 -05:00
Author
Owner

@rick-github commented on GitHub (Jul 18, 2024):

Server logs might enable diagnosis of the problem.

Author
Owner

@TomMalow commented on GitHub (Jul 18, 2024):

That makes sense, but I did not include them initially as they didn't seem to hold any extra details. But I did find something when looking them over again, so thanks for the reminder to add logs.

I noticed the following log message: `msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1`. So I decided to see what would happen if I set `OLLAMA_MAX_LOADED_MODELS=1`. Nothing changed, but setting `OLLAMA_NUM_PARALLEL=1` did help, and it now consumes the same amount of memory as before 0.2.x. That makes sense, as 0.2.x introduced parallel request handling.

I played around with different values of `OLLAMA_NUM_PARALLEL`, and each increment consumes more RAM. The default value of 0 seems to actually mean 4, as that consumes the same amount of RAM as initially reported.
This is probably the intended behaviour, but one could argue that a default value of 4 is a bit unnecessary, as I assume most people will only have 1 model running at any given time, especially since 0.1.x did not have any parallelism.
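
The workaround above can be scripted when the server is started from a test harness. The following is a minimal sketch, assuming the `ollama` binary is on `PATH`; the helper names `build_ollama_env` and `start_ollama` are hypothetical and not part of Ollama. Only the `OLLAMA_NUM_PARALLEL` variable and the `ollama serve` command come from Ollama itself.

```python
import os
import subprocess


def build_ollama_env(num_parallel=1, base_env=None):
    """Return a copy of the environment with OLLAMA_NUM_PARALLEL pinned.

    A value of 0 is the auto-select default discussed in this thread,
    so this helper only accepts explicit values of 1 or more.
    """
    if num_parallel < 1:
        raise ValueError("num_parallel must be at least 1")
    env = dict(os.environ if base_env is None else base_env)
    env["OLLAMA_NUM_PARALLEL"] = str(num_parallel)
    return env


def start_ollama(num_parallel=1):
    """Start `ollama serve` with the pinned setting (assumes `ollama` is on PATH)."""
    return subprocess.Popen(["ollama", "serve"], env=build_ollama_env(num_parallel))
```

Calling `start_ollama(1)` would launch the server with one parallel slot per model, which is the configuration that restored the 0.1.48 memory footprint in this report.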

Debug log with default values shown below:

2024/07/18 16:35:57 routes.go:1096: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: OLLAMA_DEBUG:true OLLAMA_FLASH_ATTENTION:false OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MAX_VRAM:0 OLLAMA_MODELS:/root/.ollama/models OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_RUNNERS_DIR: OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES:]"
2024-07-18T16:35:57.153195894Z time=2024-07-18T16:35:57.153Z level=INFO source=images.go:778 msg="total blobs: 5"
2024-07-18T16:35:57.154634086Z time=2024-07-18T16:35:57.154Z level=INFO source=images.go:785 msg="total unused blobs removed: 0"
2024-07-18T16:35:57.156344318Z time=2024-07-18T16:35:57.156Z level=INFO source=routes.go:1143 msg="Listening on [::]:11434 (version 0.2.6)"
2024-07-18T16:35:57.156962019Z time=2024-07-18T16:35:57.156Z level=INFO source=payload.go:30 msg="extracting embedded files" dir=/tmp/ollama277969984/runners
2024-07-18T16:35:57.157094309Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cpu file=build/linux/arm64/cpu/bin/ollama_llama_server.gz
2024-07-18T16:35:57.157300390Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/libcublas.so.11.gz
2024-07-18T16:35:57.157303890Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/libcublasLt.so.11.gz
2024-07-18T16:35:57.157946175Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/libcudart.so.11.0.gz
2024-07-18T16:35:57.157948592Z time=2024-07-18T16:35:57.157Z level=DEBUG source=payload.go:182 msg=extracting variant=cuda_v11 file=build/linux/arm64/cuda_v11/bin/ollama_llama_server.gz
2024-07-18T16:35:59.631799045Z time=2024-07-18T16:35:59.631Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cpu/ollama_llama_server
2024-07-18T16:35:59.631810753Z time=2024-07-18T16:35:59.631Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cuda_v11/ollama_llama_server
2024-07-18T16:35:59.631812087Z time=2024-07-18T16:35:59.631Z level=INFO source=payload.go:44 msg="Dynamic LLM libraries [cpu cuda_v11]"
2024-07-18T16:35:59.631812962Z time=2024-07-18T16:35:59.631Z level=DEBUG source=payload.go:45 msg="Override detection logic by setting OLLAMA_LLM_LIBRARY"
2024-07-18T16:35:59.631813795Z time=2024-07-18T16:35:59.631Z level=DEBUG source=sched.go:102 msg="starting llm scheduler"
2024-07-18T16:35:59.631814503Z time=2024-07-18T16:35:59.631Z level=INFO source=gpu.go:205 msg="looking for compatible GPUs"
2024-07-18T16:35:59.631815420Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:91 msg="searching for GPU discovery libraries for NVIDIA"
2024-07-18T16:35:59.631816170Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcuda.so*
2024-07-18T16:35:59.631860961Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcuda.so** /usr/local/nvidia/lib64/libcuda.so** /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
2024-07-18T16:35:59.632050584Z time=2024-07-18T16:35:59.631Z level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths=[]
2024-07-18T16:35:59.632052501Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:468 msg="Searching for GPU library" name=libcudart.so*
2024-07-18T16:35:59.632054084Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:487 msg="gpu library search" globs="[/usr/local/nvidia/lib/libcudart.so** /usr/local/nvidia/lib64/libcudart.so** /tmp/ollama277969984/runners/cuda*/libcudart.so* /usr/local/cuda/lib64/libcudart.so* /usr/lib/x86_64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/x86_64-linux-gnu/libcudart.so* /usr/lib/wsl/lib/libcudart.so* /usr/lib/wsl/drivers/*/libcudart.so* /opt/cuda/lib64/libcudart.so* /usr/local/cuda*/targets/aarch64-linux/lib/libcudart.so* /usr/lib/aarch64-linux-gnu/nvidia/current/libcudart.so* /usr/lib/aarch64-linux-gnu/libcudart.so* /usr/local/cuda/lib*/libcudart.so* /usr/lib*/libcudart.so* /usr/local/lib*/libcudart.so*]"
2024-07-18T16:35:59.632286123Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:521 msg="discovered GPU libraries" paths=[/tmp/ollama277969984/runners/cuda_v11/libcudart.so.11.0]
2024-07-18T16:35:59.632654035Z cudaSetDevice err: 35
2024-07-18T16:35:59.632705201Z time=2024-07-18T16:35:59.632Z level=DEBUG source=gpu.go:533 msg="Unable to load cudart" library=/tmp/ollama277969984/runners/cuda_v11/libcudart.so.11.0 error="your nvidia driver is too old or missing.  If you have a CUDA GPU please upgrade to run ollama"
2024-07-18T16:35:59.632706993Z time=2024-07-18T16:35:59.632Z level=DEBUG source=amd_linux.go:356 msg="amdgpu driver not detected /sys/module/amdgpu"
2024-07-18T16:35:59.632707826Z time=2024-07-18T16:35:59.632Z level=INFO source=gpu.go:346 msg="no compatible GPUs were discovered"
2024-07-18T16:35:59.632753700Z time=2024-07-18T16:35:59.632Z level=INFO source=types.go:105 msg="inference compute" id=0 library=cpu compute="" driver=0.0 name="" total="5.8 GiB" available="5.0 GiB"
2024-07-18T16:36:02.156722849Z [GIN] 2024/07/18 - 16:36:02 | 200 |      71.582µs |       127.0.0.1 | HEAD     "/"
2024-07-18T16:36:02.168472601Z [GIN] 2024/07/18 - 16:36:02 | 200 |    11.21226ms |       127.0.0.1 | POST     "/api/show"
2024-07-18T16:36:02.179252034Z time=2024-07-18T16:36:02.179Z level=DEBUG source=gpu.go:358 msg="updating system memory data" before.total="5.8 GiB" before.free="5.0 GiB" before.free_swap="0 B" now.total="5.8 GiB" now.free="5.0 GiB" now.free_swap="0 B"
2024-07-18T16:36:02.179296116Z time=2024-07-18T16:36:02.179Z level=DEBUG source=sched.go:177 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
2024-07-18T16:36:02.184381253Z time=2024-07-18T16:36:02.184Z level=DEBUG source=sched.go:201 msg="cpu mode with first model, loading"
2024-07-18T16:36:02.184981578Z time=2024-07-18T16:36:02.184Z level=DEBUG source=server.go:100 msg="system memory" total="5.8 GiB" free="5.0 GiB" free_swap="0 B"
2024-07-18T16:36:02.184987245Z time=2024-07-18T16:36:02.184Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cpu/ollama_llama_server
2024-07-18T16:36:02.184989662Z time=2024-07-18T16:36:02.184Z level=DEBUG source=payload.go:71 msg="availableServers : found" file=/tmp/ollama277969984/runners/cuda_v11/ollama_llama_server
2024-07-18T16:36:02.184991537Z time=2024-07-18T16:36:02.184Z level=DEBUG source=memory.go:101 msg=evaluating library=cpu gpu_count=1 available="[5.0 GiB]"
2024-07-18T16:36:02.184993287Z time=2024-07-18T16:36:02.184Z level=WARN source=server.go:132 msg="model request too large for system" requested="5.6 GiB" available=5330853888 total="5.8 GiB" free="5.0 GiB" swap="0 B"
2024-07-18T16:36:02.184995370Z time=2024-07-18T16:36:02.184Z level=INFO source=sched.go:416 msg="NewLlamaServer failed" model=/root/.ollama/models/blobs/sha256-3e38718d00bb0007ab7c0cb4a038e7718c07b54f486a7810efd03bb4e894592a error="model requires more system memory (5.6 GiB) than is available (5.0 GiB)"
2024-07-18T16:36:02.184997578Z [GIN] 2024/07/18 - 16:36:02 | 500 |   15.662614ms |       127.0.0.1 | POST     "/api/chat"
2024-07-18T16:36:02.185393531Z Error: model requires more system memory (5.6 GiB) than is available (5.0 GiB)
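
To make the numbers in the WARN line above easier to compare, here is a small sketch that parses the key=value fields of that line and computes the shortfall. The helpers `parse_logfmt` and `parse_gib` are hypothetical names written for this illustration; the log line is copied from the output above with the container timestamp prefix dropped.

```python
import shlex

GIB = 1024 ** 3  # bytes per GiB


def parse_logfmt(line):
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip tokens that are not key=value pairs
            fields[key] = value
    return fields


def parse_gib(text):
    """Convert a string such as '5.6 GiB' to a float number of GiB."""
    number, unit = text.split()
    if unit != "GiB":
        raise ValueError(f"unexpected unit: {unit}")
    return float(number)


WARN_LINE = (
    'time=2024-07-18T16:36:02.184Z level=WARN source=server.go:132 '
    'msg="model request too large for system" requested="5.6 GiB" '
    'available=5330853888 total="5.8 GiB" free="5.0 GiB" swap="0 B"'
)

fields = parse_logfmt(WARN_LINE)
requested_gib = parse_gib(fields["requested"])
available_gib = int(fields["available"]) / GIB  # the log gives this value in bytes
shortfall_gib = requested_gib - available_gib

print(f"requested {requested_gib:.1f} GiB, "
      f"available {available_gib:.2f} GiB, "
      f"short by {shortfall_gib:.2f} GiB")
```

The `available` field is reported in bytes and rounds to the 5.0 GiB shown elsewhere in the log, so the request for 5.6 GiB misses by a bit more than half a GiB.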
Author
Owner

@rick-github commented on GitHub (Jul 18, 2024):

The FAQ (https://github.com/ollama/ollama/blob/main/docs/faq.md#how-does-ollama-handle-concurrent-requests) says "The maximum number of parallel requests each model will process at the same time. The default will auto-select either 4 or 1 based on available memory", so it seems to be up to the user to set this appropriately for deterministic behaviour.
Author
Owner

@TomMalow commented on GitHub (Jul 18, 2024):

That explains the behavior I see. But it also raises the question of whether the auto-scaling behavior should be adjusted a bit.

5 GB of available memory is on the low end for any model. So going for a maximum of 4 parallel requests in this case is a bit much, as it currently eats up over half the memory just to allow for it, thereby blocking even one of the smallest models from being loaded.

But as you wrote, the user can define this for scenarios such as this, and it solves my immediate problem. If you find that there is no need for any changes, then I can go ahead and close this issue.

Author
Owner

@dhiltgen commented on GitHub (Jul 22, 2024):

@TomMalow if we don't have enough memory to fully load the model with 4, we will fall back to 1, so it should never "block" loading a model. The only case where this can get a little tricky is if you want to load 2+ models that exactly fit with parallel set to 1 for all of them: our default algorithm would load the first with 4, then fail to load subsequent models. This is why we expose the OLLAMA_NUM_PARALLEL setting, so users can have more control over the behavior if our default algorithm isn't optimal for their use case.
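
The fallback described above, and the tricky multi-model case, can be illustrated with a short sketch. This is not Ollama's scheduler code: `pick_num_parallel` and `load_models` are hypothetical names, the memory table reuses the 2.8 and 5.6 figures reported in this issue for phi3:mini (treated as GiB for simplicity), and the model names are placeholders.

```python
# Estimated memory per parallel setting, taken from the figures in this issue.
# Only the settings 1 and 4 are covered by this table.
ESTIMATE_GIB = {1: 2.8, 4: 5.6}


def pick_num_parallel(free_gib, estimate_gib, configured=0):
    """Pick a parallel setting that fits in free memory, or None if nothing fits.

    configured=0 mimics the described default: try 4 first, then fall back to 1.
    A positive value pins the setting and must be a key of estimate_gib.
    """
    if configured > 0:
        return configured if estimate_gib[configured] <= free_gib else None
    for candidate in (4, 1):
        if estimate_gib[candidate] <= free_gib:
            return candidate
    return None


def load_models(total_free_gib, models, configured=0):
    """Load models in order and return a list of (name, chosen setting or None)."""
    free = total_free_gib
    results = []
    for name, estimate in models:
        chosen = pick_num_parallel(free, estimate, configured)
        if chosen is not None:
            free -= estimate[chosen]  # memory is consumed by the loaded model
        results.append((name, chosen))
    return results


two_models = [("model-a", ESTIMATE_GIB), ("model-b", ESTIMATE_GIB)]
print(load_models(6.0, two_models))                # default: first takes 4, second fails
print(load_models(6.0, two_models, configured=1))  # pinned to 1: both fit
```

With 6.0 GiB free, the default path gives the first model 4 slots and leaves too little for the second, while pinning the setting to 1 lets both load. With 5.0 GiB free, a single model falls back to 1 under this described behavior.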

Reference: github-starred/ollama#3591