[GH-ISSUE #9457] Preview 0.5.13-rc2 uses 5 times more ram #52676

Closed
opened 2026-04-29 00:03:16 -05:00 by GiteaMirror · 92 comments

Originally created by @Abdulrahman392011 on GitHub (Mar 2, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9457

What is the issue?

When I load the granite vision model, which is 2.5 GB, RAM usage and the ollama ps command show an 11 GB model running.

Also, when I run the fp16 version of granite vision (6 GB), it shows 15 GB in RAM and in the ollama ps output.

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-29 00:03:16 -05:00

@rick-github commented on GitHub (Mar 2, 2025):

2.5G (or 6G) is just the size of the model weights. ollama also needs memory for the context buffer, model graph, etc. The larger the context buffer, the more memory is required. The default context size of 16384 tokens results in a memory footprint of 5.2G, so if you are seeing 11G, you have likely used a larger context size.

$ ollama run granite3.2-vision:2b-q4_K_M ''
$ ollama ps
NAME                           ID              SIZE      PROCESSOR    UNTIL   
granite3.2-vision:2b-q4_K_M    3be41a661804    5.2 GB    100% GPU     Forever    
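For illustration, a minimal sketch of checking the effect of context size directly: load the model through the API with an explicit num_ctx and compare the ollama ps footprint. The num_ctx value below is only an illustrative assumption, not a recommendation for this model.

```sh
# Load the model with an explicit context size (illustrative value), then check the footprint.
# An empty prompt just loads the model without generating anything.
curl -s http://localhost:11434/api/generate -d '{
  "model": "granite3.2-vision:2b-q4_K_M",
  "prompt": "",
  "options": {"num_ctx": 2048}
}' > /dev/null
ollama ps
```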

@Abdulrahman392011 commented on GitHub (Mar 2, 2025):

Is there a way I can check what the context size is?
I didn't tamper with any of the settings on purpose, but it's worth checking.


@Abdulrahman392011 commented on GitHub (Mar 2, 2025):

NAME                           ID              SIZE      PROCESSOR    UNTIL
granite3.2-vision:2b-q4_K_M    3be41a661804    7.0 GB    100% CPU     4 minutes from now


@Abdulrahman392011 commented on GitHub (Mar 2, 2025):

NAME                    ID              SIZE     PROCESSOR    UNTIL
granite3.2-vision:2b    3be41a661804    11 GB    100% CPU     4 minutes from now


@Abdulrahman392011 commented on GitHub (Mar 2, 2025):

Mind you, those two are the same model and the same quantization.

Image: https://github.com/user-attachments/assets/97f561df-da36-43cf-b666-3b0b6c9fe100


@Abdulrahman392011 commented on GitHub (Mar 2, 2025):

NAME                         ID              SIZE     PROCESSOR    UNTIL
granite3.2-vision:2b-fp16    17ca6aa97bd9    15 GB    100% CPU     4 minutes from now


@rick-github commented on GitHub (Mar 3, 2025):

Server logs (https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) will show the size of the context that the runner is started with: look for --ctx-size and divide by --parallel. If you haven't modified settings, then your clients are passing num_ctx in their API calls. Since granite3.2-vision:2b-q4_K_M was loaded with two different sizes, two of the clients are setting num_ctx to different values.
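As a rough sketch of pulling those values out of the logs (this assumes the standard systemd install, so the server logs land in the journal; adjust if you log elsewhere):

```sh
# Find the most recent runner invocation's context and parallel settings.
journalctl -u ollama --no-pager | grep -oE 'ctx-size [0-9]+' | tail -1
journalctl -u ollama --no-pager | grep -oE 'parallel [0-9]+' | tail -1
# The effective per-request context is ctx-size divided by parallel.
```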


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

server_log_ollama.txt (https://github.com/user-attachments/files/19051814/server_log_ollama.txt)

I've tried checking but I can't figure it out.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I am not sure if this is related, but I had a problem with the moondream model: it returned a weird error. On the other hand, LLaVA works fine, so it's not an issue with all vision models or something like that.

It's also worth mentioning that I am running on CPU, and the GPU in the laptop is an old NVIDIA card that isn't supported by ollama.

Also, there are only 8 gigabytes of RAM, and I have about 50 gigabytes of swap space on the disk.

Another thing: I haven't confirmed that the granite-vision models actually work. When they load into swap, everything becomes too slow, so I lose patience and stop them.

Note: the swap space is there for building apps from source; without it the device freezes and crashes. I don't actually use it for running LLMs (too slow).


@rick-github commented on GitHub (Mar 3, 2025):

Mar 03 07:00:09 box ollama[2691]: time=2025-03-03T07:00:09.912-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-1aefcd9a8a15091b670951963b5f8a7e6653bb1350345e9621e179685ac9bc5f --ctx-size 65536 --batch-size 512 --mmproj /usr/share/ollama/.ollama/models/blobs/sha256-4d464be24899cf8dc1862945432e0cef4366c4181fa38b14754cc9279b727608 --threads 2 --no-mmap --parallel 4 --port 33817"

ctx-size is 65536, parallel is 4, so context size is 16384. This is the default, so the clients aren't overriding num_ctx. What's causing the large VRAM footprint is allocation of extra buffers for parallel completions, controlled by OLLAMA_NUM_PARALLEL. This is unset so ollama is using the default value of 4. You can reduce the VRAM footprint by setting OLLAMA_NUM_PARALLEL=1 in the server environment.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

That's done with:

sudo nano /etc/systemd/system/ollama.service

then, under the [Service] section, adding:
Environment="OLLAMA_NUM_PARALLEL=1"


@rick-github commented on GitHub (Mar 3, 2025):

The recommended way (https://github.com/ollama/ollama/blob/main/docs/faq.md#setting-environment-variables-on-linux) is:

sudo systemctl edit ollama.service

This will create an overrides file, and if ollama is upgraded the changes will be preserved. If you edit the service file directly, changes will be lost on the next upgrade.
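A minimal sketch of what that looks like end to end, assuming the default systemd install (the override path shown in the comments is the usual location, but systemd chooses it for you):

```sh
# `systemctl edit` opens an editor on an override file, typically
# /etc/systemd/system/ollama.service.d/override.conf. Add these lines in the editor:
#
#   [Service]
#   Environment="OLLAMA_NUM_PARALLEL=1"
#
# Then reload units and restart the service so the new environment takes effect.
sudo systemctl edit ollama.service
sudo systemctl daemon-reload
sudo systemctl restart ollama
```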


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

It worked after a reboot.

NAME                           ID              SIZE      PROCESSOR    UNTIL
granite3.2-vision:2b-q4_K_M    3be41a661804    4.7 GB    100% CPU     4 minutes from now

But this will affect all the other models. Is there a way I can make it specific to this model?
Another thing: it's still too big. LLaVA 7B at the same quantization level is about 6 gigabytes in RAM, and faster. I don't know why, but there is something we're missing here.

Just for comparison's sake, I gave it an image and it still hasn't produced any output. It's been about 20 minutes, and still nothing. LLaVA takes less than 5 minutes.

So, in conclusion, it's not working even after fixing the RAM issue. There is something else we need to fix.

It's not really worth the time and effort, but I am willing to stay on this out of curiosity if you are.
Any ideas as to what is causing this? The CPU is running at full blast, but I see a repeating pattern in the system monitor, and a slight rise and fall in RAM corresponding to the CPU pattern changes, almost as if it's failing to load something, but not quite.


@rick-github commented on GitHub (Mar 3, 2025):

It's possible the model has lost cohesion and is just outputting a stream of tokens without hitting an end-of-sequence token. You can make it exit this state by limiting the number of tokens it can generate with num_predict.
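As an illustration only, a minimal sketch of capping generation through the API (the model name and the value 64 are arbitrary for the demo):

```sh
# Cap generation with num_predict so a runaway response terminates quickly.
curl -s http://localhost:11434/api/chat -d '{
  "model": "granite3.2-vision:2b-q4_K_M",
  "messages": [{"role": "user", "content": "Describe a puppy."}],
  "options": {"num_predict": 64},
  "stream": false
}' | jq .message.content
```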


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Hang on, man. I am trying to run the same file on LLaVA to make sure that the issue isn't in the ollama version update, and also to make sure that the issue isn't in the picture I am using.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Alright, so I changed the picture to PNG, as that's what I used in the past, tried LLaVA, and timed the response, but I am up to 24 minutes now and nothing.

I will try reverting to the older version of ollama and run the same thing again. Hopefully it will work again, and we will have narrowed it down to the version update, and you can check what changed for granite that could affect LLaVA as well.


@rick-github commented on GitHub (Mar 3, 2025):

I did a quick check using 0.5.13-rc4 and didn't see any problems:

$ for i in minicpm-v llava llama3.2-vision moondream granite3.2-vision ; do echo -n "$i: " ; echo '{"model": "'$i'",
         "messages":[{
            "role":"user","content":"Describe this image.",
            "images": [
              "'"$(base64 puppy.jpg)"'"
            ]
          }],
         "stream":false}' | curl -s http://localhost:11434/api/chat -d @- | jq  .message.content; done
minicpm-v: "The image depicts a cute, small white puppy sitting on what appears to be a concrete step or pavement. The puppy is wearing a vibrant red collar adorned with decorative elements such as bells and possibly beads. Its fur looks soft and fluffy, typical of many young dogs.\n\nIn the background, there's an out-of-focus area that could indicate it might be taken in an outdoor setting like a backyard or park. There are no other significant objects or animals visible in this scene, making the puppy the focal point of the image. The overall mood conveyed is one of innocence and curiosity as the little dog seems to gaze into its surroundings with interest.\n\nThe color contrast between the white fur of the puppy and the red collar against a potentially muted background adds visual appeal to the photo."
llava: " This is a color photograph featuring an adorable puppy sitting on what appears to be the edge of a concrete step or curb. The puppy has white fur and seems to be looking towards the camera with a curious expression. It's wearing a red collar with a small bell attached, suggesting it might belong to someone nearby.\n\nThe background is out of focus but suggests an urban environment with a blurred image that looks like a building facade, giving context for the location where the photo was taken. The puppy's position and the angle at which the photo was taken create a sense of depth and perspective. There are no visible texts or distinguishing marks in this image. "
llama3.2-vision: "Here is a concise summary of the image: A small white puppy with a red collar and bell sits on a concrete surface, facing right.\n\nIn the foreground, the puppy's fluffy fur and adorable features are prominent, including its short snout, dark nose, and dark eyes. The red collar around its neck adds a pop of color to the otherwise monochromatic scene. The puppy is positioned on a light-colored, speckled concrete surface, which provides a clean and neutral background for the subject.\n\nThe image's background is minimalistic, with no other objects or distractions present. Overall, the composition effectively focuses attention on the endearing puppy, creating an intimate and charming atmosphere."
moondream: "\nA small white puppy with a red collar is sitting on the edge of a concrete step, looking directly at the camera. The puppy's collar has a bell attached to it."
granite3.2-vision: "\nThe image depicts a small white dog sitting on what appears to be a stone or concrete surface. The dog is positioned in such a way that it faces slightly towards the right side of the frame, giving an impression of attentiveness and curiosity. Its fur is fluffy and well-groomed, indicating that it might be a young puppy.\n\nThe dog's collar is red with a golden bell attached to it. The bell is hanging down from the collar, suggesting that it may have been used for training or as a means of alerting people in case the dog was lost. The collar also has a small tag attached to it, which could contain information about the dog's name and owner.\n\nThe background of the image is blurred but appears to be an outdoor setting with some greenery visible, possibly indicating that the photograph was taken during daylight hours in an area with trees or plants nearby. The lighting suggests it might be natural sunlight, casting soft shadows on the dog and the surface beneath it.\n\nThe overall composition of the image is simple yet striking, focusing on the puppy as the central subject. The contrast between the white fur of the dog and the darker background elements draws attention to the animal, making it the focal point of the photograph.\n\nGiven this detailed description, a pure text model can answer various questions related to the image:\n\n1. **What is the main subject of the image?**\n   - The main subject of the image is a small white dog sitting on a stone or concrete surface.\n\n2. **Describe the dog's appearance.**\n   - The dog has fluffy, well-groomed fur and is wearing a red collar with a golden bell attached to it. There is also a tag visible on the collar.\n\n3. **What can be inferred about the setting of the image?**\n   - The setting appears to be outdoors during daylight hours, possibly in an area with greenery nearby.\n\n4. **What does the dog's posture suggest?**\n   - The dog's posture suggests attentiveness and curiosity as it faces slightly towards the right side of the frame.\n\n5. **Is there any additional detail that can be inferred from the image?**\n   - Yes, the presence of a bell on the collar indicates that the dog might have been used for training or to alert people in case it was lost. The tag attached to the collar could contain information about the dog's name and owner.\n\nBy providing this detailed description, a pure text model can effectively answer questions related to the image based on the visual content presented."


@rick-github commented on GitHub (Mar 3, 2025):

0.5.13-rc2 also shows no problems using the above script.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

So I installed the stable version, and I think I understand what I did wrong.

When I was trying to install the preview version, I made a mistake: I manually downloaded the version I needed and ran "sudo tar -C /usr -xzf ollama-linux-amd64.tgz", which updated the client only, not the server. Then I ran "curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.13-rc2 sh", which updated the server, but I think I broke something in the process. Now that I have downloaded the stable version, ollama --version tells me the version is 0.5.12 but the client is 0.5.13-rc2, so I am uninstalling ollama completely and installing it again.

I'll keep you updated.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

LLaVA 7B works; it gave a response in 4 minutes.

By the way, the client thing isn't relevant. I changed only the server to 0.5.12 while the client is still 0.5.13-rc2, and LLaVA is working fine despite the client being newer. This also means that the change that caused the issue is in the server, not the client.

I have to emphasize that I am running on CPU, and that is probably the reason why you are not seeing the issue, since you're running on GPU.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

A good strategy here is for me to update again and try to get LLaVA to work, since we know what has changed for it in the code and can compare it to the stable version.

Again, I have to say it's probably not worth your time and energy; my laptop is old, and it could be that something is wrong with my laptop, not necessarily ollama. But I am willing to stay at it until we get it to work, as long as you're interested.

I will update again and check the cohesion thing you mentioned earlier, but for now I have to take my father to the doctor. I'll be back in a couple of hours.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I am back; my brother took my father to the doctor. I tried to install using the same command as earlier, "curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.13-rc2 sh", but it won't install. Have you rolled it back to work on it?


@rick-github commented on GitHub (Mar 3, 2025):

Only the most recent rc is made available; that's currently 0.5.13-rc5.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I am downloading it, and we'll see this through.

If the issue is in LLaVA as well as granite-vision, this means that the issue is not model-specific; it's about the way ollama runs vision models on CPU.

Does that fit with the model cohesion theory we are investigating?

Hopefully the rc5 version will already have solved the issue.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Nope, I downloaded the rc5 version, and it's the same problem: up to 13 minutes for LLaVA and it won't produce output.

Can you run it on CPU on your machine to rule that out? It's probably a CPU issue that goes under the radar; developers usually have strong machines for LLMs, which means they almost always use a GPU.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

> 0.5.13-rc2 also shows no problems using the above script.

I just saw that script; I didn't notice it before. Did you run it on GPU or CPU?


@rick-github commented on GitHub (Mar 3, 2025):

GPU for the above runs. Re-ran the test, CPU only, with 0.5.13-rc2, 0.5.13-rc4, 0.5.13-rc5. No issues other than longer processing time.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I tried limiting num_predict but nothing changed; it loads forever just the same.

Maybe I didn't do it right.

import subprocess
import base64
import requests
import json

models = ["llava:7b"]
image_path = "/home/abdelrahman/Pictures/Screenshots/p.png"  # Make sure puppy.jpg is in the same directory or provide full path
api_url = "http://localhost:11434/api/chat"

try:
    with open(image_path, "rb") as image_file:
        image_data = image_file.read()
        base64_image = base64.b64encode(image_data).decode('utf-8')  # Encode to base64 and decode to string
except FileNotFoundError:
    print(f"Error: {image_path} not found. Please make sure the image file exists in the same directory or provide the correct path.")
    exit(1)

for model in models:
    print(f"{model}: ", end="")  # Print model name without newline

    payload = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": "Describe this image.",
                "images": [base64_image]
            }
        ],
        "stream": False,
        "options": {"num_predict": 5}
    }

    try:
        headers = {'Content-Type': 'application/json'}
        response = requests.post(api_url, headers=headers, data=json.dumps(payload))
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        data = response.json()

        if "message" in data and "content" in data["message"]:
            print(data["message"]["content"])
        else:
            print("Error: 'message.content' not found in API response.")

    except requests.exceptions.RequestException as e:
        print(f"Error communicating with the API for model {model}: {e}")
    except json.JSONDecodeError:
        print(f"Error decoding JSON response from API for model {model}.")

@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Is there a way I can list the options values to confirm that it received the change?


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

For the script above, I used Gemini to convert the code from Bash to Python and then added the option change manually.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

In the past I had an issue with moondream running on ollama, and I think this is an extension of that.

Moondream didn't load forever, but it gave an error that I can't remember now.

Ironically, when I pip install moondream and use the code they provide on their website, it runs normally. So it is definitely something related to ollama and how the model is handled.

Also, if it runs on your machine, then it probably has something to do with my old laptop and the different Ubuntu kernels corresponding to the older machine.

I am talking to you from a 2015 laptop. It's not ancient, but it's still a decade old.


@rick-github commented on GitHub (Mar 3, 2025):

,"options":{"num_predict":5}

This is correct.

> I am talking to you from a 2015 laptop. It's not ancient, but it's still a decade old.

I went back and had a look at your log.

Mar 03 07:00:10 box ollama[2691]: time=2025-03-03T07:00:10.007-05:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=2

Your CPU has no vector extensions. It's not that the model is not generating output, it's just that your CPU is not suited for the matrix operations that LLM inference uses.

Try this instead of the script:

ollama run llava:7b Describe this image. /home/abdelrahman/Pictures/Screenshots/p.png

There will be a pause (perhaps several minutes) as the image is processed, then the model will start to generate tokens. Since this is running in streaming mode, you will see the tokens as they are generated, rather than waiting for the complete output as with the script.
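The same streaming behaviour can also be seen through the API; a rough sketch, reusing the image path from earlier (the -w0 flag is just to keep the base64 output on one line so the JSON stays valid):

```sh
# With streaming (the default for /api/generate), the server returns one JSON object
# per generated chunk instead of a single response at the end.
curl -N http://localhost:11434/api/generate -d '{
  "model": "llava:7b",
  "prompt": "Describe this image.",
  "images": ["'"$(base64 -w0 /home/abdelrahman/Pictures/Screenshots/p.png)"'"]
}'
```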


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I ran the command, but I still don't understand why this particular version of ollama isn't running LLaVA when the previous versions run it. Maybe the inference method has been changed to suit the new models, on the assumption that it should also work for the old models?


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

11 minutes and nothing. On version 0.5.12 it produced output after about 4 minutes. I will leave it until it's been 20 minutes.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I will redownload the 0.5.12 version, run the command, and give you the server log again.


@rick-github commented on GitHub (Mar 3, 2025):

This version of ollama does run llava:7b. What could be happening is that the right CPU backend is not being selected. That would explain why your log shows no vector extensions and runs much slower than 0.5.12. If you rollback to 0.5.12 and examine the logs after running the model, what does the line with msg=system info= show?


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

server_log_ollama_2.txt (https://github.com/user-attachments/files/19056683/server_log_ollama_2.txt)


@rick-github commented on GitHub (Mar 3, 2025):

0.5.12:

Mar 03 13:43:06 box ollama[23369]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 03 13:43:06 box ollama[23369]: time=2025-03-03T13:43:06.441-05:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=2

0.5.13-rc5

Mar 03 12:40:32 box ollama[2601]: load_backend: loaded CPU backend from /usr/local/lib/ollama/libggml-cpu-haswell.so
Mar 03 12:40:32 box ollama[2601]: time=2025-03-03T12:40:32.681-05:00 level=INFO source=runner.go:934 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=2

Looks like a build issue. The same CPU backend has been selected in both cases but 0.5.13-rc5 has no vector extensions. https://github.com/ollama/ollama/pull/9425 was merged a couple of days ago for a similar issue. How sure are you that you have completely deleted the previous ollama versions? What's the output of:

ls -l $(dirname $(dirname $(command -v ollama)))/lib/ollama
ls -l /usr/local/lib/ollama/libggml-cpu-haswell.so

@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

abdelrahman@box:~$ ls -l $(dirname $(dirname $(command -v ollama)))/lib/ollama
total 3092
drwxr-xr-x 2 root root 4096 Feb 28 19:35 cuda_v11
drwxr-xr-x 2 root root 4096 Feb 28 19:35 cuda_v12
-rwxr-xr-x 1 root root 587424 Feb 28 19:19 libggml-base.so
-rwxr-xr-x 1 root root 470984 Feb 28 19:19 libggml-cpu-alderlake.so
-rwxr-xr-x 1 root root 470984 Feb 28 19:19 libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 573384 Feb 28 19:19 libggml-cpu-icelake.so
-rwxr-xr-x 1 root root 479176 Feb 28 19:19 libggml-cpu-sandybridge.so
-rwxr-xr-x 1 root root 573384 Feb 28 19:19 libggml-cpu-skylakex.so

####################################################################

abdelrahman@box:~$ ls -l /usr/local/lib/ollama/libggml-cpu-haswell.so
-rwxr-xr-x 1 root root 466768 Feb 23 22:20 /usr/local/lib/ollama/libggml-cpu-haswell.so


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

But this is with 0.5.12 installed and the 0.5.13-rc2 client installed:

abdelrahman@box:~$ ollama --version
ollama version is 0.5.12
Warning: client version is 0.5.13-rc2


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I will try to uninstall and install again.
I did that before, but I was afraid it would remove all the models, so I didn't follow the instructions precisely. I will try again without removing the models, but following all the other instructions, skipping only the ones under removing the models in the main ollama documentation:
https://github.com/ollama/ollama/blob/main/docs/linux.md


@rick-github commented on GitHub (Mar 3, 2025):

What's the output of

type ollama
command -v ollama
which ollama
ls -l $(command -v ollama)

@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Hang on, I already uninstalled it and am installing the new version, 0.5.13-rc6, again. I ran the following commands to uninstall:

sudo systemctl stop ollama
sudo systemctl disable ollama
sudo rm /etc/systemd/system/ollama.service
sudo rm $(which ollama)
sudo rm -rf /usr/local/lib/ollama

I didn't run the other commands:
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama

The install is at 64% now.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

abdelrahman@box:~$ type ollama
ollama is /usr/local/bin/ollama

abdelrahman@box:~$ command -v ollama
/usr/local/bin/ollama

abdelrahman@box:~$ which ollama
/usr/local/bin/ollama

abdelrahman@box:~$ ls -l $(command -v ollama)
-rwxr-xr-x 1 root root 31575552 Mar 3 04:27 /usr/local/bin/ollama

abdelrahman@box:~$ ollama --version
ollama version is 0.5.13-rc6

I also started another test after installing the new 0.5.13-rc6; I will tell you the results.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Guess what, it worked.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Sorry for wasting all that time and effort.

So, in conclusion, all I really needed to do was uninstall ollama and then install it again using the commands above.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Thanks for your help, and sorry again.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Hey, before you go: I tested it with LLaVA like I told you above, but now that I am testing granite-vision it was back to 11 gigabytes, because the change I made with

sudo nano /etc/systemd/system/ollama.service

(adding Environment="OLLAMA_NUM_PARALLEL=1" under the [Service] section) reverted after the update, like you said. But I redid the steps above, so now it doesn't take 11 gigabytes.

It's been about 8 minutes now and there is no output from granite-vision.

ollama ps shows the model taking 4.7 gigabytes in RAM.


@rick-github commented on GitHub (Mar 3, 2025):

granite3.2-vision is much more verbose, so it will take longer to generate a response.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Yeah, you are right, it just started producing output. It took about 15 minutes.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

But isn't it weird that a 2-billion-parameter model takes just as much RAM as a 7-billion-parameter model, and also 3 times as long to produce output?


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

I guess that is an issue with the model, not with ollama, right?


@rick-github commented on GitHub (Mar 3, 2025):

The default context window for most models is 2048 tokens. The default for granite-3.2-vision is 16384 tokens, so it needs 8 times more VRAM for the context buffer than most other models. It was probably configured this way precisely because it is more verbose than other models.
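This also points at one way to address the earlier question about limiting a change to a single model: num_ctx is a per-model Modelfile parameter, so a smaller-context variant can be built for just this model. A minimal sketch, where the variant name and the num_ctx value are illustrative (granite's 16384 default was presumably chosen deliberately, so lowering it trades capability for memory):

```sh
# Build a variant of the model with a smaller context window.
cat > Modelfile <<'EOF'
FROM granite3.2-vision:2b-q4_K_M
PARAMETER num_ctx 2048
EOF
ollama create granite-vision-small-ctx -f Modelfile
ollama run granite-vision-small-ctx
```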


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Thanks man, everything is running the way it should.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Hey, more good news: the uninstall and reinstall also fixed the moondream model. Now it's running on ollama. So if someone else is complaining about errors, uninstalling and reinstalling may fix it.


@Abdulrahman392011 commented on GitHub (Mar 3, 2025):

Image: https://github.com/user-attachments/assets/3aad5fb6-230b-4449-97ea-a487f3ddacac


@Abdulrahman392011 commented on GitHub (Mar 4, 2025):

Hey, I have been reading a bit online about ollama's parallel setting, and I was wondering: why does ollama reserve the memory from the start, instead of waiting until there is another parallel request, increasing the parallel number to two, and then returning it back to one after that request is done (for example)?

I am no expert, but it seems like the logical thing to do. It shouldn't be all that hard to implement, and it would decrease the memory footprint of ollama, allowing the system to use that memory for something else while ollama is running in the background.

Kinda like what a microstat is to temperature.


@rick-github commented on GitHub (Mar 4, 2025):

If there's not enough VRAM (because the current completion instance has allocated temporary VRAM) the runner will crash.


@Abdulrahman392011 commented on GitHub (Mar 4, 2025):

clarify a bit more, I don't get it.

you're saying that it's dangerous to set the parallel number to 1 when loading the model, and if another request is then given, it will simply crash instead of increasing the parallel number to 2.

if so, then why doesn't ollama make some sort of handler that handles the requests, and if another request is added, instead of directly giving it to the runner and crashing it, waits until the runner is done with the request it has and then modifies the runner, even if that means loading the model again.

then handler version 2 would be able to modify the runner configuration without reloading the model from scratch.


@Abdulrahman392011 commented on GitHub (Mar 4, 2025):

pardon my English, it is not my first language.


@Abdulrahman392011 commented on GitHub (Mar 4, 2025):

another thing to consider here is that most people who use ollama are using it locally for personal use, and that translates to usually one request at a time. so for most use cases, starting with a max parallel number of one is all that most users will need about 90% of the time, and for the other 10% there would be no harm in reloading the model for that use case and then reverting back to normal.

and again, eventually a way will be found to reconfigure the runner without reloading the model.


@rick-github commented on GitHub (Mar 4, 2025):

The ollama request handler does request queuing when the runners are busy. But what's the advantage of dynamically allocating context buffers over just allocating when the runner starts? It adds unnecessary overhead. The runners have to start dealing with memory fragmentation, over-commit, OOM scenarios, competition from other VRAM users, etc.


@Abdulrahman392011 commented on GitHub (Mar 4, 2025):

I put your comment into Gemini and it's favoring my opinion. take a look:

Dynamic allocation of context buffers in request handlers, like in Ollama, offers several advantages over allocating them statically when the runner starts, despite the overhead you mentioned. Let's explore these benefits:

Resource Optimization and Efficiency:

Memory Savings: Dynamic allocation allows context buffers to be created only when needed, and only for the size required by the incoming request. If runners were to pre-allocate context buffers statically, they would need to reserve memory for the maximum possible context size for every runner, regardless of whether those resources are actually used. This can lead to significant memory waste, especially when dealing with numerous runners and variable request sizes. Dynamic allocation ensures that memory is used more efficiently, allocating only what's necessary at the moment.
Scalability and Concurrency: In a system designed to handle concurrent requests, dynamic allocation is crucial for scalability. It enables the system to efficiently manage resources as the number of concurrent requests fluctuates. Static allocation would limit the system's ability to adapt to varying workloads. If the number of concurrent requests exceeds the pre-allocated buffers, the system would become unresponsive or require complex mechanisms to handle overflow. Dynamic allocation allows the system to scale more gracefully, accommodating more requests within the available resources.
Flexibility in Context Size: Large language models (LLMs) often support different context lengths. Dynamic allocation allows the system to adapt to requests with varying context sizes. If a runner starts with a static buffer size, it might be inefficient for requests with smaller contexts (wasting memory) and insufficient for requests with larger contexts (leading to errors or rejection). Dynamic allocation ensures that each request gets a context buffer tailored to its specific needs.

Addressing Memory Fragmentation and Over-Commit:

While you are right to point out the challenges of memory fragmentation, over-commit, and OOM scenarios with dynamic allocation, these issues are often managed through memory management techniques and are considered acceptable trade-offs for the benefits of resource optimization and scalability.

Memory Management Techniques: Modern memory allocators are designed to mitigate fragmentation. Techniques like buddy systems, slab allocation, and compaction algorithms are used to manage dynamically allocated memory and reduce fragmentation.
Over-commit Control: Operating systems and memory management libraries provide mechanisms to control memory over-commit. Ollama and the underlying system can be configured to limit over-commit or use strategies like "oom-kill" to handle out-of-memory situations gracefully if they arise.
VRAM Management: For GPUs, VRAM management is critical. Frameworks and libraries used by Ollama often incorporate VRAM pooling and caching mechanisms to reuse allocated memory and reduce the overhead of frequent allocations and deallocations. Dynamic allocation can be combined with these techniques to optimize VRAM usage.

Why Dynamic Allocation Might Be Preferred Despite Overhead:

The "unnecessary overhead" you mentioned in dynamic allocation (allocation/deallocation operations) is generally considered to be less significant than the overhead and limitations associated with static allocation in scenarios where resource efficiency, scalability, and flexibility are paramount.

Trade-off between Overhead and Efficiency: The small overhead of dynamic allocation is usually outweighed by the gains in memory efficiency and the ability to handle variable workloads effectively.
Complexity of Static Allocation: Implementing static allocation that is both efficient and flexible can be complex. It might require sophisticated pre-calculation of resource needs, rigid limits on request characteristics, and potentially complex mechanisms to handle resource exhaustion.

In summary, while dynamic context buffer allocation introduces complexities related to memory management, it is often the preferred approach in request handlers like Ollama because it provides significant advantages in terms of resource optimization, scalability, and flexibility in handling varying workloads and context sizes. Modern memory management techniques are employed to mitigate the potential downsides of dynamic allocation, making it a practical and efficient choice for such systems.


@rick-github commented on GitHub (Mar 5, 2025):

LLMs do not understand complex systems.

Gemini is talking about slab allocators, scalability and resource efficiency. This is all fine and good in a general computing environment like your desktop computer. This is not the same environment as a GPU. A GPU has a single purpose - perform matrix calculations on a bunch of numbers. There's nothing to be gained from allocating gigabytes of memory and then freeing it 20 seconds later. It's just adding overhead to the completion.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

I see.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

about the trick we did with reducing the max parallel number in the ollama systemd file: that reduced the size of the model from 11 gigabytes to 5. most graphics cards out there don't come with 16 gigabytes of VRAM, so it will not just be me struggling. we are still talking about a 2 billion parameter model.

also, why do you say that it would be freed after 20 seconds? you are looking at the wrong side of the problem. it is not needed 90% of the time the user is using ollama.

so you will find a lot of people setting their settings to max parallel 1, and whenever they try to run anything in parallel it will be queued, which is fine.

most people don't even know how to set the max parallel number to 1. they will just leave it at 4 and then find that the system can't run a 2.5 gigabyte model, because they have 8 gigabytes of VRAM and need at least 11 to run the 2 billion parameter model.

what I am saying is: why reserve the memory if it's not needed? I set it to 1 and the model ran just fine with 5 gigabytes. so why would anyone reserve 11 gigabytes for that model? for what? the possibility of another request, while we know that ollama is used locally by one user. so one request is all we need, and if another request comes in, the handler should consider increasing the max parallel number if there is enough VRAM; if not, it should be queued.

also, it could be made so that the max parallel number doesn't return to 1 until a few minutes have passed instead of 20 seconds. that way the overhead problem won't be as prominent if the user is sending multiple requests at a regular interval.


@rick-github commented on GitHub (Mar 5, 2025):

ollama only uses 4 as the default for OLLAMA_NUM_PARALLEL if there's enough resources to do so. If it thinks there's not enough, it falls back to 1. So it seems that there is a problem with the logic there, maybe because this is a vision model which has an extra set of weights. I'll have a look.
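
A quick way to see which value the scheduler actually picked for a given load is to pull the runner command lines out of the server log (a sketch; the journalctl path assumes the systemd install discussed earlier in this thread):

```console
$ journalctl -u ollama | grep 'starting llama server' | grep -o 'parallel [0-9]*'
```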


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

could it be that I have swap memory enabled and it thinks of it as regular RAM?


memory.available="[5.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.4 GiB" memory.required.partial="0 B" memory.required.kv="1.5 GiB" memory.required.allocations="[5.4 GiB]" memory.weights.total="5.6 GiB" memory.weights.repeating="5.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
Mar 04 22:01:21 box ollama[2541]: time=2025-03-04T22:01:21.247-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-3953920e5adb604f45914bd4c30e2f1df1fde7456a8e0471b2086577da1d46fd --ctx-size 8192 --batch-size 512 --threads 2 --no-mmap --parallel 4 --port 35331"


I used it with a normal model (nous-hermes2:10.7b-solar-q2_K); it didn't reduce the parallel number.


@rick-github commented on GitHub (Mar 5, 2025):

No, what I think is happening is the fallback to 1 only kicks in if all of the model will fit in the available VRAM. That is, say the model takes 11G at parallel=4 and 5G at parallel=1 and you have 4.9G free on the GPU, ollama gives up because it can't fit everything on the GPU and goes with parallel=4.

When you have the model loaded with parallel=1, what's the output of ollama ps?


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

hang on, I just came back, the cats in the garden were fighting, and I had to eat something for the fast tomorrow; it's the second day of Ramadan.

I changed it and will restart the laptop. now it takes 6.9 GB.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

nous-hermes2:10.7b-solar-q2_K 2931d5c846b2 5.2 GB 100% CPU


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

memory.available="[5.4 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.8 GiB" memory.required.partial="0 B" memory.required.kv="384.0 MiB" memory.required.allocations="[4.8 GiB]" memory.weights.total="4.5 GiB" memory.weights.repeating="4.4 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="164.0 MiB" memory.graph.partial="181.0 MiB"
Mar 04 22:32:53 box ollama[2497]: time=2025-03-04T22:32:53.761-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-3953920e5adb604f45914bd4c30e2f1df1fde7456a8e0471b2086577da1d46fd --ctx-size 2048 --batch-size 512 --threads 2 --no-mmap --parallel 1 --port 42335"


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

memory.available="[5.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="6.4 GiB" memory.required.partial="0 B" memory.required.kv="1.5 GiB" memory.required.allocations="[5.5 GiB]" memory.weights.total="5.6 GiB" memory.weights.repeating="5.5 GiB" memory.weights.nonrepeating="102.6 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="585.0 MiB"
Mar 04 22:27:39 box ollama[2541]: time=2025-03-04T22:27:39.313-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-3953920e5adb604f45914bd4c30e2f1df1fde7456a8e0471b2086577da1d46fd --ctx-size 8192 --batch-size 512 --threads 2 --no-mmap --parallel 4 --port 43149"


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

I am running everything on CPU. the GPU in the laptop is an old nvidia with 4 gigabytes of VRAM, but ollama doesn't use it because it has a compute capability of 3.4, I think, or something like that. it's old.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

> No, what I think is happening is the fallback to 1 only kicks in if all of the model will fit in the available VRAM. That is, say the model takes 11G at parallel=4 and 5G at parallel=1 and you have 4.9G free on the GPU, ollama gives up because it can't fit everything on the GPU and goes with parallel=4.
>
> When you have the model loaded with parallel=1, what's the output of `ollama ps`?

so what you're saying is that if the model will fit in memory with parallel 1, it will do that automatically, but if it's gonna end up in swap memory anyway, it goes with parallel 4.

I will try to find a model that is big enough to fit in the RAM but not so small that it fits with parallel 4.

tricky!


@rick-github commented on GitHub (Mar 5, 2025):

> so what you're saying is that if the model will fit in the memory with parallel 1 it will do that automatically but if it's gonna end up on swap memory anyway it goes with parallel 4.

That's my guess.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

memory.available="[4.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="4.4 GiB" memory.required.partial="0 B" memory.required.kv="1.2 GiB" memory.required.allocations="[4.4 GiB]" memory.weights.total="2.6 GiB" memory.weights.repeating="2.5 GiB" memory.weights.nonrepeating="78.8 MiB" memory.graph.full="853.3 MiB" memory.graph.partial="853.3 MiB" projector.weights="851.2 MiB" projector.graph="0 B"
Mar 03 07:58:33 box ollama[2578]: time=2025-03-03T07:58:33.194-05:00 level=INFO source=server.go:380 msg="starting llama server" cmd="/usr/local/bin/ollama runner --model /usr/share/ollama/.ollama/models/blobs/sha256-1aefcd9a8a15091b670951963b5f8a7e6653bb1350345e9621e179685ac9bc5f --ctx-size 16384 --batch-size 512 --mmproj /usr/share/ollama/.ollama/models/blobs/sha256-4d464be24899cf8dc1862945432e0cef4366c4181fa38b14754cc9279b727608 --threads 2 --no-mmap --parallel 1 --port 33301"


I was looking in the server log, trying to go back in time to before I set the max parallel to 1, to see the pattern of it setting max parallel to 1 on its own, and I think this is an example that confirms what you're saying.


@rick-github commented on GitHub (Mar 5, 2025):

Yeah, it calculated that at parallel=1 it could use 4.4G of the 4.9G available, so it went with that instead of parallel=4 which would have resulted in spilling the model to system RAM.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

no man, I think it's not that. I was trying to replicate the same principle to get it to automatically revert to 1 parallel and I couldn't do it. however, I looked again in the server log and found that the date of the first parallel-1 log was the 3rd of March, which was basically when we started talking.

I am not sure if that is enough to say there is an issue with ollama, or with my system and install, or simply that I wasn't exposed to the right situation to trigger the automatic change to parallel 1.

what do you think?


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

I mean, have you tested this feature (reverting to 1 parallel when needed) yourself? do you know for a fact that it is functioning the way it should?


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

It's usually not a wise idea to doubt source code that actively serves thousands of users, but mistakes and bugs can happen from one release to the other.

anyway, ollama just released the 0.5.13 version, and people are gonna use the granite-vision model, so they should hit the same issue that I had if the problem is in the code and not in my device and system.

so time will tell.


@rick-github commented on GitHub (Mar 5, 2025):

> do you know for a fact that it is functioning the way it should?

It appears to function as surmised.

```sh
$ ollama run nous-hermes2:10.7b-solar-q2_K
>>> /set parameter num_ctx 4096
Set parameter 'num_ctx' to '4096'
>>> hello
Hello! How can I assist you today? I'm here to help with any questions or tasks you may need assistance with.

>>> /set parameter num_ctx 8192
Set parameter 'num_ctx' to '8192'
>>> hello
Hello again! It's great to see you. Remember that I'm always here to help and provide the support you require. If you have any questions or need assistance, please don't hesitate to ask.
```

```
$ docker compose logs ollama | sed -ne 's/.*--ctx/--ctx/p'
--ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --verbose --threads 8 --parallel 4 --port 40797"
--ctx-size 16384 --batch-size 512 --n-gpu-layers 49 --verbose --threads 8 --parallel 4 --port 46207"
--ctx-size 8192 --batch-size 512 --n-gpu-layers 49 --verbose --threads 8 --parallel 1 --port 42009"
```

@rick-github commented on GitHub (Mar 5, 2025):

```console
$ ollama run granite3.2-vision:2b-q4_K_M
>>> /set parameter num_ctx 20000
Set parameter 'num_ctx' to '20000'
>>> hello

Hello! How can I assist you today?
```

```console
$ docker compose logs ollama | sed -ne 's/--mmproj [^ ]* //' -e 's/.*--ctx/--ctx/p'
--ctx-size 65536 --batch-size 512 --n-gpu-layers 41 --verbose --threads 8 --parallel 4 --port 38967"
--ctx-size 20000 --batch-size 512 --n-gpu-layers 41 --verbose --threads 8 --parallel 1 --port 40289"
```

@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

could it be that this only works on GPU?

try it again while running it on CPU.


@rick-github commented on GitHub (Mar 5, 2025):

Why? The change to the default depends on free VRAM.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

dude lol. I am running on CPU. the laptop I have has an nvidia card, but it's an old compute capability and ollama doesn't use it.

that being said, ollama does say that there is an nvidia card when I install ollama. so it could be something like ollama being confused regarding the card and the CPU RAM.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

I will try to do the same experiment you did, increasing the ctx-size and monitoring the parallel num.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

so I tried to get it to work, but no. I am not using docker, and what I do is set the model ctx-size like you do ("/set parameter num_ctx 16384"), but the server log doesn't have any record of the change. I repeated it multiple times, but the server log only has one entry of the model being loaded, with the initial value of 8192.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

do you think using docker might help with this? I don't have docker installed. can you check your server log to see if there is any change when you change it with the previous method? maybe that's normal and the server log simply doesn't reflect the change even though it's happening!


@rick-github commented on GitHub (Mar 5, 2025):

The parallel adjustment is only done for VRAM. Since you are using system RAM, ollama will use the default of 4 for OLLAMA_NUM_PARALLEL.
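
So on a CPU-only box the automatic fallback never applies, and pinning `OLLAMA_NUM_PARALLEL=1` (as done earlier in the thread) is the way to keep the footprint down. The per-request context can also be capped through the API's `options` field; a sketch using `num_ctx`, a documented request option, where the model name and value are only examples:

```console
$ curl http://localhost:11434/api/generate -d '{
    "model": "granite3.2-vision:2b",
    "prompt": "hello",
    "options": { "num_ctx": 4096 }
  }'
```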


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

I knew I wasn't crazy, lol.


@Abdulrahman392011 commented on GitHub (Mar 5, 2025):

at least very few people will have this issue with the model. most people use gpu. no one will use cpu and wait around for 15 minutes for an image description.
