[GH-ISSUE #10800] When using gemma3:4b, the reply sometimes costs a lot of time and the answer is weird #32852

Closed
opened 2026-04-22 14:43:50 -05:00 by GiteaMirror · 1 comment

Originally created by @ywythu on GitHub (May 21, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/10800

What is the issue?

![Image](https://github.com/user-attachments/assets/eee88b89-be10-417d-aa28-00c66f52b91c)

The output is something like this: it took 1479s for a single question ("describe the image"), and the answer repeats itself many times. What should I do? Is there any env param I can add to limit the run time of a task, e.g. if a task takes more than 10s, it will be killed?
By the way, this Docker container is on an AWS L4 server. I deployed 3 Ollama instances with gemma3:4b on it; the VRAM is 15G. It worked well until the task above happened, then the GPU went to 100% and the 3 Ollama instances stopped working entirely.

Relevant log output

```shell
```
OS

Linux

GPU

Nvidia

CPU

AMD

Ollama version

0.6.8

GiteaMirror added the bug label 2026-04-22 14:43:50 -05:00

@rick-github commented on GitHub (May 21, 2025):

Set [`num_predict`](https://github.com/ollama/ollama/blob/main/docs/api.md#request-8:~:text=seed%22%3A%2042%2C%0A%20%20%20%20%22-,num_predict,-%22%3A%20100%2C%0A%20%20%20%20%22top_k) in the API call; it will stop generating tokens when it reaches the limit.
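For reference, a minimal sketch of such a request against a default local install (the model name, prompt, and limit value of 512 are illustrative, not from the original report):

```shell
# Hypothetical request to a local Ollama server on the default port.
# options.num_predict caps the number of generated tokens, so a
# runaway repetition loop stops at the limit instead of running
# for 1479s.
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3:4b",
  "prompt": "describe the image",
  "images": ["<base64-encoded image>"],
  "options": {
    "num_predict": 512
  }
}'
```

Note this limits the response length, not wall-clock time; there is no built-in per-request timeout env var.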

> I deployed 3 ollama with gemma3:4b

Do you mean you run 3 ollama servers, each with a copy of gemma3:4b? Why not set OLLAMA_NUM_PARALLEL=3 and just use one server?
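A sketch of the single-server alternative (container name, port mapping, and volume are assumptions about a typical Docker deployment):

```shell
# Illustrative single-server deployment. OLLAMA_NUM_PARALLEL=3 lets
# one server handle 3 concurrent requests against a single loaded
# copy of gemma3:4b, instead of fitting three separate model copies
# into 15G of VRAM.
docker run -d --gpus=all \
  -e OLLAMA_NUM_PARALLEL=3 \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama ollama/ollama
```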


Reference: github-starred/ollama#32852