[GH-ISSUE #6436] Reloading the Same Model During Consecutive Accesses to the ollama API Within OLLAMA_KEEP_ALIVE Duration #29806

Closed
opened 2026-04-22 09:03:52 -05:00 by GiteaMirror · 2 comments

Originally created by @DavidLetGo on GitHub (Aug 20, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6436

What is the issue?

Issue Summary:

Each unnecessary reload of the same model costs approximately 10 seconds, significantly degrading performance.

Issue Found in Ollama Version 0.3.4:

While debugging, I added debug logging (slog.Debug) to the source code, rebuilt, and ran the application. This surfaced some useful findings about the issue.

Possible Root Cause:

It appears that the normalization step is not working correctly. After a previous access to the Ollama API, runner.numParallel unexpectedly resets to 1, even though the OLLAMA_NUM_PARALLEL environment variable is set to 3.
// Normalize the NumCtx for parallelism
optsExisting.NumCtx = optsExisting.NumCtx / runner.numParallel
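
For context: with OLLAMA_NUM_PARALLEL=3 and num_ctx=8192, the runner is loaded with NumCtx = 3 × 8192 = 24576, so the division above should recover the per-request 8192. Below is a minimal, hypothetical sketch of that comparison path (simplified names, not the actual sched.go code, which compares full option structs) showing why a reset numParallel forces a reload:

```go
package main

import "fmt"

// Simplified sketch of the scheduler's reload check, based on the log
// output below. Names and structure are illustrative, not from sched.go.
type runner struct {
	numCtx      int // context size the runner was loaded with (per-request ctx * numParallel)
	numParallel int // parallel slots the runner was started with
}

// needsReload mirrors the normalization above: the existing NumCtx is
// divided by numParallel before being compared to the new request.
func needsReload(r runner, requestedCtx int) bool {
	normalized := r.numCtx / r.numParallel
	return normalized != requestedCtx
}

func main() {
	// The bug: numParallel has been reset to 1 after the previous request.
	buggy := runner{numCtx: 24576, numParallel: 1}
	fmt.Println(needsReload(buggy, 8192)) // true  -> 24576/1 = 24576 != 8192, model reloads

	// With numParallel preserved at 3, the normalization matches and no reload occurs.
	fixed := runner{numCtx: 24576, numParallel: 3}
	fmt.Println(needsReload(fixed, 8192)) // false -> 24576/3 = 8192 == 8192
}
```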

Part of log:

  • time=2024-08-20T01:55:48.411-04:00 level=DEBUG source=sched.go:593 msg="Before Normalization" optsExisting.NumCtx=24576 optsExisting.NumCtx=24576 runner.numParallel=1
  • time=2024-08-20T01:55:48.411-04:00 level=DEBUG source=sched.go:596 msg="After Normalization" optsExisting.NumCtx=24576 optsExisting.NumCtx=24576 runner.numParallel=1
    DEBUG [process_single_task] slot data | n_idle_slots=1 n_processing_slots=0 task_id=313 tid="140124911321088" timestamp=1724133348
  • time=2024-08-20T01:55:48.412-04:00 level=DEBUG source=sched.go:616 msg="Debugging paths and options" runner.model.AdapterPaths=[] req.model.AdapterPaths=[] runner.model.ProjectorPaths=[] req.model.ProjectorPaths=[] optsExisting="{NumCtx:24576 NumBatch:512 NumGPU:-1 MainGPU:0 LowVRAM:false F16KV:true LogitsAll:false VocabOnly:false UseMMap:<nil> UseMLock:false NumThread:0}" optsNew="{NumCtx:8192 NumBatch:512 NumGPU:-1 MainGPU:0 LowVRAM:false F16KV:true LogitsAll:false VocabOnly:false UseMMap:<nil> UseMLock:false NumThread:0}"

The complete log:
[debug-ollama-locally.log](https://github.com/user-attachments/files/16670180/debug-ollama-locally.log)

Accesses to Ollama API:
$ curl http://localhost:11434/api/generate -d '{
  "model": "mixtral-8x7b-instruct",
  "prompt": "how is the novel, the wind in the willow?",
  "stream": false,
  "options": {
    "num_ctx": 8192
  }
}'

$ curl http://localhost:11434/api/generate -d '{
  "model": "mixtral-8x7b-instruct",
  "prompt": "How is milk tea?",
  "stream": false,
  "options": {
    "num_ctx": 8192
  }
}'
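
As an illustration (not part of the original report), a small Go harness along these lines can time the two consecutive calls; while the bug is present, the second request pays the ~10 s reload even though it lands well within OLLAMA_KEEP_ALIVE:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"time"
)

// generate sends one non-streaming /api/generate request and returns how
// long it took end to end (hypothetical reproduction harness).
func generate(prompt string) (time.Duration, error) {
	body := fmt.Sprintf(`{
		"model": "mixtral-8x7b-instruct",
		"prompt": %q,
		"stream": false,
		"options": {"num_ctx": 8192}
	}`, prompt)

	start := time.Now()
	resp, err := http.Post("http://localhost:11434/api/generate",
		"application/json", bytes.NewBufferString(body))
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	// Drain the response so the timing covers the full generation.
	if _, err := io.ReadAll(resp.Body); err != nil {
		return 0, err
	}
	return time.Since(start), nil
}

func main() {
	for _, prompt := range []string{
		"how is the novel, the wind in the willow?",
		"How is milk tea?",
	} {
		elapsed, err := generate(prompt)
		if err != nil {
			fmt.Println("request failed:", err)
			return
		}
		fmt.Printf("%-45q took %v\n", prompt, elapsed)
	}
}
```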

The environment variables used:
"OLLAMA_DEBUG": "true",
"OLLAMA_NUM_PARALLEL": "3"
default "OLLAMA_KEEP_ALIVE "

Devices
Only one GPU: GeForce RTX 4090

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.4

GiteaMirror added the bug label 2026-04-22 09:03:52 -05:00

@rick-github commented on GitHub (Aug 20, 2024):

Fixed by https://github.com/ollama/ollama/pull/6402, next release will have this.


@DavidLetGo commented on GitHub (Aug 20, 2024):

> Fixed by #6402, next release will have this.

Thank you.
