[GH-ISSUE #9519] Cannot Increase num_ctx Beyond 2048 in Ollama #6206

Closed
opened 2026-04-12 17:36:01 -05:00 by GiteaMirror · 2 comments

Originally created by @yana-sklyanchuk on GitHub (Mar 5, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9519

What is the issue?

Hello, Ollama team! 👋

I am running Ollama 0.5.12 on Ubuntu and using the following setup:

Model: llama3.1:70b-instruct-q8_0
GPU: NVIDIA A100 80GB PCIe
API Endpoint: http://localhost:11434/api/generate
Ollama Service Running with Systemd
Environment Variables:

Environment="OLLAMA_KEEP_ALIVE=-1"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_CTX=4096"

Model Info:
Architecture: LLaMA 3.1
Parameters: 70.6B
Default Context Window: 131072 tokens
Embedding Size: 8192
Quantization: Q8_0
License: LLAMA 3.1 COMMUNITY LICENSE AGREEMENT
Issue:
Despite setting num_ctx=4096, my model still uses only a 2048-token context window.
I have verified that OLLAMA_NUM_CTX=4096 appears in the output of systemctl show ollama | grep OLLAMA.
However, when making API requests, the model does not process more than 2048 tokens.

What I Have Tried:
Setting num_ctx=4096 via API:

```sh
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q8_0",
  "prompt": "Long text...",
  "options": {"num_ctx": 4096}
}'
```

Setting OLLAMA_NUM_CTX=4096 in override.conf and restarting Ollama:

```sh
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

Checking if the variable is applied:

```sh
systemctl show ollama | grep OLLAMA
```

Running Ollama manually:

```sh
OLLAMA_NUM_CTX=4096 ollama run llama3.1:70b-instruct-q8_0
```

Testing via Python API:

```python
import requests

requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:70b-instruct-q8_0",
        "messages": [{"role": "user", "content": "Very long text..."}],
        "options": {"num_ctx": 4096}
    }
)
```

🚨 But in every case, prompt_eval_count remains limited to 2048 tokens.
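
(For reference, a minimal sketch of how prompt_eval_count can be checked; it assumes "stream": false so the counters come back in a single JSON object rather than a stream of chunks.)

```python
import requests

# Minimal sketch: non-streaming request so the single JSON response
# includes prompt_eval_count alongside the generated message.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:70b-instruct-q8_0",
        "messages": [{"role": "user", "content": "Very long text..."}],
        "options": {"num_ctx": 4096},
        "stream": False,
    },
)
resp.raise_for_status()
print(resp.json().get("prompt_eval_count"))  # reports 2048 here despite num_ctx=4096
```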

Question:
How can I properly increase the context window beyond 2048 tokens?
Given that the model (llama3.1:70b-instruct-q8_0) has a default context window of 131072 tokens, why is Ollama not allowing me to set num_ctx=4096?
Is there a limitation in Ollama, or do I need to adjust model-specific configurations?

Any guidance would be much appreciated! Thank you in advance. 🙌

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.5.12

GiteaMirror added the bug label 2026-04-12 17:36:01 -05:00

@rick-github commented on GitHub (Mar 5, 2025):

[`OLLAMA_CONTEXT_LENGTH`](https://github.com/ollama/ollama/blob/05a01fdecbf9077613c57874b3f8eb7919f76527/envconfig/config.go#L258), not `OLLAMA_NUM_CTX`.

Setting num_ctx works.

```console
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b-instruct-q8_0",
  "prompt": "'"$(python -c 'print("Long text..." * 2000)')"'",
  "options": {"num_ctx": 4096}
}' | jq 'select(.prompt_eval_count)|.prompt_eval_count'
4096
```

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may give insight into what's happening.
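
For the systemd setup described in this issue, that would mean swapping the variable in the override.conf drop-in, roughly as sketched below (the drop-in path is the usual one created by systemctl edit ollama and is assumed here):

```sh
# /etc/systemd/system/ollama.service.d/override.conf (path assumed)
# OLLAMA_CONTEXT_LENGTH is the variable the server reads; OLLAMA_NUM_CTX is not.
[Service]
Environment="OLLAMA_CONTEXT_LENGTH=4096"
```

followed by the same sudo systemctl daemon-reload and sudo systemctl restart ollama as above. Per-request "options": {"num_ctx": 4096}, as in the curl example, overrides the server default either way.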


@yana-sklyanchuk commented on GitHub (Mar 6, 2025):

@rick-github Thanks for your help!

I had previously tried OLLAMA_CONTEXT_LENGTH=4096 in an earlier version of Ollama, but it didn't work. After updating to Ollama 0.5.12, it now works correctly! 🚀

I also made minor adjustments to the API request and confirmed that prompt_eval_count=4096, meaning the context length is now properly applied.

Thanks again! This should help others facing similar issues. 🙌

Reference: github-starred/ollama#6206