[GH-ISSUE #6772] Using the qwen2-7B-q8 model, GPU memory usage can reach 14-15 GB when accessed through the Dify API #50779

Closed
opened 2026-04-28 17:06:53 -05:00 by GiteaMirror · 2 comments

Originally created by @bingbing6 on GitHub (Sep 12, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6772

What is the issue?

When using the qwen2-7B-q8 model through Dify's API, GPU memory usage climbs to about 15 GB, but calling Ollama's API directly uses the normal 9 GB. Dify says they simply call Ollama's API without any extra processing. See my question to them for details:
https://github.com/langgenius/dify/issues/8294

OS

Windows

GPU

Nvidia

CPU

Intel

Ollama version

ollama version is 0.3.10

GiteaMirror added the question label 2026-04-28 17:06:53 -05:00

@rick-github commented on GitHub (Sep 13, 2024):

Dify may be setting a large context window with `num_ctx`. [Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) from your ollama server may help in debugging.
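For context, the context window can be overridden per request through the `options` field of Ollama's generate endpoint, so a client like Dify can change it without any server-side configuration. A minimal sketch of such a request (the model tag and prompt are placeholders):

```python
import requests

# Minimal sketch: num_ctx is passed per request via "options".
# If a client such as Dify sends a much larger value here, ollama allocates a
# correspondingly larger KV cache, which shows up as extra VRAM.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2:7b-instruct-q8_0",  # placeholder model tag
        "prompt": "Hello",
        "stream": False,
        "options": {"num_ctx": 2048},  # API default; Dify may be sending something far larger
    },
)
print(resp.json()["response"])
```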


@rick-github commented on GitHub (Sep 16, 2024):

*(settings screenshot of the Dify model parameters; original image link has expired)*

From this it looks like you are setting the context window (token limit) to 128000. The default value via the API is 2048, which is why the model is taking so much more VRAM.
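As a rough illustration of why this matters: the KV cache grows linearly with `num_ctx`. The sketch below estimates its size, assuming Qwen2-7B's published architecture (28 layers, 4 KV heads of dimension 128 under GQA) and an fp16 cache; the exact amount ollama allocates will differ, but the ratio is the point.

```python
# Back-of-envelope KV-cache estimate. The layer/head figures are assumptions
# taken from Qwen2-7B's published config, not read from ollama's logs.
layers, kv_heads, head_dim, bytes_per_elem = 28, 4, 128, 2  # fp16 cache
per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V

for num_ctx in (2048, 128000):
    print(f"num_ctx={num_ctx:>6}: ~{num_ctx * per_token / 2**30:.2f} GiB KV cache")

# Prints roughly 0.11 GiB for the 2048 default and ~6.8 GiB for 128000,
# which is in the same ballpark as the extra ~6 GB observed through Dify.
```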


Reference: github-starred/ollama#50779