[GH-ISSUE #6741] Llama 3.1 70b 128k context not fitting 96Gb #30008

Closed
opened 2026-04-22 09:24:38 -05:00 by GiteaMirror · 2 comments

Originally created by @dmatora on GitHub (Sep 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/6741

What is the issue?

Not only does it not fit in 96 GB (only 10 of 81 layers are offloaded), but processing an actual ~128k request crashes with CUDA error: out of memory on 160 GB (with all layers offloaded).

As mentioned here https://github.com/ollama/ollama/issues/6279#issuecomment-2342546437_
this is obviously a bug
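For context, a back-of-the-envelope estimate of the KV cache alone at this context length can be sketched as below. The architecture values (80 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Llama 3.1 70B configuration, and the per-block byte counts are the ggml layouts for f16, q8_0 and q4_0. The helper name `kv_cache_bytes` is hypothetical. This is a rough sketch, not how Ollama itself accounts for memory.

```shell
#!/bin/sh
# Rough KV-cache size estimate. Architecture values are assumptions
# based on the published Llama 3.1 70B configuration.
LAYERS=80        # transformer layers
KV_HEADS=8       # grouped-query attention KV heads
HEAD_DIM=128     # dimension per head

# kv_cache_bytes CTX BLOCK_BYTES  (hypothetical helper)
# BLOCK_BYTES = bytes used to store 32 cache elements:
#   f16 -> 64, q8_0 -> 34, q4_0 -> 18
kv_cache_bytes() {
    ctx=$1
    block_bytes=$2
    # K and V each hold LAYERS * KV_HEADS * HEAD_DIM elements per token
    elems_per_token=$((2 * LAYERS * KV_HEADS * HEAD_DIM))
    echo $((elems_per_token / 32 * block_bytes * ctx))
}

# Print the estimate for a 131072-token context at each cache type
for spec in f16:64 q8_0:34 q4_0:18; do
    name=${spec%%:*}
    bytes=$(kv_cache_bytes 131072 "${spec##*:}")
    echo "$name: $((bytes / 1048576)) MiB"
done
```

Under these assumptions an f16 cache for one 131072-token sequence comes to about 40 GiB, before counting model weights, compute buffers, or any additional parallel sequences.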

OS

Linux

GPU

Nvidia

CPU

Intel

Ollama version

0.3.10

GiteaMirror added the nvidia, memory, bug labels 2026-04-22 09:24:39 -05:00

@dmatora commented on GitHub (Sep 11, 2024):

ollama show --modelfile llama3.1:70b > Modelfile3.1-70b
echo 'PARAMETER num_ctx 131072' >> Modelfile3.1-70b
ollama create llama3.1:70b-128k -f Modelfile3.1-70b

Optionally, add echo 'PARAMETER num_gpu 81' >> Modelfile3.1-70b to force offloading of all 81 layers on 4x4090.


@dmatora commented on GitHub (Mar 14, 2025):

The issue was that Ollama didn't support context quantisation, which is solved now.
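For readers landing here, a minimal sketch of turning on context (KV cache) quantisation in newer Ollama releases follows. It assumes the server reads the `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` environment variables described in the Ollama FAQ. The choice of `q8_0` is only an example, and the comment above does not say which setting was actually used.

```shell
# Assumption: the Ollama server picks these up from its environment at startup.
# Flash attention needs to be enabled for a quantised KV cache to take effect.
export OLLAMA_FLASH_ATTENTION=1

# Cache element type: f16 (default), q8_0, or q4_0.
export OLLAMA_KV_CACHE_TYPE=q8_0

# Then start (or restart) the server from this environment, e.g.:
#   ollama serve
```

If the server runs as a systemd service, the same variables would need to be set in the service environment rather than in an interactive shell.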

Reference: github-starred/ollama#30008