[GH-ISSUE #14269] nemotron-3-nano does not work with 1M context #71352

Closed
opened 2026-05-05 01:17:29 -05:00 by GiteaMirror · 3 comments
Owner

Originally created by @balki on GitHub (Feb 15, 2026).
Original GitHub issue: https://github.com/ollama/ollama/issues/14269

What is the issue?

This model works well with a 512K context, but when I max it out to 1M, it crashes. There is plenty of VRAM in the system (96 GB, AMD Strix Halo).

```
❯ cat Modelfile.nemotron-3-nano-latest
FROM nemotron-3-nano:latest

PARAMETER num_ctx 1048576

❯ ollama create my-nemotron-3-nano-1M-latest -f Modelfile.nemotron-3-nano-latest
gathering model components
using existing layer sha256:a70437c41b3b0b768c48737e15f8160c90f13dc963f5226aabb3a160f708d1ce
using existing layer sha256:bca58c750377d01da1bea95cdbc6f0176f3b58003f6a31c7b069a2746cbea78f
creating new layer sha256:e6fa7be2e8de2bec184c63098c1f1a1904cc1af9e88856b62299ea6bfc244628
writing manifest
success

❯ ollama run my-nemotron-3-nano-1M-latest:latest
Error: 500 Internal Server Error: llama runner process has terminated: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
```
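The assertion says the copied tensor's byte count must fit in a signed 32-bit `int`. A rough sketch of the arithmetic (the per-position row size below is a made-up illustrative value, not taken from the actual model) shows how a power-of-two context can land exactly on the 2^31 boundary while a slightly smaller one does not:

```python
# GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX): a tensor's total byte size
# must fit in a signed 32-bit integer for this copy path.
INT_MAX = 2**31 - 1  # 2147483647

ROW_BYTES = 2048       # hypothetical bytes per context position (illustrative only)
FULL_CTX = 1048576     # 1M context (2**20), as set in the Modelfile
REDUCED_CTX = 1047552  # the slightly smaller value that loads successfully

print(ROW_BYTES * FULL_CTX)     # 2147483648 -> exceeds INT_MAX, assert fires
print(ROW_BYTES * REDUCED_CTX)  # 2145386496 -> fits in a 32-bit int
```

With any power-of-two row size of 2048 bytes or more, a 2^20-token context pushes the product to at least 2^31, one past `INT_MAX`, which is consistent with the crash appearing only at exactly 1M context.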

Relevant log output

[ollama-nemo-1M-crash.log](https://github.com/user-attachments/files/25328416/ollama-nemo-1M-crash.log)

OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.16.1

GiteaMirror added the bug label 2026-05-05 01:17:29 -05:00

@rick-github commented on GitHub (Feb 15, 2026):

This may have the same cause as #14124, which may be fixed by the next vendor sync. In the meantime, the model will load if the context is set below the maximum value, e.g. 1047552.

```console
$ ollama-run.py nemotron-3-nano hello --context 1047552
Thinking...
We need to respond. It's a greeting "hello". Probably just answer with friendly response.
...done thinking
Hello! How can I help you today?
$ ollama ps
NAME                      ID              SIZE     PROCESSOR    CONTEXT    UNTIL
nemotron-3-nano:latest    b725f1117407    45 GB    100% GPU     1047552    Forever
```
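The workaround value is exactly 1024 tokens below the model's 1M maximum; whether that offset relates to a batch size is an assumption, but the arithmetic itself is simple:

```python
# Difference between the model's maximum context and the workaround value.
MAX_CTX = 1048576      # 2**20 (1M), the model's advertised maximum
WORKAROUND = 1047552   # context value that loads without the assert

print(MAX_CTX - WORKAROUND)  # 1024
```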


@balki commented on GitHub (Feb 16, 2026):

Thank you. That worked.

```
❯ ollama create my-nemotron-3-nano-1023K-latest -f Modelfile.nemotron-3-nano-latest
gathering model components
using existing layer sha256:a70437c41b3b0b768c48737e15f8160c90f13dc963f5226aabb3a160f708d1ce
using existing layer sha256:bca58c750377d01da1bea95cdbc6f0176f3b58003f6a31c7b069a2746cbea78f
creating new layer sha256:e2fa894c05f0e216b6b76df929af486fe2fa9ac527bbcc7204304bbf9875ffe6
writing manifest
success

❯ ollama run my-nemotron-3-nano-1023K-latest:latest
>>> hi
Thinking...
The user says "hi". Simple greeting. Should respond friendly.
...done thinking.

Hello! How can I help you today?

>>>

❯ cat Modelfile.nemotron-3-nano-latest
FROM nemotron-3-nano:latest

# PARAMETER num_ctx 1048576

PARAMETER num_ctx 1047552
```


@balki commented on GitHub (Mar 18, 2026):

@rick-github This still fails:

```
❯ ollama run my-nemotron-3-nano-1M-latest:latest
Error: 500 Internal Server Error: model failed to load, this may be due to resource limitations or an internal error, check ollama server logs for details

❯ ollama -v
ollama version is 0.18.1
```

Logs

```
Mar 18 16:26:03 cachyfd ollama[1476]:   Device 0: Radeon 8060S Graphics, gfx1151 (0x1151), VMM: no, Wave Size: 32, ID: 0
Mar 18 16:26:03 cachyfd ollama[1476]: load_backend: loaded ROCm backend from /usr/lib/ollama/libggml-hip.so
Mar 18 16:26:03 cachyfd ollama[1476]: load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
Mar 18 16:26:03 cachyfd ollama[1476]: time=2026-03-18T16:26:03.060-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.AVX512_VBMI=1 CPU.0.AVX512_VNNI=1 CPU.0.AVX512_BF16=1 CPU.0.LLAMAFILE=1 CPU.1.SSE3=1 CPU.1.SSSE3=1 CPU.1.AVX=1 CPU.1.AVX2=1 CPU.1.F16C=1 CPU.1.FMA=1 CPU.1.BMI2=1 CPU.1.AVX512=1 CPU.1.AVX512_VBMI=1 CPU.1.AVX512_VNNI=1 CPU.1.AVX512_BF16=1 CPU.1.LLAMAFILE=1 ROCm.0.NO_VMM=1 ROCm.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
Mar 18 16:26:03 cachyfd ollama[1476]: /startdir/src/ollama/ml/backend/ggml/ggml/src/ggml-cuda/cpy.cu:396: GGML_ASSERT(ggml_nbytes(src0) <= INT_MAX) failed
Mar 18 16:26:03 cachyfd ollama[1476]: /usr/lib/ollama/libggml-base.so.0(+0x15bd7) [0x7f1dd0516bd7]
Mar 18 16:26:03 cachyfd ollama[1476]: /usr/lib/ollama/libggml-base.so.0(ggml_print_backtrace+0x22e) [0x7f1dd05178be]
Mar 18 16:26:03 cachyfd ollama[1476]: /usr/lib/ollama/libggml-base.so.0(ggml_abort+0x183) [0x7f1dd0517ad3]
```


Reference: github-starred/ollama#71352