[GH-ISSUE #13556] Successfully run llama3.3 only for the first time. Subsequent run hits ollama._types.ResponseError: model requires more system memory (39.4 GiB) than is available (30.0 GiB) (status code: 500) #70986

Open
opened 2026-05-04 23:39:49 -05:00 by GiteaMirror · 18 comments
Owner

Originally created by @khteh on GitHub (Dec 24, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/13556

What is the issue?

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/usr/src/Python/rag-agent/src/rag_agent/EmailRAG.py", line 332, in <module>
    asyncio.run(main())
    ~~~~~~~~~~~^^^^^^^^
  File "/usr/lib/python3.13/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ~~~~~~~~~~^^^^^^
  File "/usr/lib/python3.13/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/lib/python3.13/asyncio/base_events.py", line 725, in run_until_complete
    return future.result()
           ~~~~~~~~~~~~~^^
  File "/usr/src/Python/rag-agent/src/rag_agent/EmailRAG.py", line 325, in main
    result = await rag.Chat("There's an immediate risk of electrical, water, or fire damage", EMAILS[3], email_state, config)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/src/rag_agent/EmailRAG.py", line 294, in Chat
    async for step in self._agent.with_config({"email_state": email_state, "thread_id": uuid7str()}).astream(
    ...<5 lines>...
        step["messages"][-1].pretty_print()
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langgraph/pregel/main.py", line 2971, in astream
    async for _ in runner.atick(
    ...<13 lines>...
            yield o
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langgraph/pregel/_runner.py", line 304, in atick
    await arun_with_retry(
    ...<15 lines>...
    )
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langgraph/pregel/_retry.py", line 137, in arun_with_retry
    return await task.proc.ainvoke(task.input, config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langgraph/_internal/_runnable.py", line 705, in ainvoke
    input = await asyncio.create_task(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^
        step.ainvoke(input, config, **kwargs), context=context
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langgraph/_internal/_runnable.py", line 473, in ainvoke
    ret = await self.afunc(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 1189, in amodel_node
    response = await awrap_model_call_handler(request, _execute_model_async)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 277, in final_normalized
    final_result = await result(request, handler)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 261, in composed
    outer_result = await outer(request, inner_handler)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/middleware/todo.py", line 224, in awrap_model_call
    return await handler(request.override(system_message=new_system_message))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 257, in inner_handler
    inner_result = await inner(req, handler)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 261, in composed
    outer_result = await outer(request, inner_handler)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/deepagents/middleware/filesystem.py", line 975, in awrap_model_call
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 257, in inner_handler
    inner_result = await inner(req, handler)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 261, in composed
    outer_result = await outer(request, inner_handler)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/deepagents/middleware/subagents.py", line 483, in awrap_model_call
    return await handler(request.override(system_prompt=system_prompt))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 257, in inner_handler
    inner_result = await inner(req, handler)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_anthropic/middleware/prompt_caching.py", line 140, in awrap_model_call
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/agents/factory.py", line 1157, in _execute_model_async
    output = await model_.ainvoke(messages)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain/chat_models/base.py", line 676, in ainvoke
    return await self._model(config).ainvoke(input, config=config, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_core/runnables/base.py", line 5570, in ainvoke
    return await self.bound.ainvoke(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<3 lines>...
    )
    ^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_core/language_models/chat_models.py", line 421, in ainvoke
    llm_result = await self.agenerate_prompt(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<8 lines>...
    )
    ^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_core/language_models/chat_models.py", line 1128, in agenerate_prompt
    return await self.agenerate(
           ^^^^^^^^^^^^^^^^^^^^^
        prompt_messages, stop=stop, callbacks=callbacks, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_core/language_models/chat_models.py", line 1086, in agenerate
    raise exceptions[0]
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_core/language_models/chat_models.py", line 1339, in _agenerate_with_cache
    result = await self._agenerate(
             ^^^^^^^^^^^^^^^^^^^^^^
        messages, stop=stop, run_manager=run_manager, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_ollama/chat_models.py", line 1208, in _agenerate
    final_chunk = await self._achat_stream_with_aggregation(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        messages, stop, run_manager, verbose=self.verbose, **kwargs
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_ollama/chat_models.py", line 991, in _achat_stream_with_aggregation
    async for chunk in self._aiterate_over_stream(messages, stop, **kwargs):
    ...<9 lines>...
            )
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_ollama/chat_models.py", line 1131, in _aiterate_over_stream
    async for stream_resp in self._acreate_chat_stream(messages, stop, **kwargs):
    ...<52 lines>...
            yield chunk
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/langchain_ollama/chat_models.py", line 937, in _acreate_chat_stream
    async for part in await self._async_client.chat(**chat_params):
        yield part
  File "/usr/src/Python/rag-agent/.venv/lib/python3.13/site-packages/ollama/_client.py", line 757, in inner
    raise ResponseError(e.response.text, e.response.status_code) from None
ollama._types.ResponseError: model requires more system memory (39.4 GiB) than is available (30.0 GiB) (status code: 500)
During task with name 'model' and id 'adcd9ce6-41f2-c654-a948-2a8a90670c34'

Inside the docker container:

root@ollama-0:/# ollama list
NAME                     ID              SIZE      MODIFIED       
embeddinggemma:latest    85462619ee72    621 MB    40 minutes ago    
llama3.3:latest          a6eb4748fd29    42 GB     40 minutes ago    
root@ollama-0:/# ollama ps
NAME    ID    SIZE    PROCESSOR    CONTEXT    UNTIL 
root@ollama-0:/# 

Is there a memory leak?

Relevant log output

[ollama-0 ollama] time=2025-12-24T07:11:01.101Z level=DEBUG source=runner.go:264 msg="refreshing free memory"
[ollama-0 ollama] time=2025-12-24T07:11:01.101Z level=DEBUG source=runner.go:328 msg="unable to refresh all GPUs with existing runners, performing bootstrap discovery"
[ollama-0 ollama] time=2025-12-24T07:11:01.101Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --port 39425"
[ollama-0 ollama] time=2025-12-24T07:11:01.101Z level=DEBUG source=server.go:430 msg=subprocess OLLAMA_MODELS=/models OLLAMA_SCHED_SPREAD=true OLLAMA_HOST=http://0.0.0.0:11434 OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_DEBUG=true LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_FLASH_ATTENTION=true OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
[ollama-0 ollama] time=2025-12-24T07:11:01.337Z level=DEBUG source=runner.go:437 msg="bootstrap discovery took" duration=236.298541ms OLLAMA_LIBRARY_PATH="[/usr/lib/ollama /usr/lib/ollama/cuda_v13]" extra_envs=map[]
[ollama-0 ollama] time=2025-12-24T07:11:01.337Z level=DEBUG source=runner.go:40 msg="overall device VRAM discovery took" duration=236.391661ms
[ollama-0 ollama] time=2025-12-24T07:11:01.338Z level=WARN source=cpu_linux.go:130 msg="failed to parse CPU allowed micro secs" error="strconv.ParseInt: parsing \"max\": invalid syntax"
[ollama-0 ollama] time=2025-12-24T07:11:01.349Z level=DEBUG source=ggml.go:282 msg="key with type not found" key=general.alignment default=32
[ollama-0 ollama] time=2025-12-24T07:11:01.349Z level=DEBUG source=sched.go:211 msg="loading first model" model=/models/blobs/sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d
[ollama-0 ollama] llama_model_loader: loaded meta data with 36 key-value pairs and 724 tensors from /models/blobs/sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d (version GGUF V3 (latest))
[ollama-0 ollama] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[ollama-0 ollama] llama_model_loader: - kv   0:                       general.architecture str              = llama
[ollama-0 ollama] llama_model_loader: - kv   1:                               general.type str              = model
[ollama-0 ollama] llama_model_loader: - kv   2:                               general.name str              = Llama 3.1 70B Instruct 2024 12
[ollama-0 ollama] llama_model_loader: - kv   3:                            general.version str              = 2024-12
[ollama-0 ollama] llama_model_loader: - kv   4:                           general.finetune str              = Instruct
[ollama-0 ollama] llama_model_loader: - kv   5:                           general.basename str              = Llama-3.1
[ollama-0 ollama] llama_model_loader: - kv   6:                         general.size_label str              = 70B
[ollama-0 ollama] llama_model_loader: - kv   7:                            general.license str              = llama3.1
[ollama-0 ollama] llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
[ollama-0 ollama] llama_model_loader: - kv   9:                  general.base_model.0.name str              = Llama 3.1 70B
[ollama-0 ollama] llama_model_loader: - kv  10:          general.base_model.0.organization str              = Meta Llama
[ollama-0 ollama] llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/meta-llama/Lla...
[ollama-0 ollama] llama_model_loader: - kv  12:                               general.tags arr[str,5]       = ["facebook", "meta", "pytorch", "llam...
[ollama-0 ollama] llama_model_loader: - kv  13:                          general.languages arr[str,7]       = ["fr", "it", "pt", "hi", "es", "th", ...
[ollama-0 ollama] llama_model_loader: - kv  14:                          llama.block_count u32              = 80
[ollama-0 ollama] llama_model_loader: - kv  15:                       llama.context_length u32              = 131072
[ollama-0 ollama] llama_model_loader: - kv  16:                     llama.embedding_length u32              = 8192
[ollama-0 ollama] llama_model_loader: - kv  17:                  llama.feed_forward_length u32              = 28672
[ollama-0 ollama] llama_model_loader: - kv  18:                 llama.attention.head_count u32              = 64
[ollama-0 ollama] llama_model_loader: - kv  19:              llama.attention.head_count_kv u32              = 8
[ollama-0 ollama] llama_model_loader: - kv  20:                       llama.rope.freq_base f32              = 500000.000000
[ollama-0 ollama] llama_model_loader: - kv  21:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
[ollama-0 ollama] llama_model_loader: - kv  22:                 llama.attention.key_length u32              = 128
[ollama-0 ollama] llama_model_loader: - kv  23:               llama.attention.value_length u32              = 128
[ollama-0 ollama] llama_model_loader: - kv  24:                          general.file_type u32              = 15
[ollama-0 ollama] llama_model_loader: - kv  25:                           llama.vocab_size u32              = 128256
[ollama-0 ollama] llama_model_loader: - kv  26:                 llama.rope.dimension_count u32              = 128
[ollama-0 ollama] llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = gpt2
[ollama-0 ollama] llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = llama-bpe
[ollama-0 ollama] llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
[ollama-0 ollama] llama_model_loader: - kv  30:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[ollama-0 ollama] llama_model_loader: - kv  31:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
[ollama-0 ollama] llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 128000
[ollama-0 ollama] llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 128009
[ollama-0 ollama] llama_model_loader: - kv  34:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
[ollama-0 ollama] llama_model_loader: - kv  35:               general.quantization_version u32              = 2
[ollama-0 ollama] llama_model_loader: - type  f32:  162 tensors
[ollama-0 ollama] llama_model_loader: - type q4_K:  441 tensors
[ollama-0 ollama] llama_model_loader: - type q5_K:   40 tensors
[ollama-0 ollama] llama_model_loader: - type q6_K:   81 tensors
[ollama-0 ollama] print_info: file format = GGUF V3 (latest)
[ollama-0 ollama] print_info: file type   = Q4_K - Medium
[ollama-0 ollama] print_info: file size   = 39.59 GiB (4.82 BPW) 
[ollama-0 ollama] init_tokenizer: initializing tokenizer for type 2
[ollama-0 ollama] load: printing all EOG tokens:
[ollama-0 ollama] load:   - 128001 ('<|end_of_text|>')
[ollama-0 ollama] load:   - 128008 ('<|eom_id|>')
[ollama-0 ollama] load:   - 128009 ('<|eot_id|>')
[ollama-0 ollama] load: special tokens cache size = 256
[ollama-0 ollama] load: token to piece cache size = 0.7999 MB
[ollama-0 ollama] print_info: arch             = llama
[ollama-0 ollama] print_info: vocab_only       = 1
[ollama-0 ollama] print_info: no_alloc         = 0
[ollama-0 ollama] print_info: model type       = ?B
[ollama-0 ollama] print_info: model params     = 70.55 B
[ollama-0 ollama] print_info: general.name     = Llama 3.1 70B Instruct 2024 12
[ollama-0 ollama] print_info: vocab type       = BPE
[ollama-0 ollama] print_info: n_vocab          = 128256
[ollama-0 ollama] print_info: n_merges         = 280147
[ollama-0 ollama] print_info: BOS token        = 128000 '<|begin_of_text|>'
[ollama-0 ollama] print_info: EOS token        = 128009 '<|eot_id|>'
[ollama-0 ollama] print_info: EOT token        = 128009 '<|eot_id|>'
[ollama-0 ollama] print_info: EOM token        = 128008 '<|eom_id|>'
[ollama-0 ollama] print_info: LF token         = 198 'Ċ'
[ollama-0 ollama] print_info: EOG token        = 128001 '<|end_of_text|>'
[ollama-0 ollama] print_info: EOG token        = 128008 '<|eom_id|>'
[ollama-0 ollama] print_info: EOG token        = 128009 '<|eot_id|>'
[ollama-0 ollama] print_info: max token length = 256
[ollama-0 ollama] llama_model_load: vocab only - skipping tensors
[ollama-0 ollama] time=2025-12-24T07:11:01.666Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --model /models/blobs/sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d --port 39731"
[ollama-0 ollama] time=2025-12-24T07:11:01.666Z level=DEBUG source=server.go:430 msg=subprocess OLLAMA_MODELS=/models OLLAMA_SCHED_SPREAD=true OLLAMA_HOST=http://0.0.0.0:11434 OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_DEBUG=true LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_FLASH_ATTENTION=true OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
[ollama-0 ollama] time=2025-12-24T07:11:01.666Z level=INFO source=sched.go:443 msg="system memory" total="68.1 GiB" free="27.8 GiB" free_swap="917.6 MiB"
[ollama-0 ollama] time=2025-12-24T07:11:01.666Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-9762feba-cea4-7981-7353-533400b79c72 library=CUDA available="3.1 GiB" free="3.5 GiB" minimum="457.0 MiB" overhead="0 B"
[ollama-0 ollama] time=2025-12-24T07:11:01.666Z level=INFO source=server.go:496 msg="loading model" "model layers"=81 requested=-1
[ollama-0 ollama] time=2025-12-24T07:11:01.667Z level=DEBUG source=ggml.go:617 msg="default cache size estimate" "attention MiB"=2560 "attention bytes"=2684354560 "recurrent MiB"=0 "recurrent bytes"=0
[ollama-0 ollama] time=2025-12-24T07:11:01.667Z level=DEBUG source=server.go:965 msg="available gpu" id=GPU-9762feba-cea4-7981-7353-533400b79c72 library=CUDA "available layer vram"="2.5 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
[ollama-0 ollama] time=2025-12-24T07:11:01.667Z level=WARN source=server.go:1033 msg="model request too large for system" requested="39.4 GiB" available="28.7 GiB" total="68.1 GiB" free="27.8 GiB" swap="917.6 MiB"
[ollama-0 ollama] time=2025-12-24T07:11:01.667Z level=INFO source=sched.go:470 msg="Load failed" model=/models/blobs/sha256-4824460d29f2058aaf6e1118a63a7a197a09bed509f0e7d4e2efb1ee273b447d error="model requires more system memory (39.4 GiB) than is available (28.7 GiB)"
[ollama-0 ollama] time=2025-12-24T07:11:01.678Z level=INFO source=runner.go:965 msg="starting go runner"
[ollama-0 ollama] time=2025-12-24T07:11:01.678Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
[ollama-0 ollama] load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-icelake.so
[ollama-0 ollama] time=2025-12-24T07:11:01.684Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v13
[ollama-0 ollama] time=2025-12-24T07:11:01.770Z level=DEBUG source=server.go:1803 msg="stopping llama server" pid=8041
[ollama-0 ollama] time=2025-12-24T07:11:01.770Z level=DEBUG source=server.go:1809 msg="waiting for llama server to exit" pid=8041
[ollama-0 ollama] time=2025-12-24T07:11:01.871Z level=DEBUG source=server.go:1813 msg="llama server stopped" pid=8041

OS

Linux, Docker

GPU

Nvidia

CPU

Intel

Ollama version

0.13.5

GiteaMirror added the bug label 2026-05-04 23:39:50 -05:00
Author
Owner

@rick-github commented on GitHub (Dec 24, 2025):

model requires more system memory (39.4 GiB) than is available (30.0 GiB) (status code: 500)

The model requires more system memory than is available. Ollama wants 39.4 GiB to load the model and only 30.0 GiB is free. Information about resources and a server log (https://docs.ollama.com/troubleshooting) will help in debugging.
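
For reference, a minimal sketch of capturing those resource numbers from inside the container (it assumes the psutil package, which is not part of this setup; the GiB math mirrors the units in the error message):

import psutil

GIB = 1024 ** 3

vm = psutil.virtual_memory()   # system RAM as seen from inside the container
sw = psutil.swap_memory()      # swap also factors into the scheduler's check

print(f"RAM  total={vm.total / GIB:.1f} GiB  available={vm.available / GIB:.1f} GiB")
print(f"Swap total={sw.total / GIB:.1f} GiB  free={sw.free / GIB:.1f} GiB")

# The llama3.3 70B Q4_K_M weights are ~39.6 GiB on disk (see the server log below),
# so if `available` is under ~39.4 GiB the "model requires more system memory"
# error is the expected outcome.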

Author
Owner

@khteh commented on GitHub (Dec 24, 2025):

Added the logs

Author
Owner

@davejpeters commented on GitHub (Jan 4, 2026):

You might try lowering the context window of the Ollama model you are loading from within your Python program, e.g.:

llm = Ollama(
    model="llama3.2:3b",
    json_mode=True,
    temperature=0,
    context_window=4096,   # <-- This is the part that might help
    request_timeout=300.0,
    system_prompt=system_prompt,
    # api_key=api_key,
)

I ran into similar issues and this solved them for me. Don't quote me, but it seems that most of the time the full context window (and the memory it hogs) isn't needed.
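
Note that the traceback in this issue goes through langchain_ollama's ChatOllama rather than the client shown above; the corresponding setting there is num_ctx. A minimal sketch (the model name and the 4096 value are illustrative, not recommendations):

from langchain_ollama import ChatOllama

# Same idea as the snippet above, but for the langchain_ollama client that
# appears in the traceback: num_ctx caps the context window Ollama allocates,
# which shrinks the KV cache the server has to reserve.
llm = ChatOllama(
    model="llama3.3",
    temperature=0,
    num_ctx=4096,   # illustrative; pick the smallest window the task tolerates
)

print(llm.invoke("Hello").content)

This only trims the KV cache; the bulk of the 39.4 GiB requirement is the model weights themselves (39.59 GiB per the log above), which is consistent with the "same error" result reported below.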

Author
Owner

@khteh commented on GitHub (Jan 12, 2026):

Doesn't that affect the performance of the model?

Author
Owner

@khteh commented on GitHub (Jan 12, 2026):

To no avail. Same error.

Author
Owner

@davejpeters commented on GitHub (Jan 13, 2026):

Performance can depend on the task.

Author
Owner

@khteh commented on GitHub (Jan 13, 2026):

I have switched to the gpt-oss model now and do NOT see this issue.

Author
Owner

@markasoftware-tc commented on GitHub (Jan 19, 2026):

How much RAM does your system have / have you given to the container? I believe this may be an instance of a bug in how ollama detects available memory in containers.

Author
Owner

@khteh commented on GitHub (Jan 20, 2026):

45 GiB out of 72 GB given to the container.

Author
Owner

@markasoftware-tc commented on GitHub (Jan 20, 2026):

Yeah this is probably due to a bug with how Ollama accounts for used memory in containers, which I fixed in the PR you can see above. If you're familiar with how to build ollama then I recommend building from source on my pr branch with the instructions here https://github.com/ollama/ollama/blob/main/docs/development.md#linux
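
For background on what "accounts for used memory in containers" refers to: under cgroup v2, a large share of the memory charged to a container is reclaimable page cache, so treating memory.current as "used" undercounts what is actually available. A rough sketch of that accounting (the file paths are standard cgroup v2; this illustrates the idea only and is not the code in the PR):

from pathlib import Path

CGROUP = Path("/sys/fs/cgroup")   # cgroup v2 unified hierarchy mount point

def read_limit(name: str) -> int:
    text = (CGROUP / name).read_text().strip()
    return 1 << 62 if text == "max" else int(text)   # "max" means unlimited

limit = read_limit("memory.max")                        # container memory limit
current = int((CGROUP / "memory.current").read_text())  # memory currently charged

# inactive_file in memory.stat is page cache the kernel can reclaim under
# pressure, so it should not count against a model that needs to be loaded.
stat = dict(line.split() for line in (CGROUP / "memory.stat").read_text().splitlines())
reclaimable = int(stat.get("inactive_file", 0))

GIB = 1024 ** 3
usable = limit - (current - reclaimable)
print(f"limit={limit / GIB:.1f} GiB  charged={current / GIB:.1f} GiB  "
      f"reclaimable={reclaimable / GIB:.1f} GiB  usable~{usable / GIB:.1f} GiB")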

Author
Owner

@khteh commented on GitHub (Jan 21, 2026):

Please provide a formal release. Thanks.

Author
Owner

@khteh commented on GitHub (Jan 21, 2026):

BTW, is it related to this https://github.com/ollama/ollama/issues/10124 ?

Author
Owner

@markasoftware-tc commented on GitHub (Jan 21, 2026):

BTW, is it related to this #10124 ?

I don't think so; this is purely about system RAM/memory usage.

Author
Owner

@khteh commented on GitHub (Jan 22, 2026):

Yeah this is probably due to a bug with how Ollama accounts for used memory in containers, which I fixed in the PR you can see above. If you're familiar with how to build ollama then I recommend building from source on my pr branch with the instructions here https://github.com/ollama/ollama/blob/main/docs/development.md#linux

I can't find your branch.

Image: https://github.com/user-attachments/assets/310f3bba-0006-48e9-a8ba-8a18790ee140
Author
Owner

@markasoftware-tc commented on GitHub (Jan 22, 2026):

Yeah this is probably due to a bug with how Ollama accounts for used memory in containers, which I fixed in the PR you can see above. If you're familiar with how to build ollama then I recommend building from source on my pr branch with the instructions here https://github.com/ollama/ollama/blob/main/docs/development.md#linux

Don't find your branch.
Image

The way GitHub pull requests work is that you typically fork the repository into your own account and create the branch there, not under the main repo, unless you are a trusted collaborator. My branch is at https://github.com/markasoftware-tc/ollama/tree/markasoftware/cgroup-reclaimable-memory

Author
Owner

@khteh commented on GitHub (Jan 23, 2026):

Can you create a docker image and push to https://hub.docker.com/ so that I can test it?

Author
Owner

@markasoftware-tc commented on GitHub (Jan 23, 2026):

Unfortunately no, this is not my responsibility.

Author
Owner

@khteh commented on GitHub (Feb 5, 2026):

Can anybody please review and merge the PR? There is a similar issue with another model, qwen3-next:latest:

[ragagent-1 ragagent] ollama._types.ResponseError: model requires more system memory (45.5 GiB) than is available (38.4 GiB) (status code: 500)
[ragagent-1 ragagent] During task with name 'model' and id '01053c47-0aa6-557d-4ee6-518c95fbce97'

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            68Gi        18Gi        11Gi       3.4Gi        42Gi        50Gi
[ollama-0 ollama] llama_model_loader: loaded meta data with 45 key-value pairs and 807 tensors from /models/blobs/sha256-8476acca2ca7dc4dd86ad2e069cb270fdbd44287d9ff3006d86e9a54cc19acd1 (version GGUF V3 (latest))
[ollama-0 ollama] llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
[ollama-0 ollama] llama_model_loader: - kv   0:                       general.architecture str              = qwen3next
[ollama-0 ollama] llama_model_loader: - kv   1:                           general.basename str              = Qwen3-Next
[ollama-0 ollama] llama_model_loader: - kv   2:                          general.file_type u32              = 15
[ollama-0 ollama] llama_model_loader: - kv   3:                           general.finetune str              = Thinking
[ollama-0 ollama] llama_model_loader: - kv   4:                            general.license str              = apache-2.0
[ollama-0 ollama] llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3-Nex...
[ollama-0 ollama] llama_model_loader: - kv   6:                               general.name str              = Qwen3 Next 80B A3B Thinking
[ollama-0 ollama] llama_model_loader: - kv   7:                    general.parameter_count u64              = 79674391296
[ollama-0 ollama] llama_model_loader: - kv   8:               general.quantization_version u32              = 2
[ollama-0 ollama] llama_model_loader: - kv   9:                      general.sampling.temp f32              = 0.600000
[ollama-0 ollama] llama_model_loader: - kv  10:                     general.sampling.top_k i32              = 20
[ollama-0 ollama] llama_model_loader: - kv  11:                     general.sampling.top_p f32              = 0.950000
[ollama-0 ollama] llama_model_loader: - kv  12:                         general.size_label str              = 80B-A3B
[ollama-0 ollama] llama_model_loader: - kv  13:                               general.tags arr[str,1]       = ["text-generation"]
[ollama-0 ollama] llama_model_loader: - kv  14:                               general.type str              = model
[ollama-0 ollama] llama_model_loader: - kv  15:             qwen3next.attention.head_count u32              = 16
[ollama-0 ollama] llama_model_loader: - kv  16:          qwen3next.attention.head_count_kv u32              = 2
[ollama-0 ollama] llama_model_loader: - kv  17:             qwen3next.attention.key_length u32              = 256
[ollama-0 ollama] llama_model_loader: - kv  18: qwen3next.attention.layer_norm_rms_epsilon f32              = 0.000001
[ollama-0 ollama] llama_model_loader: - kv  19:           qwen3next.attention.value_length u32              = 256
[ollama-0 ollama] llama_model_loader: - kv  20:                      qwen3next.block_count u32              = 48
[ollama-0 ollama] llama_model_loader: - kv  21:                   qwen3next.context_length u32              = 262144
[ollama-0 ollama] llama_model_loader: - kv  22:                 qwen3next.embedding_length u32              = 2048
[ollama-0 ollama] llama_model_loader: - kv  23:                     qwen3next.expert_count u32              = 512
[ollama-0 ollama] llama_model_loader: - kv  24:       qwen3next.expert_feed_forward_length u32              = 512
[ollama-0 ollama] llama_model_loader: - kv  25: qwen3next.expert_shared_feed_forward_length u32              = 512
[ollama-0 ollama] llama_model_loader: - kv  26:                qwen3next.expert_used_count u32              = 10
[ollama-0 ollama] llama_model_loader: - kv  27:              qwen3next.feed_forward_length u32              = 5120
[ollama-0 ollama] llama_model_loader: - kv  28:             qwen3next.rope.dimension_count u32              = 64
[ollama-0 ollama] llama_model_loader: - kv  29:                   qwen3next.rope.freq_base f32              = 10000000.000000
[ollama-0 ollama] llama_model_loader: - kv  30:                  qwen3next.ssm.conv_kernel u32              = 4
[ollama-0 ollama] llama_model_loader: - kv  31:                  qwen3next.ssm.group_count u32              = 16
[ollama-0 ollama] llama_model_loader: - kv  32:                   qwen3next.ssm.inner_size u32              = 4096
[ollama-0 ollama] llama_model_loader: - kv  33:                   qwen3next.ssm.state_size u32              = 128
[ollama-0 ollama] llama_model_loader: - kv  34:               qwen3next.ssm.time_step_rank u32              = 32
[ollama-0 ollama] llama_model_loader: - kv  35:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
[ollama-0 ollama] llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = false
[ollama-0 ollama] llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 151643
[ollama-0 ollama] llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 151645
[ollama-0 ollama] llama_model_loader: - kv  39:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
[ollama-0 ollama] llama_model_loader: - kv  40:                       tokenizer.ggml.model str              = gpt2
[ollama-0 ollama] llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 151643
[ollama-0 ollama] llama_model_loader: - kv  42:                         tokenizer.ggml.pre str              = qwen2
[ollama-0 ollama] llama_model_loader: - kv  43:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
[ollama-0 ollama] llama_model_loader: - kv  44:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
[ollama-0 ollama] llama_model_loader: - type  f32:  313 tensors
[ollama-0 ollama] llama_model_loader: - type q4_K:  415 tensors
[ollama-0 ollama] llama_model_loader: - type q6_K:   79 tensors
[ollama-0 ollama] print_info: file format = GGUF V3 (latest)
[ollama-0 ollama] print_info: file type   = Q4_K - Medium
[ollama-0 ollama] print_info: file size   = 46.89 GiB (5.06 BPW) 
[ollama-0 ollama] init_tokenizer: initializing tokenizer for type 2
[ollama-0 ollama] load: control token: 151660 '<|fim_middle|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151653 '<|vision_end|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151648 '<|box_start|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151649 '<|box_end|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151655 '<|image_pad|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151651 '<|quad_end|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151652 '<|vision_start|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151654 '<|vision_pad|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151656 '<|video_pad|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151644 '<|im_start|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
[ollama-0 ollama] load: control token: 151650 '<|quad_start|>' is not marked as EOG
[ollama-0 ollama] load: printing all EOG tokens:
[ollama-0 ollama] load:   - 151643 ('<|endoftext|>')
[ollama-0 ollama] load:   - 151645 ('<|im_end|>')
[ollama-0 ollama] load:   - 151662 ('<|fim_pad|>')
[ollama-0 ollama] load:   - 151663 ('<|repo_name|>')
[ollama-0 ollama] load:   - 151664 ('<|file_sep|>')
[ollama-0 ollama] load: special tokens cache size = 26
[ollama-0 ollama] load: token to piece cache size = 0.9311 MB
[ollama-0 ollama] print_info: arch             = qwen3next
[ollama-0 ollama] print_info: vocab_only       = 1
[ollama-0 ollama] print_info: no_alloc         = 0
[ollama-0 ollama] print_info: ssm_d_conv       = 0
[ollama-0 ollama] print_info: ssm_d_inner      = 0
[ollama-0 ollama] print_info: ssm_d_state      = 0
[ollama-0 ollama] print_info: ssm_dt_rank      = 0
[ollama-0 ollama] print_info: ssm_n_group      = 0
[ollama-0 ollama] print_info: ssm_dt_b_c_rms   = 0
[ollama-0 ollama] print_info: model type       = ?B
[ollama-0 ollama] print_info: model params     = 79.67 B
[ollama-0 ollama] print_info: general.name     = Qwen3 Next 80B A3B Thinking
[ollama-0 ollama] print_info: vocab type       = BPE
[ollama-0 ollama] print_info: n_vocab          = 151936
[ollama-0 ollama] print_info: n_merges         = 151387
[ollama-0 ollama] print_info: BOS token        = 151643 '<|endoftext|>'
[ollama-0 ollama] print_info: EOS token        = 151645 '<|im_end|>'
[ollama-0 ollama] print_info: EOT token        = 151645 '<|im_end|>'
[ollama-0 ollama] print_info: PAD token        = 151643 '<|endoftext|>'
[ollama-0 ollama] print_info: LF token         = 198 'Ċ'
[ollama-0 ollama] print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
[ollama-0 ollama] print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
[ollama-0 ollama] print_info: FIM MID token    = 151660 '<|fim_middle|>'
[ollama-0 ollama] print_info: FIM PAD token    = 151662 '<|fim_pad|>'
[ollama-0 ollama] print_info: FIM REP token    = 151663 '<|repo_name|>'
[ollama-0 ollama] print_info: FIM SEP token    = 151664 '<|file_sep|>'
[ollama-0 ollama] print_info: EOG token        = 151643 '<|endoftext|>'
[ollama-0 ollama] print_info: EOG token        = 151645 '<|im_end|>'
[ollama-0 ollama] print_info: EOG token        = 151662 '<|fim_pad|>'
[ollama-0 ollama] print_info: EOG token        = 151663 '<|repo_name|>'
[ollama-0 ollama] print_info: EOG token        = 151664 '<|file_sep|>'
[ollama-0 ollama] print_info: max token length = 256
[ollama-0 ollama] llama_model_load: vocab only - skipping tensors
[ollama-0 ollama] time=2026-02-05T05:28:04.823Z level=INFO source=server.go:429 msg="starting runner" cmd="/usr/bin/ollama runner --model /models/blobs/sha256-8476acca2ca7dc4dd86ad2e069cb270fdbd44287d9ff3006d86e9a54cc19acd1 --port 38045"
[ollama-0 ollama] time=2026-02-05T05:28:04.823Z level=DEBUG source=server.go:430 msg=subprocess OLLAMA_MODELS=/models OLLAMA_SCHED_SPREAD=true OLLAMA_HOST=http://0.0.0.0:11434 OLLAMA_CONTEXT_LENGTH=8192 OLLAMA_DEBUG=true LD_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13:/usr/local/nvidia/lib:/usr/local/nvidia/lib64 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin OLLAMA_FLASH_ATTENTION=true OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v13
[ollama-0 ollama] time=2026-02-05T05:28:04.823Z level=INFO source=sched.go:443 msg="system memory" total="68.1 GiB" free="38.5 GiB" free_swap="2.0 GiB"
[ollama-0 ollama] time=2026-02-05T05:28:04.823Z level=INFO source=sched.go:450 msg="gpu memory" id=GPU-9762feba-cea4-7981-7353-533400b79c72 library=CUDA available="3.2 GiB" free="3.7 GiB" minimum="457.0 MiB" overhead="0 B"
[ollama-0 ollama] time=2026-02-05T05:28:04.823Z level=INFO source=server.go:496 msg="loading model" "model layers"=49 requested=-1
[ollama-0 ollama] time=2026-02-05T05:28:04.823Z level=DEBUG source=ggml.go:617 msg="default cache size estimate" "attention MiB"=768 "attention bytes"=805306368 "recurrent MiB"=0 "recurrent bytes"=0
[ollama-0 ollama] time=2026-02-05T05:28:04.824Z level=DEBUG source=server.go:965 msg="available gpu" id=GPU-9762feba-cea4-7981-7353-533400b79c72 library=CUDA "available layer vram"="2.2 GiB" backoff=0.00 minimum="457.0 MiB" overhead="0 B" graph="0 B"
[ollama-0 ollama] time=2026-02-05T05:28:04.824Z level=WARN source=server.go:1033 msg="model request too large for system" requested="45.5 GiB" available="40.5 GiB" total="68.1 GiB" free="38.5 GiB" swap="2.0 GiB"
[ollama-0 ollama] time=2026-02-05T05:28:04.824Z level=INFO source=sched.go:470 msg="Load failed" model=/models/blobs/sha256-8476acca2ca7dc4dd86ad2e069cb270fdbd44287d9ff3006d86e9a54cc19acd1 error="model requires more system memory (45.5 GiB) than is available (40.5 GiB)"
<!-- gh-comment-id:3851172311 --> @khteh commented on GitHub (Feb 5, 2026):

Can anybody please review and merge the PR? There is a similar issue with another model `qwen3-next:latest`:

```
[ragagent-1 ragagent] ollama._types.ResponseError: model requires more system memory (45.5 GiB) than is available (38.4 GiB) (status code: 500)
[ragagent-1 ragagent] During task with name 'model' and id '01053c47-0aa6-557d-4ee6-518c95fbce97'

$ free -h
               total        used        free      shared  buff/cache   available
Mem:            68Gi        18Gi        11Gi       3.4Gi        42Gi        50Gi
```

The `ollama` server log for this failed load is the one reproduced above.
Reference: github-starred/ollama#70986