Fetch Failed Error on using OLLAMA locally with nomic-embed-text and llama3.1:8b #4327

Closed
opened 2025-11-12 12:15:38 -06:00 by GiteaMirror · 12 comments

Originally created by @saisandeepbalbari on GitHub (Sep 18, 2024).

What is the issue?

I'm using Ollama with AnythingLLM, and it's taking a lot of time to respond to prompts.

This is the error I'm getting from the Docker logs of AnythingLLM:

[OllamaEmbedder] Embedding 1 chunks of text with nomic-embed-text:latest.
TypeError: fetch failed
at node:internal/deps/undici/undici:12618:11
at async createOllamaStream (/app/server/node_modules/@langchain/community/dist/utils/ollama.cjs:12:22)
at async createOllamaChatStream (/app/server/node_modules/@langchain/community/dist/utils/ollama.cjs:61:5)
at async ChatOllama._streamResponseChunks (/app/server/node_modules/@langchain/community/dist/chat_models/ollama.cjs:399:30)
at async ChatOllama._streamIterator (/app/server/node_modules/@langchain/core/dist/language_models/chat_models.cjs:82:34)
at async ChatOllama.transform (/app/server/node_modules/@langchain/core/dist/runnables/base.cjs:382:9)
at async wrapInputForTracing (/app/server/node_modules/@langchain/core/dist/runnables/base.cjs:258:30)
at async pipeGeneratorWithSetup (/app/server/node_modules/@langchain/core/dist/utils/stream.cjs:230:19)
at async StringOutputParser._transformStreamWithConfig (/app/server/node_modules/@langchain/core/dist/runnables/base.cjs:279:26)
at async StringOutputParser.transform (/app/server/node_modules/@langchain/core/dist/output_parsers/transform.cjs:36:9) {
cause: HeadersTimeoutError: Headers Timeout Error
at Timeout.onParserTimeout [as callback] (node:internal/deps/undici/undici:9117:32)
at Timeout.onTimeout [as _onTimeout] (node:internal/deps/undici/undici:7148:17)
at listOnTimeout (node:internal/timers:569:17)
at process.processTimers (node:internal/timers:512:7) {
code: 'UND_ERR_HEADERS_TIMEOUT'
}
}

OS

Linux

GPU

No response

CPU

Intel

Ollama version

No response

GiteaMirror added the bug label 2025-11-12 12:15:38 -06:00

@rick-github commented on GitHub (Sep 18, 2024):

What are the logs from the ollama container? What's the Anything LLM config you are setting to talk to ollama?

@saisandeepbalbari commented on GitHub (Sep 19, 2024):

The following are the logs

Sep 19 05:04:09 ip-15-0-125-54 ollama[39143]: time=2024-09-19T05:04:09.799Z level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server loading model"
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_vocab: special tokens cache size = 256
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: arch = llama
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: vocab type = BPE
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_vocab = 128256
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_merges = 280147
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: vocab_only = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_ctx_train = 131072
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd = 4096
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_layer = 32
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_head = 32
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_head_kv = 8
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_rot = 128
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_swa = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_head_k = 128
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_head_v = 128
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_gqa = 4
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_ff = 14336
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_expert = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_expert_used = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: causal attn = 1
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: pooling type = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: rope type = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: rope scaling = linear
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: freq_base_train = 500000.0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: freq_scale_train = 1
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: rope_finetuned = unknown
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_d_conv = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_d_inner = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_d_state = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_dt_rank = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model type = 8B
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model ftype = Q4_0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model params = 8.03 B
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: LF token = 128 'Ä'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: max token length = 256
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_tensors: ggml ctx size = 0.14 MiB
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: warning: failed to mlock 4653387776-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_tensors: CPU buffer size = 4437.81 MiB
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: n_ctx = 8192
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: n_batch = 512
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: n_ubatch = 512
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: flash_attn = 0
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: freq_base = 500000.0
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: freq_scale = 1
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: CPU output buffer size = 2.02 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: CPU compute buffer size = 560.01 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: graph nodes = 1030
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: graph splits = 1
Sep 19 05:04:14 ip-15-0-125-54 ollama[186217]: INFO [main] model loaded | tid="129247310618752" timestamp=1726722254
Sep 19 05:04:14 ip-15-0-125-54 ollama[39143]: time=2024-09-19T05:04:14.422Z level=INFO source=server.go:629 msg="llama runner started in 4.87 seconds"
Sep 19 05:09:00 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/19 - 05:09:00 | 200 | 5m1s | 172.17.0.3 | POST "/api/chat"

@rick-github commented on GitHub (Sep 19, 2024):

Need the full log.

@saisandeepbalbari commented on GitHub (Sep 19, 2024):

Here I'm sharing the complete log file, please find attached:

ollama_logs_from_server_start.txt (https://github.com/user-attachments/files/17069943/ollama_logs_from_server_start.txt)

@rick-github commented on GitHub (Sep 21, 2024):

I see two issues here.

mlock

You have no GPU, 16GB of RAM most of which is free, and no swap.

Sep 16 17:30:27 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:30:27.551Z level=INFO source=gpu.go:347 msg="no compatible GPUs were discovered"
Sep 16 17:37:20 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:37:20.132Z level=INFO source=server.go:101 msg="system memory" total="15.4 GiB" free="14.4 GiB" free_swap="0 B"

The embedding model loads fine.

Sep 16 17:37:21 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 17:37:21 | 200 |  1.574410247s |      172.17.0.3 | POST     "/api/embeddings"

The llama3.1:8b model wants to load into memory and needs 5.8G:

Sep 16 17:38:33 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:38:33.543Z level=INFO source=memory.go:326 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[13.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.8 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"

The model is loaded with --mlock. This is probably because the client (Anything LLM) is sending "use_mlock":true in the API call:

Sep 16 17:38:33 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:38:33.543Z level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama320348475/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 38237"

On Linux systems, a common limit on how much memory a process can lock is ~4G:

$ echo $(dc <<< "$(ulimit -l) 1k 1024/1024/p")G
3.8G

ollama tries to lock 4.3G and fails:

ollama-1  | warning: failed to mlock 4653387776-byte buffer (after previously locking 0 bytes): Cannot allocate memory

This is not a fatal error, and can be remedied by setting the value of RLIMIT_MEMLOCK to something larger. It looks like you are using docker, so you can do that with:

services:
  ollama:
    ulimits:
      memlock:
        soft: 8192000000
        hard: 8192000000
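
To confirm the new limit took effect, a quick check from inside the container (assuming the compose service is named ollama; ulimit -l reports the limit in kbytes):

$ docker compose exec ollama sh -c 'ulimit -l'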

timeouts

Almost all of the calls to /api/chat are taking more than 5 minutes.

Sep 16 17:42:36 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 17:42:36 | 200 |          4m2s |      172.17.0.3 | POST     "/api/chat"
Sep 16 17:50:05 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 17:50:05 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 16 18:01:17 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 18:01:17 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 16 18:10:40 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 18:10:40 | 200 |         1m39s |      172.17.0.2 | POST     "/api/chat"
Sep 16 18:18:12 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 18:18:12 | 200 |         6m14s |      172.17.0.2 | POST     "/api/chat"
Sep 17 05:58:56 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/17 - 05:58:56 | 200 |         6m18s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:06:31 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:06:31 | 200 |         5m30s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:38:35 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:38:35 | 200 |         5m37s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:39:02 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:39:02 | 200 | 16.797917219s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:51:45 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:51:45 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:52:50 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:52:50 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:59:36 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:59:36 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:12:24 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:12:24 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:14:37 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:14:37 | 200 |        12m12s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:14:43 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:14:43 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:32:39 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:32:39 | 200 |         6m52s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:37:49 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:37:49 | 200 |          5m1s |      172.17.0.2 | POST     "/api/chat"
Sep 18 06:40:07 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:40:07 | 200 |         5m15s |      172.17.0.2 | POST     "/api/chat"
Sep 18 06:45:38 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:45:38 | 200 |          5m0s |      172.17.0.2 | POST     "/api/chat"
Sep 18 07:55:16 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 07:55:16 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 18 07:56:07 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 07:56:07 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 19 05:03:10 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/19 - 05:03:10 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 19 05:09:00 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/19 - 05:09:00 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"

Many complete just after 5 minutes, so it's possible there's a timeout on the client side for some of these calls. Since you don't have a GPU, inference will take longer than on a system with one, but > 5m is a poor experience. Since you are using an embedding model, I'm going to guess that you have some sort of RAG system and are sending a bunch of context along with the query. More prompt means more prompt processing, so bigger queries will take longer to return a result. To mitigate this, you can try sending less context with the queries.

Another possibility is that you are exceeding the size of the context window, which is using the default of 2k per model thread:

Sep 19 05:04:09 ip-15-0-125-54 ollama[39143]: time=2024-09-19T05:04:09.547Z level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama320348475/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 37153"

Exceeding the context window causes a model to slow down as it discards a bunch of the tokens in the input stream, which may be contributing to the long response times. If you enable extra debugging with OLLAMA_DEBUG=1 in the server process, we can get a better idea of what's causing the slow responses.
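
Since these logs are coming from journald rather than a container, ollama appears to be running as a systemd service; a minimal sketch of turning on debug logging in that case (assuming the service is named ollama):

$ sudo systemctl edit ollama
# in the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
$ sudo systemctl restart ollama
$ journalctl -u ollama -f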

If it is a case of exceeding the context window, you can either adjust the client to request a larger context by setting "options":{"num_ctx":8192} in the API call, or you can create a model with a larger context window and use that instead of llama3.1:8b:

$ cat > Modelfile <<EOF
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
$ ollama create llama3.1:8b-8kcontext
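
If you prefer the API route instead, a minimal curl sketch of requesting a larger context per call (the prompt here is just a placeholder):

$ curl http://localhost:11434/api/chat -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "why is the sky blue?"}],
    "options": {"num_ctx": 8192}
  }'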

Note that increasing the context window will require more RAM, so be aware of the available RAM on your system.

@saisandeepbalbari commented on GitHub (Sep 30, 2024):

Thank you @rick-github

I have made the changes you suggested, but I'm still not able to load the llama3.1:8b model.

However, I'm able to load and run a smaller model, qwen2, and get a response.

But in both cases I'm getting the "cannot allocate memory" error. Is it because the system has only 6.4 GB of free memory after allocating 8 GB of swap?

Here are the logs from using both of these models:

ollama_logs_27_sep_2024_llama3.1.txt (https://github.com/user-attachments/files/17184124/ollama_logs_27_sep_2024_llama3.1.txt)
ollama_logs_27_sep_2024_qwen2.txt (https://github.com/user-attachments/files/17184125/ollama_logs_27_sep_2024_qwen2.txt)

@rick-github commented on GitHub (Sep 30, 2024):

time=2024-09-27T09:49:47.649Z level=INFO source=server.go:629 msg="llama runner started in 50.78 seconds"
[GIN] 2024/09/27 - 09:53:58 | 200 |          5m1s |      172.17.0.2 | POST     "/api/chat"

The llama3.1:8b model loads, but you don't get a response because it's taking longer than the 5-minute timeout you have in the client. You need to increase the timeout in the client, send prompts that are easier to respond to, or get a GPU so that the model runs faster. You could also try the new llama3.2 models (https://ollama.com/library/llama3.2); Meta say they are as good as the llama3.1 models, and because they are smaller they will run faster.
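
One way to confirm it's the client timeout rather than ollama itself is to time the same request directly against the API, bypassing Anything LLM; a rough sketch (the prompt is just a placeholder):

$ time curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3.1:8b","prompt":"why is the sky blue?","stream":false}' \
    -o /dev/null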

Regarding the memory error, it only indicates that the ollama server wasn't able to pin the model in RAM to prevent it from going to swap. mlock just avoids some slowdown in processing (unless you have a really slow disk that you are swapping to, it won't make much difference to inference speed), but it's better to lock the model in RAM if possible. Did you add the ulimits parameters to the compose file? What's the output of docker container inspect ollama | jq '.[].HostConfig.Ulimits'?

@saisandeepbalbari commented on GitHub (Sep 30, 2024):

Thank you for your quick response.

I will try to increase the timeout and also try the llama3.2 models, and if that doesn't work I'll use a machine with GPUs.

In case I need to use a GPU, what is the minimum required configuration I should have?

I'm not running Ollama using Docker; I'm running it in the background on the system.

I have modified the /etc/security/limits.conf file by adding these two lines:

* soft memlock 8192000000
* hard memlock 8192000000
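
Note that limits.conf generally only applies to login sessions; if ollama runs as a systemd service, the memlock limit would instead be raised in the unit itself. A minimal sketch, assuming the service is named ollama:

$ sudo systemctl edit ollama
# in the override file that opens, add:
#   [Service]
#   LimitMEMLOCK=infinity
$ sudo systemctl restart ollama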

@rick-github commented on GitHub (Sep 30, 2024):

Any GPU will speed up the inference process. Since it looks like you are doing some sort of RAG with large context, your main concern will probably be VRAM. This is a small table of context size vs VRAM for llama3.1:8b:

| Context Size | VRAM |
|--------------|--------|
| 2048 | 5.1GB |
| 4096 | 5.4GB |
| 8192 | 6.2GB |
| 16384 | 7.7GB |
| 32768 | 10.8GB |
| 65536 | 17.3GB |
| 131072 | 29.9GB |

So a x060 Nvidia card (8GB VRAM) would be good up to about a context size of 8K, x070 (12GB) and x080 (16GB) for 32K, x090 (24GB) for 64K. You can make do with a smaller card with a larger context size, just be aware that some inference will then be done in RAM and will be slower (but still faster than all in RAM).

The llama3.2 3B model is smaller and requires less VRAM (note there is some overhead not shown here, about 400M) so a x080 would be good for a context size of 64K:

| Context Size | VRAM |
|--------------|--------|
| 2048 | 2.9GB |
| 4096 | 3.1GB |
| 8192 | 3.7GB |
| 16384 | 5.0GB |
| 32768 | 7.6GB |
| 65536 | 12.8GB |
| 131072 | 23.7GB |
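
Once a model is loaded, ollama ps reports its total size and the CPU/GPU split (on recent versions the PROCESSOR column shows e.g. "100% GPU"), which is a quick way to check whether a given card and context size still fit in VRAM; a minimal sketch:

$ ollama ps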

@robotom commented on GitHub (Oct 26, 2024):

Really helpful. Could you tell me how to determine this for any other model? Like a Qwen for example.

> Any GPU will speed up the inference process. Since it looks like you are doing some sort of RAG with large context, your main concern will probably be VRAM. This is a small table of context size vs VRAM for llama3.1:8b:
>
> | Context Size | VRAM |
> |--------------|--------|
> | 2048 | 5.1GB |
> | 4096 | 5.4GB |
> | 8192 | 6.2GB |
> | 16384 | 7.7GB |
> | 32768 | 10.8GB |
> | 65536 | 17.3GB |
> | 131072 | 29.9GB |
>
> So a x060 Nvidia card (8GB VRAM) would be good up to about a context size of 8K, x070 (12GB) and x080 (16GB) for 32K, x090 (24GB) for 64K. You can make do with a smaller card with a larger context size, just be aware that some inference will then be done in RAM and will be slower (but still faster than all in RAM).
>
> The llama3.2 3B model is smaller and requires less VRAM (note there is some overhead not shown here, about 400M) so a x080 would be good for a context size of 64K:
>
> | Context Size | VRAM |
> |--------------|--------|
> | 2048 | 2.9GB |
> | 4096 | 3.1GB |
> | 8192 | 3.7GB |
> | 16384 | 5.0GB |
> | 32768 | 7.6GB |
> | 65536 | 12.8GB |
> | 131072 | 23.7GB |

@rick-github commented on GitHub (Oct 28, 2024):

This is the script I use to gather data:

bench.sh
#!/bin/bash

num_gpu=(
    25%
    50%
    75%
    999
    -1
    0
)
contexts=(
    2048
    4096
    8192
    16384
    32768
    65536
    131072
)
versions=(
    0.3.14
)
models=(
    qwen2.5:7b-instruct-q4_K_M
    aya-expanse:8b-q4_K_M
    granite3-dense:8b-instruct-q4_K_M
    granite3-moe:3b-instruct-q4_K_M
    llama3.2:3b-instruct-q4_K_M
    nemotron-mini:4b-instruct-q4_K_M
    mistral-nemo:12b-instruct-2407-q4_K_M
    hermes3:8b-llama3.1-q4_K_M
    phi3.5:3.8b-mini-instruct-q4_K_M
    gemma2:9b-instruct-q4_K_M
    mistral:7b-instruct-v0.3-q4_K_M
    gemma2:27b-instruct-q4_K_M
)

#for m in ${models[*]} ; do
#  docker compose exec -it ollama ollama pull $m
#done

for v in ${versions[*]} ; do
  OLLAMA_DEBUG=1 OLLAMA_KEEP_ALIVE=-1 OLLAMA_DOCKER_TAG=$v OLLAMA_NUM_PARALLEL=1 docker compose up -d ollama
  sleep 5
  for n in ${num_gpu[*]} ; do
    for m in ${models[*]} ; do
      for c in ${contexts[*]} ; do
        ngl=$n
        [[ $n == *% ]] && {
          curl -s localhost:11434/api/generate -d '{"model":"'$m'","options":{"num_ctx":'$c',"num_gpu":-1}}' >/dev/null
          layers=$(docker compose logs ollama | sed -ne 's/.*layers.model=\([^ ]*\) .*/\1/p' | tail -1)
          ngl=$(dc <<< "$layers ${n%\%} * 100/p")
        }
        curl -s localhost:11434/api/generate -d '{"model":"'$m'","options":{"num_ctx":'$c',"num_gpu":'$ngl'}}' >/dev/null
        layers=$(docker compose logs ollama | sed -ne 's/.*layers.model=\([^ ]*\) .*/\1/p' | tail -1)
        loaded=$(docker compose logs ollama | sed -ne 's/.*starting llama server.*--n-gpu-layers \([^ ]*\) .*/\1/p' | tail -1)
        mem=$(curl -s localhost:11434/api/ps | jq -r 'def toGB(p):.*pow(10;p)/1024/1024/1024|round/pow(10;p);.models[]|select(.model=="'$m'")|"\(.size|toGB(1))/\(.size_vram|toGB(1))"')
        tps=$(curl -s localhost:11434/api/generate -d '{"model":"'$m'","prompt":"the sky is blue because ","options":{"temperature":0,"seed":42,"num_predict":200,"num_ctx":'$c',"num_gpu":'$ngl'},"stream":false}' | jq 'def prec(p):.*pow(10;p)|round/pow(10;p);(.eval_count/(.eval_duration/1000000000)|prec(2))')
        echo $v $loaded/$layers $m $c $mem $tps
      done
    done
  done
done

This is the key for the tables below:

aya: aya-expanse:8b-q4_K_M
g2:27: gemma2:27b-instruct-q4_K_M
g2:9: gemma2:9b-instruct-q4_K_M
gd: granite3-dense:8b-instruct-q4_K_M
gm: granite3-moe:3b-instruct-q4_K_M
hl: hermes3:8b-llama3.1-q4_K_M
l3.2: llama3.2:3b-instruct-q4_K_M
mn: mistral-nemo:12b-instruct-2407-q4_K_M
mi: mistral:7b-instruct-v0.3-q4_K_M
nm: nemotron-mini:4b-instruct-q4_K_M
phi: phi3.5:3.8b-mini-instruct-q4_K_M
qwen: qwen2.5:7b-instruct-q4_K_M

The numbers in the tables are the total amount of RAM required for the model, the amount of VRAM used, and the tokens per second that the model produced, in the form total/vram tps. For example, 18.7/15.2 18.25 means the model needed 18.7 GiB in total, 15.2 GiB of it in VRAM, and generated 18.25 tokens per second.

RTX 3080 16GB VRAM, i7-11800H @ 2.30GHz 64GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 64.27 | - | 7.1/7.1 55.88 | 5.7/5.7 67.07 | 2.6/2.6 108.79 | 5.4/5.4 71.27 | 2.9/2.9 133.56 | 7.8/7.8 48.79 | 5/5 78.76 | 3/3 111.61 | 3.6/3.6 124.78 | 5.1/5.1 74.53 |
| 4096 | 6.3/6.3 64.12 | 19.2/15.4 7.85 | 7.8/7.8 55.78 | 6.2/6.2 66.86 | 2.8/2.8 108.88 | 5.7/5.7 71.32 | 3.1/3.1 133.46 | 8.2/8.2 48.81 | 5.4/5.4 78.61 | 3.4/3.4 111.6 | 4.5/4.5 124.82 | 5.2/5.2 74.56 |
| 8192 | 6.9/6.9 64.24 | 20.8/15.5 6.74 | 9.1/9.1 55.72 | 7.3/7.3 66.92 | 3.1/3.1 108.39 | 6.5/6.5 71.12 | 3.7/3.7 133.19 | 9/9 48.8 | 6.2/6.2 78.69 | 4.1/4.1 111.43 | 6.3/6.3 124.56 | 5.6/5.6 74.55 |
| 16384 | 8.4/8.4 64.14 | 23.9/15.1 5.25 | 11.9/11.9 55.65 | 9.4/9.4 66.9 | 3.9/3.9 108.81 | 8/8 71.22 | 5/5 133.13 | 10.8/10.8 48.74 | 7.7/7.7 78.58 | 5.7/5.7 111.51 | 9.9/9.9 124.29 | 6.5/6.5 74.32 |
| 32768 | 11.5/11.5 64.03 | 30.8/15.2 4.05 | 18/15.1 18.43 | 13.6/13.6 66.82 | 5.4/5.4 108.92 | 11.1/11.1 71.09 | 7.6/7.6 133.01 | 14.4/14.4 48.67 | 10.8/10.8 78.52 | 8.7/8.7 111.05 | 17.1/15.2 51.15 | 8.3/8.3 74.27 |
| 65536 | 18.7/15.2 18.25 | 45.1/14.8 3.14 | 30.2/15.3 10.11 | 22.2/15.4 15.39 | 8.5/8.5 107.74 | 17.5/15.2 22.75 | 12.8/12.8 131.81 | 22/15.4 11.73 | 17.2/15.2 32.77 | 14.8/14.8 110.48 | 31.5/15.1 19.88 | 11.9/11.9 74.02 |
| 131072 | 32.3/15.1 9.3 | 73.1/14.6 2.2 | 54.5/14.6 7.68 | 39.3/15 8.51 | 14.6/14.6 107.15 | 30.3/15.5 10.38 | 23.7/15 27.73 | 36.9/15.3 6.96 | - | 27.2/15 23.04 | 60.3/14.7 14.71 | 21/15.2 15.24 |

RTX 4070 12GB VRAM, i7-13700 @ 2.5GHz 96GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 71.46 | 18.8/11.1 5.94 | 7.1/7.1 64.57 | 5.7/5.7 76.83 | 2.6/2.6 139.38 | 5.4/5.4 80.76 | 2.9/2.9 149.56 | 7.8/7.8 54.76 | 5/5 90.71 | 3/3 116.19 | 3.6/3.6 141.79 | 5.1/5.1 83.17 |
| 4096 | 6.3/6.3 71.2 | 19.6/11.3 5.6 | 7.8/7.8 64.52 | 6.2/6.2 76.68 | 2.8/2.8 139.58 | 5.7/5.7 80.61 | 3.1/3.1 149.31 | 8.2/8.2 55.16 | 5.4/5.4 90.29 | 3.4/3.4 118.16 | 4.5/4.5 142.5 | 5.2/5.2 83.18 |
| 8192 | 6.9/6.9 71.32 | 21.1/11.3 5.29 | 9.1/9.1 63.77 | 7.3/7.3 76.59 | 3.1/3.1 139.28 | 6.5/6.5 80.63 | 3.7/3.7 148.87 | 9/9 55.05 | 6.2/6.2 90.23 | 4.1/4.1 118 | 6.3/6.3 141.73 | 5.6/5.6 83.05 |
| 16384 | 8.4/8.4 71.23 | 24.2/11.2 4.7 | 12.5/11.3 26.15 | 9.4/9.4 76.42 | 3.9/3.9 138.4 | 8/8 80.38 | 5/5 146.59 | 10.8/10.8 55.05 | 7.7/7.7 90.14 | 5.7/5.7 117.74 | 9.9/9.9 141.77 | 6.5/6.5 83.05 |
| 32768 | 12/11.2 33.06 | 31/11.3 4.06 | 18.1/11.2 13.85 | 13.7/11.3 26.24 | 5.4/5.4 137.48 | 11.1/11.1 79.58 | 7.6/7.6 147.34 | 14.6/11.3 16.77 | 10.8/10.8 89.96 | 8.7/8.7 117.53 | 17.1/11.3 30.14 | 8.3/8.3 82.91 |
| 65536 | 18.8/11.2 13.56 | 45.2/10.7 3.5 | 30.3/11.1 9.71 | 22.3/11.1 11.58 | 8.5/8.5 137.03 | 17.6/11.2 14.65 | 13/11.2 47.4 | 22.2/11.4 9.94 | 17.3/11.2 18.74 | 14.9/11.3 33.57 | 31.5/11 15.69 | 12.8/11.3 29.7 |
| 131072 | 20.9/0 10.19 | 62.5/0 3.31 | 54.5/11.3 7.82 | 24.9/0 10.43 | 14.7/11.2 74.48 | 30.4/11.1 9.5 | 23.7/11.1 19.86 | 37/11.3 7.31 | 30.1/11.1 12.15 | 27.2/11.2 16.97 | 50.3/0 20.39 | 21.1/11.4 12.23 |

T4x4 64GB VRAM, Xeon(R) CPU @ 2.00GHz 60GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 24.76 | 25/25 7.44 | 7.1/7.1 20.66 | 5.7/5.7 27.67 | 2.6/2.6 78.55 | 5.4/5.4 24.89 | 2.9/2.9 64.39 | 7.8/7.8 15.77 | 5/5 32.93 | 4.9/4.9 45.53 | 3.6/3.6 64.53 | 5.1/5.1 33.64 |
| 4096 | 6.3/6.3 26.08 | 25.8/25.8 7.52 | 7.8/7.8 21.24 | 6.2/6.2 30.39 | 2.8/2.8 78.36 | 5.7/5.7 24.46 | 3.1/3.1 64.06 | 8.2/8.2 17.25 | 5.4/5.4 31.66 | 5.6/5.6 44.78 | 4.5/4.5 54.33 | 5.2/5.2 31.4 |
| 8192 | 6.9/6.9 24.7 | 27.4/27.4 7.62 | 9.1/9.1 26.05 | 7.3/7.3 29.52 | 3.1/3.1 77.01 | 6.5/6.5 25.56 | 3.7/3.7 63.36 | 9/9 16 | 6.2/6.2 31.74 | 7.2/7.2 46.55 | 6.3/6.3 64.05 | 5.6/5.6 33.01 |
| 16384 | 8.4/8.4 25.43 | 30.5/30.5 7.56 | 11.9/11.9 25.5 | 9.4/9.4 28.1 | 3.9/3.9 77.26 | 8/8 26.41 | 5/5 63.26 | 10.8/10.8 16.11 | 7.7/7.7 30.7 | 5.7/5.7 54.58 | 9.9/9.9 60.14 | 6.5/6.5 27.01 |
| 32768 | 11.5/11.5 24.65 | 40.1/40.1 7.57 | 24.5/24.5 17.98 | 13.6/13.6 24.29 | 5.4/5.4 76.31 | 11.1/11.1 30.19 | 14.7/14.7 44.17 | 23.5/23.5 18.5 | 10.8/10.8 29.28 | 8.7/8.7 58.33 | 25.7/25.7 42.67 | 8.3/8.3 24.93 |
| 65536 | 36.6/36.6 22.16 | 61.3/55.4 4.97 | 41.3/41.3 17.76 | 44.5/44.5 23.62 | 8.5/8.5 75.56 | 33.2/33.2 25.22 | 25.5/25.5 45.29 | 37.9/37.9 16.88 | 32.9/32.9 28.31 | 29.1/29.1 47.39 | 47.2/47.2 41.22 | 11.9/11.9 32.8 |
| 131072 | 66.2/55.3 9.15 | 104.8/52.9 1.78 | 74.9/54.1 7.93 | 24.9/0 5.87 | 28.9/28.9 41.45 | 59.7/56.8 14.71 | 47/47 42.2 | 66.8/55.7 7.95 | 59.4/56.8 18.33 | 54.1/54.1 41.66 | 90.3/52.6 11.71 | 50.5/50.5 24.71 |

A100 40GB VRAM, Xeon(R) CPU @ 2.20GHz 85GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 76.68 | 17.5/17.5 37.34 | 7.1/7.1 62.31 | 5.7/5.7 96.46 | 2.6/2.6 68.29 | 5.4/5.4 95.57 | 2.9/2.9 127.99 | 7.8/7.8 73.49 | 5/5 121.56 | 3/3 92.46 | 3.6/3.6 152.1 | 5.1/5.1 93.19 |
| 4096 | 6.3/6.3 77.14 | 18.2/18.2 37.78 | 7.8/7.8 60.58 | 6.2/6.2 96.37 | 2.8/2.8 66.73 | 5.7/5.7 96.04 | 3.1/3.1 128.43 | 8.2/8.2 73.74 | 5.4/5.4 119.72 | 3.4/3.4 90.73 | 4.5/4.5 154.02 | 5.2/5.2 93.62 |
| 8192 | 6.9/6.9 72.61 | 19.8/19.8 37.45 | 9.1/9.1 60.4 | 7.3/7.3 95.94 | 3.1/3.1 69.82 | 6.5/6.5 96.7 | 3.7/3.7 128.94 | 9/9 72.87 | 6.2/6.2 120.57 | 4.1/4.1 92.38 | 6.3/6.3 151.74 | 5.6/5.6 94.54 |
| 16384 | 8.4/8.4 76.12 | 23.2/23.2 37.71 | 11.9/11.9 61.32 | 9.4/9.4 93.68 | 3.9/3.9 68.43 | 8/8 94.69 | 5/5 126.95 | 10.8/10.8 72.8 | 7.7/7.7 117.92 | 5.7/5.7 91.66 | 9.9/9.9 152.65 | 6.5/6.5 93.03 |
| 32768 | 11.5/11.5 75.25 | 30.1/30.1 37.17 | 17.8/17.8 57.57 | 13.6/13.6 94.97 | 5.4/5.4 63.77 | 11.1/11.1 93.31 | 7.6/7.6 125.46 | 14.4/14.4 71.98 | 10.8/10.8 117.84 | 8.7/8.7 90.33 | 17.1/17.1 149.09 | 8.3/8.3 92.13 |
| 65536 | 17.7/17.7 74.51 | - | 29.6/29.6 59.08 | 22.1/22.1 93.36 | 8.5/8.5 65.75 | 17.3/17.3 89.6 | 12.8/12.8 126.88 | 21.6/21.6 71.24 | 17/17 114.87 | 14.8/14.8 89.81 | 31.4/31.4 144.09 | 11.9/11.9 90.56 |
| 131072 | 30.1/30.1 73.51 | 72.6/38.2 2.8 | 54.3/37.9 8.69 | - | 14.6/14.6 65 | 29.7/29.7 88.94 | 23.2/23.2 118.93 | 36/36 68.98 | 29.4/29.4 114.21 | 27.1/27.1 85.17 | 60.2/38.2 16.58 | 19.2/19.2 87.23 |

@robotom commented on GitHub (Nov 14, 2024):

> This is the script I use to gather data:
> bench.sh
>
> This is the key for the tables below:

I am looking to run llama3.1:405B on 4 x H100s. I can run ollama show [model] to find out the max context window. I want to pass some really large PDFs to it for analysis, and I want it to understand them thoroughly. I just don't know how to determine how much text corresponds to that context window, because if I try to upload 2.3 million words (~16.05 million characters) of text, the 70B will reject it, for example, saying the argument list is too long. I'm currently downloading the 405B and hoping it works there. Any advice in general? (Or do I have to start training my own models on my own data?) My preference is that anyone can just come drop a giant file on the model and it can handle it, but perhaps this is unrealistic. Thanks!
