Fetch Failed Error on using OLLAMA locally with nomic-embed-text and llama3.1:8b #4327

Closed
opened 2025-11-12 12:15:38 -06:00 by GiteaMirror · 12 comments

Originally created by @saisandeepbalbari on GitHub (Sep 18, 2024).

What is the issue?

I'm using Ollama with AnythingLLM, and it's taking a lot of time to respond to prompts.

This is the error I'm getting from the Docker logs of AnythingLLM:

[OllamaEmbedder] Embedding 1 chunks of text with nomic-embed-text:latest.
TypeError: fetch failed
at node:internal/deps/undici/undici:12618:11
at async createOllamaStream (/app/server/node_modules/@langchain/community/dist/utils/ollama.cjs:12:22)
at async createOllamaChatStream (/app/server/node_modules/@langchain/community/dist/utils/ollama.cjs:61:5)
at async ChatOllama._streamResponseChunks (/app/server/node_modules/@langchain/community/dist/chat_models/ollama.cjs:399:30)
at async ChatOllama._streamIterator (/app/server/node_modules/@langchain/core/dist/language_models/chat_models.cjs:82:34)
at async ChatOllama.transform (/app/server/node_modules/@langchain/core/dist/runnables/base.cjs:382:9)
at async wrapInputForTracing (/app/server/node_modules/@langchain/core/dist/runnables/base.cjs:258:30)
at async pipeGeneratorWithSetup (/app/server/node_modules/@langchain/core/dist/utils/stream.cjs:230:19)
at async StringOutputParser._transformStreamWithConfig (/app/server/node_modules/@langchain/core/dist/runnables/base.cjs:279:26)
at async StringOutputParser.transform (/app/server/node_modules/@langchain/core/dist/output_parsers/transform.cjs:36:9) {
cause: HeadersTimeoutError: Headers Timeout Error
at Timeout.onParserTimeout [as callback] (node:internal/deps/undici/undici:9117:32)
at Timeout.onTimeout [as _onTimeout] (node:internal/deps/undici/undici:7148:17)
at listOnTimeout (node:internal/timers:569:17)
at process.processTimers (node:internal/timers:512:7) {
code: 'UND_ERR_HEADERS_TIMEOUT'
}
}

OS

Linux

GPU

No response

CPU

Intel

Ollama version

No response

GiteaMirror added the bug label 2025-11-12 12:15:38 -06:00

@rick-github commented on GitHub (Sep 18, 2024):

What are the logs from the ollama container? What's the Anything LLM config you are setting to talk to ollama?

@saisandeepbalbari commented on GitHub (Sep 19, 2024):

The following are the logs

Sep 19 05:04:09 ip-15-0-125-54 ollama[39143]: time=2024-09-19T05:04:09.799Z level=INFO source=server.go:624 msg="waiting for server to become available" status="llm server loading model"
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_vocab: special tokens cache size = 256
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_vocab: token to piece cache size = 0.7999 MB
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: format = GGUF V3 (latest)
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: arch = llama
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: vocab type = BPE
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_vocab = 128256
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_merges = 280147
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: vocab_only = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_ctx_train = 131072
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd = 4096
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_layer = 32
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_head = 32
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_head_kv = 8
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_rot = 128
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_swa = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_head_k = 128
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_head_v = 128
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_gqa = 4
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_k_gqa = 1024
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_embd_v_gqa = 1024
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_norm_eps = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_norm_rms_eps = 1.0e-05
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_clamp_kqv = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_max_alibi_bias = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: f_logit_scale = 0.0e+00
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_ff = 14336
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_expert = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_expert_used = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: causal attn = 1
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: pooling type = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: rope type = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: rope scaling = linear
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: freq_base_train = 500000.0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: freq_scale_train = 1
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: n_ctx_orig_yarn = 131072
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: rope_finetuned = unknown
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_d_conv = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_d_inner = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_d_state = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_dt_rank = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: ssm_dt_b_c_rms = 0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model type = 8B
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model ftype = Q4_0
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model params = 8.03 B
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: general.name = Meta Llama 3.1 8B Instruct
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: LF token = 128 'Ä'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_print_meta: max token length = 256
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_tensors: ggml ctx size = 0.14 MiB
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: warning: failed to mlock 4653387776-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: Try increasing RLIMIT_MEMLOCK ('ulimit -l' as root).
Sep 19 05:04:10 ip-15-0-125-54 ollama[39143]: llm_load_tensors: CPU buffer size = 4437.81 MiB
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: n_ctx = 8192
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: n_batch = 512
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: n_ubatch = 512
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: flash_attn = 0
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: freq_base = 500000.0
Sep 19 05:04:12 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: freq_scale = 1
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_kv_cache_init: CPU KV buffer size = 1024.00 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: CPU output buffer size = 2.02 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: CPU compute buffer size = 560.01 MiB
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: graph nodes = 1030
Sep 19 05:04:13 ip-15-0-125-54 ollama[39143]: llama_new_context_with_model: graph splits = 1
Sep 19 05:04:14 ip-15-0-125-54 ollama[186217]: INFO [main] model loaded | tid="129247310618752" timestamp=1726722254
Sep 19 05:04:14 ip-15-0-125-54 ollama[39143]: time=2024-09-19T05:04:14.422Z level=INFO source=server.go:629 msg="llama runner started in 4.87 seconds"
Sep 19 05:09:00 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/19 - 05:09:00 | 200 | 5m1s | 172.17.0.3 | POST "/api/chat"

@rick-github commented on GitHub (Sep 19, 2024):

Need the full log.

@saisandeepbalbari commented on GitHub (Sep 19, 2024):

Here I'm sharing the complete log file, please find attached:

ollama_logs_from_server_start.txt (https://github.com/user-attachments/files/17069943/ollama_logs_from_server_start.txt)

@rick-github commented on GitHub (Sep 21, 2024):

I see two issues here.

mlock

You have no GPU, 16GB of RAM most of which is free, and no swap.

Sep 16 17:30:27 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:30:27.551Z level=INFO source=gpu.go:347 msg="no compatible GPUs were discovered"
Sep 16 17:37:20 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:37:20.132Z level=INFO source=server.go:101 msg="system memory" total="15.4 GiB" free="14.4 GiB" free_swap="0 B"

The embedding model loads fine.

Sep 16 17:37:21 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 17:37:21 | 200 |  1.574410247s |      172.17.0.3 | POST     "/api/embeddings"

The llama3.1:8b model wants to load into memory and needs 5.8G:

Sep 16 17:38:33 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:38:33.543Z level=INFO source=memory.go:326 msg="offload to cpu" layers.requested=-1 layers.model=33 layers.offload=0 layers.split="" memory.available="[13.8 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.8 GiB" memory.required.partial="0 B" memory.required.kv="1.0 GiB" memory.required.allocations="[5.8 GiB]" memory.weights.total="4.7 GiB" memory.weights.repeating="4.3 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="560.0 MiB" memory.graph.partial="677.5 MiB"

The model is loaded with --mlock. This is probably because the client (Anything LLM) is sending "use_mlock":true in the API call:

Sep 16 17:38:33 ip-15-0-125-54 ollama[39143]: time=2024-09-16T17:38:33.543Z level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama320348475/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 38237"

On Linux systems, a common limit on how much memory a process can lock is ~4G:

$ echo $(dc <<< "$(ulimit -l) 1k 1024/1024/p")G
3.8G

ollama tries to lock 4.3G and fails:

ollama-1  | warning: failed to mlock 4653387776-byte buffer (after previously locking 0 bytes): Cannot allocate memory

This is not a fatal error, and can be remedied by setting the value of RLIMIT_MEMLOCK to something larger. It looks like you are using docker, so you can do that with:

services:
  ollama:
    ulimits:
      memlock:
        soft: 8192000000
        hard: 8192000000
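
To confirm the new limit took effect, a quick check from inside the container (assuming the compose service is named ollama; ulimit -l reports the limit in kbytes):

$ docker compose exec ollama sh -c 'ulimit -l'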

timeouts

Almost all of the calls to /api/chat are taking more than 5 minutes.

Sep 16 17:42:36 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 17:42:36 | 200 |          4m2s |      172.17.0.3 | POST     "/api/chat"
Sep 16 17:50:05 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 17:50:05 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 16 18:01:17 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 18:01:17 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 16 18:10:40 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 18:10:40 | 200 |         1m39s |      172.17.0.2 | POST     "/api/chat"
Sep 16 18:18:12 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/16 - 18:18:12 | 200 |         6m14s |      172.17.0.2 | POST     "/api/chat"
Sep 17 05:58:56 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/17 - 05:58:56 | 200 |         6m18s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:06:31 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:06:31 | 200 |         5m30s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:38:35 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:38:35 | 200 |         5m37s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:39:02 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:39:02 | 200 | 16.797917219s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:51:45 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:51:45 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:52:50 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:52:50 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 05:59:36 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 05:59:36 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:12:24 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:12:24 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:14:37 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:14:37 | 200 |        12m12s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:14:43 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:14:43 | 200 |          5m0s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:32:39 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:32:39 | 200 |         6m52s |      172.17.0.3 | POST     "/api/chat"
Sep 18 06:37:49 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:37:49 | 200 |          5m1s |      172.17.0.2 | POST     "/api/chat"
Sep 18 06:40:07 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:40:07 | 200 |         5m15s |      172.17.0.2 | POST     "/api/chat"
Sep 18 06:45:38 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 06:45:38 | 200 |          5m0s |      172.17.0.2 | POST     "/api/chat"
Sep 18 07:55:16 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 07:55:16 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 18 07:56:07 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/18 - 07:56:07 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 19 05:03:10 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/19 - 05:03:10 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"
Sep 19 05:09:00 ip-15-0-125-54 ollama[39143]: [GIN] 2024/09/19 - 05:09:00 | 200 |          5m1s |      172.17.0.3 | POST     "/api/chat"

Many complete just after 5 minutes, so it's possible there's a timeout on the client side for some of these calls. Since you don't have a GPU, inference will take longer than on a system with one, but > 5m is a poor experience. Since you are using an embedding model, I'm going to guess that you have some sort of RAG system and are sending a bunch of context along with the query. More prompt means more prompt processing, so bigger queries will take longer to return a result. To mitigate this, you can try sending less context with the queries.

Another possibility is that you are exceeding the size of the context window, which is using the default of 2k per model thread:

Sep 19 05:04:09 ip-15-0-125-54 ollama[39143]: time=2024-09-19T05:04:09.547Z level=INFO source=server.go:391 msg="starting llama server" cmd="/tmp/ollama320348475/runners/cpu_avx2/ollama_llama_server --model /usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe --ctx-size 8192 --batch-size 512 --embedding --log-disable --no-mmap --mlock --parallel 4 --port 37153"

Exceeding the context window causes a model to slow down as it discards a bunch of the tokens in the input stream, which may be contributing to the long response times. If you enable extra debugging with OLLAMA_DEBUG=1 in the server process, we can get a better idea of what's causing the slow responses.
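
Since these logs are coming from journald rather than a container, ollama appears to be running as a systemd service; a minimal sketch of turning on debug logging in that case (assuming the service is named ollama):

$ sudo systemctl edit ollama
# in the override file that opens, add:
#   [Service]
#   Environment="OLLAMA_DEBUG=1"
$ sudo systemctl restart ollama
$ journalctl -u ollama -f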

If it is a case of exceeding the context window, you can either adjust the client to request a larger context by setting "options":{"num_ctx":8192} in the API call, or you can create a model with a larger context window and use that instead of llama3.1:8b:

$ cat > Modelfile <<EOF
FROM llama3.1:8b
PARAMETER num_ctx 8192
EOF
$ ollama create llama3.1:8b-8kcontext
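
If you prefer the API route instead, a minimal curl sketch of requesting a larger context per call (the prompt here is just a placeholder):

$ curl http://localhost:11434/api/chat -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "why is the sky blue?"}],
    "options": {"num_ctx": 8192}
  }'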

Note that increasing the context window will require more RAM, so be aware of the available RAM on your system.

@saisandeepbalbari commented on GitHub (Sep 30, 2024):

Thank you @rick-github

I have made the changes you suggested, but I'm still not able to load the llama3.1:8b model.

However, I'm able to load and run a smaller model, qwen2, and get a response.

But in both cases I'm getting the "cannot allocate memory" error. Is it because the system has only 6.4 GB of free memory after allocating 8 GB of swap?

Here are the logs from using both of these models:

ollama_logs_27_sep_2024_llama3.1.txt (https://github.com/user-attachments/files/17184124/ollama_logs_27_sep_2024_llama3.1.txt)
ollama_logs_27_sep_2024_qwen2.txt (https://github.com/user-attachments/files/17184125/ollama_logs_27_sep_2024_qwen2.txt)

@rick-github commented on GitHub (Sep 30, 2024):

time=2024-09-27T09:49:47.649Z level=INFO source=server.go:629 msg="llama runner started in 50.78 seconds"
[GIN] 2024/09/27 - 09:53:58 | 200 |          5m1s |      172.17.0.2 | POST     "/api/chat"

The llama3.1:8b model loads, but you don't get a response because it's taking longer than the 5-minute timeout you have in the client. You need to increase the timeout in the client, send prompts that are easier to respond to, or get a GPU so that the model runs faster. You could also try the new llama3.2 models (https://ollama.com/library/llama3.2); Meta say they are as good as the llama3.1 models, and because they are smaller they will run faster.
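
One way to confirm it's the client timeout rather than ollama itself is to time the same request directly against the API, bypassing Anything LLM; a rough sketch (the prompt is just a placeholder):

$ time curl -s http://localhost:11434/api/generate \
    -d '{"model":"llama3.1:8b","prompt":"why is the sky blue?","stream":false}' \
    -o /dev/null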

Regarding the memory error, it only indicates that the ollama server wasn't able to pin the model in RAM to prevent it from going to swap. mlock just avoids some slowdown in processing (unless you have a really slow disk that you are swapping to, it won't make much difference to inference speed), but it's better to lock the model in RAM if possible. Did you add the ulimits parameters to the compose file? What's the output of docker container inspect ollama | jq '.[].HostConfig.Ulimits'?

@saisandeepbalbari commented on GitHub (Sep 30, 2024):

Thank you for your quick response.

I will try to increase the timeout and also try the llama3.2 models, and if that doesn't work I'll use a machine with GPUs.

In case I need to use a GPU, what is the minimum required configuration I should have?

I'm not running Ollama using Docker; I'm running it in the background on the system.

I have modified the /etc/security/limits.conf file by adding these two lines:

* soft memlock 8192000000
* hard memlock 8192000000
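
Note that limits.conf generally only applies to login sessions; if ollama runs as a systemd service, the memlock limit would instead be raised in the unit itself. A minimal sketch, assuming the service is named ollama:

$ sudo systemctl edit ollama
# in the override file that opens, add:
#   [Service]
#   LimitMEMLOCK=infinity
$ sudo systemctl restart ollama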

@rick-github commented on GitHub (Sep 30, 2024):

Any GPU will speed up the inference process. Since it looks like you are doing some sort of RAG with large context, your main concern will probably be VRAM. This is a small table of context size vs VRAM for llama3.1:8b:

| Context Size | VRAM |
|--------------|--------|
| 2048 | 5.1GB |
| 4096 | 5.4GB |
| 8192 | 6.2GB |
| 16384 | 7.7GB |
| 32768 | 10.8GB |
| 65536 | 17.3GB |
| 131072 | 29.9GB |

So a x060 Nvidia card (8GB VRAM) would be good up to about a context size of 8K, x070 (12GB) and x080 (16GB) for 32K, x090 (24GB) for 64K. You can make do with a smaller card with a larger context size, just be aware that some inference will then be done in RAM and will be slower (but still faster than all in RAM).

The llama3.2 3B model is smaller and requires less VRAM (note there is some overhead not shown here, about 400M) so a x080 would be good for a context size of 64K:

| Context Size | VRAM |
|--------------|--------|
| 2048 | 2.9GB |
| 4096 | 3.1GB |
| 8192 | 3.7GB |
| 16384 | 5.0GB |
| 32768 | 7.6GB |
| 65536 | 12.8GB |
| 131072 | 23.7GB |
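
Once a model is loaded, ollama ps reports its total size and the CPU/GPU split (on recent versions the PROCESSOR column shows e.g. "100% GPU"), which is a quick way to check whether a given card and context size still fit in VRAM; a minimal sketch:

$ ollama ps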

@robotom commented on GitHub (Oct 26, 2024):

Really helpful. Could you tell me how to determine this for any other model? Like a Qwen for example.

> Any GPU will speed up the inference process. Since it looks like you are doing some sort of RAG with large context, your main concern will probably be VRAM. This is a small table of context size vs VRAM for llama3.1:8b:
>
> | Context Size | VRAM |
> |--------------|--------|
> | 2048 | 5.1GB |
> | 4096 | 5.4GB |
> | 8192 | 6.2GB |
> | 16384 | 7.7GB |
> | 32768 | 10.8GB |
> | 65536 | 17.3GB |
> | 131072 | 29.9GB |
>
> So a x060 Nvidia card (8GB VRAM) would be good up to about a context size of 8K, x070 (12GB) and x080 (16GB) for 32K, x090 (24GB) for 64K. You can make do with a smaller card with a larger context size, just be aware that some inference will then be done in RAM and will be slower (but still faster than all in RAM).
>
> The llama3.2 3B model is smaller and requires less VRAM (note there is some overhead not shown here, about 400M) so a x080 would be good for a context size of 64K:
>
> | Context Size | VRAM |
> |--------------|--------|
> | 2048 | 2.9GB |
> | 4096 | 3.1GB |
> | 8192 | 3.7GB |
> | 16384 | 5.0GB |
> | 32768 | 7.6GB |
> | 65536 | 12.8GB |
> | 131072 | 23.7GB |

@rick-github commented on GitHub (Oct 28, 2024):

This is the script I use to gather data:

bench.sh
#!/bin/bash

num_gpu=(
    25%
    50%
    75%
    999
    -1
    0
)
contexts=(
    2048
    4096
    8192
    16384
    32768
    65536
    131072
)
versions=(
    0.3.14
)
models=(
    qwen2.5:7b-instruct-q4_K_M
    aya-expanse:8b-q4_K_M
    granite3-dense:8b-instruct-q4_K_M
    granite3-moe:3b-instruct-q4_K_M
    llama3.2:3b-instruct-q4_K_M
    nemotron-mini:4b-instruct-q4_K_M
    mistral-nemo:12b-instruct-2407-q4_K_M
    hermes3:8b-llama3.1-q4_K_M
    phi3.5:3.8b-mini-instruct-q4_K_M
    gemma2:9b-instruct-q4_K_M
    mistral:7b-instruct-v0.3-q4_K_M
    gemma2:27b-instruct-q4_K_M
)

#for m in ${models[*]} ; do
#  docker compose exec -it ollama ollama pull $m
#done

for v in ${versions[*]} ; do
  OLLAMA_DEBUG=1 OLLAMA_KEEP_ALIVE=-1 OLLAMA_DOCKER_TAG=$v OLLAMA_NUM_PARALLEL=1 docker compose up -d ollama
  sleep 5
  for n in ${num_gpu[*]} ; do
    for m in ${models[*]} ; do
      for c in ${contexts[*]} ; do
        ngl=$n
        [[ $n == *% ]] && {
          curl -s localhost:11434/api/generate -d '{"model":"'$m'","options":{"num_ctx":'$c',"num_gpu":-1}}' >/dev/null
          layers=$(docker compose logs ollama | sed -ne 's/.*layers.model=\([^ ]*\) .*/\1/p' | tail -1)
          ngl=$(dc <<< "$layers ${n%\%} * 100/p")
        }
        curl -s localhost:11434/api/generate -d '{"model":"'$m'","options":{"num_ctx":'$c',"num_gpu":'$ngl'}}' >/dev/null
        layers=$(docker compose logs ollama | sed -ne 's/.*layers.model=\([^ ]*\) .*/\1/p' | tail -1)
        loaded=$(docker compose logs ollama | sed -ne 's/.*starting llama server.*--n-gpu-layers \([^ ]*\) .*/\1/p' | tail -1)
        mem=$(curl -s localhost:11434/api/ps | jq -r 'def toGB(p):.*pow(10;p)/1024/1024/1024|round/pow(10;p);.models[]|select(.model=="'$m'")|"\(.size|toGB(1))/\(.size_vram|toGB(1))"')
        tps=$(curl -s localhost:11434/api/generate -d '{"model":"'$m'","prompt":"the sky is blue because ","options":{"temperature":0,"seed":42,"num_predict":200,"num_ctx":'$c',"num_gpu":'$ngl'},"stream":false}' | jq 'def prec(p):.*pow(10;p)|round/pow(10;p);(.eval_count/(.eval_duration/1000000000)|prec(2))')
        echo $v $loaded/$layers $m $c $mem $tps
      done
    done
  done
done

This is the key for the tables below:

aya: aya-expanse:8b-q4_K_M
g2:27: gemma2:27b-instruct-q4_K_M
g2:9: gemma2:9b-instruct-q4_K_M
gd: granite3-dense:8b-instruct-q4_K_M
gm: granite3-moe:3b-instruct-q4_K_M
hl: hermes3:8b-llama3.1-q4_K_M
l3.2: llama3.2:3b-instruct-q4_K_M
mn: mistral-nemo:12b-instruct-2407-q4_K_M
mi: mistral:7b-instruct-v0.3-q4_K_M
nm: nemotron-mini:4b-instruct-q4_K_M
phi: phi3.5:3.8b-mini-instruct-q4_K_M
qwen: qwen2.5:7b-instruct-q4_K_M

The numbers in the tables are the total amount of RAM required for the model, the amount of VRAM used, and the tokens per second that the model produced, in the form total/vram tps. For example, 18.7/15.2 18.25 means the model needed 18.7 GiB in total, 15.2 GiB of it in VRAM, and generated 18.25 tokens per second.

RTX 3080 16GB VRAM, i7-11800H @ 2.30GHz 64GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 64.27 | - | 7.1/7.1 55.88 | 5.7/5.7 67.07 | 2.6/2.6 108.79 | 5.4/5.4 71.27 | 2.9/2.9 133.56 | 7.8/7.8 48.79 | 5/5 78.76 | 3/3 111.61 | 3.6/3.6 124.78 | 5.1/5.1 74.53 |
| 4096 | 6.3/6.3 64.12 | 19.2/15.4 7.85 | 7.8/7.8 55.78 | 6.2/6.2 66.86 | 2.8/2.8 108.88 | 5.7/5.7 71.32 | 3.1/3.1 133.46 | 8.2/8.2 48.81 | 5.4/5.4 78.61 | 3.4/3.4 111.6 | 4.5/4.5 124.82 | 5.2/5.2 74.56 |
| 8192 | 6.9/6.9 64.24 | 20.8/15.5 6.74 | 9.1/9.1 55.72 | 7.3/7.3 66.92 | 3.1/3.1 108.39 | 6.5/6.5 71.12 | 3.7/3.7 133.19 | 9/9 48.8 | 6.2/6.2 78.69 | 4.1/4.1 111.43 | 6.3/6.3 124.56 | 5.6/5.6 74.55 |
| 16384 | 8.4/8.4 64.14 | 23.9/15.1 5.25 | 11.9/11.9 55.65 | 9.4/9.4 66.9 | 3.9/3.9 108.81 | 8/8 71.22 | 5/5 133.13 | 10.8/10.8 48.74 | 7.7/7.7 78.58 | 5.7/5.7 111.51 | 9.9/9.9 124.29 | 6.5/6.5 74.32 |
| 32768 | 11.5/11.5 64.03 | 30.8/15.2 4.05 | 18/15.1 18.43 | 13.6/13.6 66.82 | 5.4/5.4 108.92 | 11.1/11.1 71.09 | 7.6/7.6 133.01 | 14.4/14.4 48.67 | 10.8/10.8 78.52 | 8.7/8.7 111.05 | 17.1/15.2 51.15 | 8.3/8.3 74.27 |
| 65536 | 18.7/15.2 18.25 | 45.1/14.8 3.14 | 30.2/15.3 10.11 | 22.2/15.4 15.39 | 8.5/8.5 107.74 | 17.5/15.2 22.75 | 12.8/12.8 131.81 | 22/15.4 11.73 | 17.2/15.2 32.77 | 14.8/14.8 110.48 | 31.5/15.1 19.88 | 11.9/11.9 74.02 |
| 131072 | 32.3/15.1 9.3 | 73.1/14.6 2.2 | 54.5/14.6 7.68 | 39.3/15 8.51 | 14.6/14.6 107.15 | 30.3/15.5 10.38 | 23.7/15 27.73 | 36.9/15.3 6.96 | - | 27.2/15 23.04 | 60.3/14.7 14.71 | 21/15.2 15.24 |

RTX 4070 12GB VRAM, i7-13700 @ 2.5GHz 96GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 71.46 | 18.8/11.1 5.94 | 7.1/7.1 64.57 | 5.7/5.7 76.83 | 2.6/2.6 139.38 | 5.4/5.4 80.76 | 2.9/2.9 149.56 | 7.8/7.8 54.76 | 5/5 90.71 | 3/3 116.19 | 3.6/3.6 141.79 | 5.1/5.1 83.17 |
| 4096 | 6.3/6.3 71.2 | 19.6/11.3 5.6 | 7.8/7.8 64.52 | 6.2/6.2 76.68 | 2.8/2.8 139.58 | 5.7/5.7 80.61 | 3.1/3.1 149.31 | 8.2/8.2 55.16 | 5.4/5.4 90.29 | 3.4/3.4 118.16 | 4.5/4.5 142.5 | 5.2/5.2 83.18 |
| 8192 | 6.9/6.9 71.32 | 21.1/11.3 5.29 | 9.1/9.1 63.77 | 7.3/7.3 76.59 | 3.1/3.1 139.28 | 6.5/6.5 80.63 | 3.7/3.7 148.87 | 9/9 55.05 | 6.2/6.2 90.23 | 4.1/4.1 118 | 6.3/6.3 141.73 | 5.6/5.6 83.05 |
| 16384 | 8.4/8.4 71.23 | 24.2/11.2 4.7 | 12.5/11.3 26.15 | 9.4/9.4 76.42 | 3.9/3.9 138.4 | 8/8 80.38 | 5/5 146.59 | 10.8/10.8 55.05 | 7.7/7.7 90.14 | 5.7/5.7 117.74 | 9.9/9.9 141.77 | 6.5/6.5 83.05 |
| 32768 | 12/11.2 33.06 | 31/11.3 4.06 | 18.1/11.2 13.85 | 13.7/11.3 26.24 | 5.4/5.4 137.48 | 11.1/11.1 79.58 | 7.6/7.6 147.34 | 14.6/11.3 16.77 | 10.8/10.8 89.96 | 8.7/8.7 117.53 | 17.1/11.3 30.14 | 8.3/8.3 82.91 |
| 65536 | 18.8/11.2 13.56 | 45.2/10.7 3.5 | 30.3/11.1 9.71 | 22.3/11.1 11.58 | 8.5/8.5 137.03 | 17.6/11.2 14.65 | 13/11.2 47.4 | 22.2/11.4 9.94 | 17.3/11.2 18.74 | 14.9/11.3 33.57 | 31.5/11 15.69 | 12.8/11.3 29.7 |
| 131072 | 20.9/0 10.19 | 62.5/0 3.31 | 54.5/11.3 7.82 | 24.9/0 10.43 | 14.7/11.2 74.48 | 30.4/11.1 9.5 | 23.7/11.1 19.86 | 37/11.3 7.31 | 30.1/11.1 12.15 | 27.2/11.2 16.97 | 50.3/0 20.39 | 21.1/11.4 12.23 |

T4x4 64GB VRAM, Xeon(R) CPU @ 2.00GHz 60GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 24.76 | 25/25 7.44 | 7.1/7.1 20.66 | 5.7/5.7 27.67 | 2.6/2.6 78.55 | 5.4/5.4 24.89 | 2.9/2.9 64.39 | 7.8/7.8 15.77 | 5/5 32.93 | 4.9/4.9 45.53 | 3.6/3.6 64.53 | 5.1/5.1 33.64 |
| 4096 | 6.3/6.3 26.08 | 25.8/25.8 7.52 | 7.8/7.8 21.24 | 6.2/6.2 30.39 | 2.8/2.8 78.36 | 5.7/5.7 24.46 | 3.1/3.1 64.06 | 8.2/8.2 17.25 | 5.4/5.4 31.66 | 5.6/5.6 44.78 | 4.5/4.5 54.33 | 5.2/5.2 31.4 |
| 8192 | 6.9/6.9 24.7 | 27.4/27.4 7.62 | 9.1/9.1 26.05 | 7.3/7.3 29.52 | 3.1/3.1 77.01 | 6.5/6.5 25.56 | 3.7/3.7 63.36 | 9/9 16 | 6.2/6.2 31.74 | 7.2/7.2 46.55 | 6.3/6.3 64.05 | 5.6/5.6 33.01 |
| 16384 | 8.4/8.4 25.43 | 30.5/30.5 7.56 | 11.9/11.9 25.5 | 9.4/9.4 28.1 | 3.9/3.9 77.26 | 8/8 26.41 | 5/5 63.26 | 10.8/10.8 16.11 | 7.7/7.7 30.7 | 5.7/5.7 54.58 | 9.9/9.9 60.14 | 6.5/6.5 27.01 |
| 32768 | 11.5/11.5 24.65 | 40.1/40.1 7.57 | 24.5/24.5 17.98 | 13.6/13.6 24.29 | 5.4/5.4 76.31 | 11.1/11.1 30.19 | 14.7/14.7 44.17 | 23.5/23.5 18.5 | 10.8/10.8 29.28 | 8.7/8.7 58.33 | 25.7/25.7 42.67 | 8.3/8.3 24.93 |
| 65536 | 36.6/36.6 22.16 | 61.3/55.4 4.97 | 41.3/41.3 17.76 | 44.5/44.5 23.62 | 8.5/8.5 75.56 | 33.2/33.2 25.22 | 25.5/25.5 45.29 | 37.9/37.9 16.88 | 32.9/32.9 28.31 | 29.1/29.1 47.39 | 47.2/47.2 41.22 | 11.9/11.9 32.8 |
| 131072 | 66.2/55.3 9.15 | 104.8/52.9 1.78 | 74.9/54.1 7.93 | 24.9/0 5.87 | 28.9/28.9 41.45 | 59.7/56.8 14.71 | 47/47 42.2 | 66.8/55.7 7.95 | 59.4/56.8 18.33 | 54.1/54.1 41.66 | 90.3/52.6 11.71 | 50.5/50.5 24.71 |

A100 40GB VRAM, Xeon(R) CPU @ 2.20GHz 85GB RAM

| Context | aya | g2:27 | g2:9 | gd | gm | hl | l3.2 | mn | mi | nm | phi | qwen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2048 | 6/6 76.68 | 17.5/17.5 37.34 | 7.1/7.1 62.31 | 5.7/5.7 96.46 | 2.6/2.6 68.29 | 5.4/5.4 95.57 | 2.9/2.9 127.99 | 7.8/7.8 73.49 | 5/5 121.56 | 3/3 92.46 | 3.6/3.6 152.1 | 5.1/5.1 93.19 |
| 4096 | 6.3/6.3 77.14 | 18.2/18.2 37.78 | 7.8/7.8 60.58 | 6.2/6.2 96.37 | 2.8/2.8 66.73 | 5.7/5.7 96.04 | 3.1/3.1 128.43 | 8.2/8.2 73.74 | 5.4/5.4 119.72 | 3.4/3.4 90.73 | 4.5/4.5 154.02 | 5.2/5.2 93.62 |
| 8192 | 6.9/6.9 72.61 | 19.8/19.8 37.45 | 9.1/9.1 60.4 | 7.3/7.3 95.94 | 3.1/3.1 69.82 | 6.5/6.5 96.7 | 3.7/3.7 128.94 | 9/9 72.87 | 6.2/6.2 120.57 | 4.1/4.1 92.38 | 6.3/6.3 151.74 | 5.6/5.6 94.54 |
| 16384 | 8.4/8.4 76.12 | 23.2/23.2 37.71 | 11.9/11.9 61.32 | 9.4/9.4 93.68 | 3.9/3.9 68.43 | 8/8 94.69 | 5/5 126.95 | 10.8/10.8 72.8 | 7.7/7.7 117.92 | 5.7/5.7 91.66 | 9.9/9.9 152.65 | 6.5/6.5 93.03 |
| 32768 | 11.5/11.5 75.25 | 30.1/30.1 37.17 | 17.8/17.8 57.57 | 13.6/13.6 94.97 | 5.4/5.4 63.77 | 11.1/11.1 93.31 | 7.6/7.6 125.46 | 14.4/14.4 71.98 | 10.8/10.8 117.84 | 8.7/8.7 90.33 | 17.1/17.1 149.09 | 8.3/8.3 92.13 |
| 65536 | 17.7/17.7 74.51 | - | 29.6/29.6 59.08 | 22.1/22.1 93.36 | 8.5/8.5 65.75 | 17.3/17.3 89.6 | 12.8/12.8 126.88 | 21.6/21.6 71.24 | 17/17 114.87 | 14.8/14.8 89.81 | 31.4/31.4 144.09 | 11.9/11.9 90.56 |
| 131072 | 30.1/30.1 73.51 | 72.6/38.2 2.8 | 54.3/37.9 8.69 | - | 14.6/14.6 65 | 29.7/29.7 88.94 | 23.2/23.2 118.93 | 36/36 68.98 | 29.4/29.4 114.21 | 27.1/27.1 85.17 | 60.2/38.2 16.58 | 19.2/19.2 87.23 |

@robotom commented on GitHub (Nov 14, 2024):

> This is the script I use to gather data:
> bench.sh
>
> This is the key for the tables below:

I am looking to run llama3.1:405B on 4 x H100s. I can run ollama show [model] to find out the max context window. I want to pass some really large PDFs to it for analysis, and I want it to understand them thoroughly. I just don't know how to determine how much text corresponds to that context window, because if I try to upload 2.3 million words (~16.05 million characters) of text, the 70B will reject it, for example, saying the argument list is too long. I'm currently downloading the 405B and hoping it works there. Any advice in general? (Or do I have to start training my own models on my own data?) My preference is that anyone can just come drop a giant file on the model and it can handle it, but perhaps this is unrealistic. Thanks!
