[GH-ISSUE #12542] Is it normal for the CPU to have high utilization while the GPU remains almost idle? #70382

Open
opened 2026-05-04 21:21:48 -05:00 by GiteaMirror · 15 comments

Originally created by @StevenD07 on GitHub (Oct 8, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/12542

What is the issue?

I'm using Ollama as a backend for GraphRAG.
I've noticed that during the process, my CPU usage is consistently near 100%, but the GPU shows almost no activity. Is this expected behavior, or does it suggest a potential configuration issue?

![Image](https://github.com/user-attachments/assets/23d79d96-3e1f-46c3-83a3-d421936648b1)

Here is my GPU; the model is deployed on GPU 3:

![Image](https://github.com/user-attachments/assets/679ac3c5-5681-42c4-85ba-22bdb449085f)

Relevant log output


OS

Linux

GPU

Nvidia

CPU

Intel, Other

Ollama version

No response

GiteaMirror added the bug label 2026-05-04 21:21:48 -05:00

@rick-github commented on GitHub (Oct 8, 2025):

Output of `ollama ps`?


@StevenD07 commented on GitHub (Oct 8, 2025):

Yes. That is the output of `ps aux | grep ollama | grep -v grep`.

To provide more information, here are my settings:

```yaml
### This config file contains required core defaults that must be set, along with a handful of common optional settings.
### For a full list of available settings, see https://microsoft.github.io/graphrag/config/yaml/

### LLM settings ###
## There are a number of settings to tune the threading and token limits for LLM calls - check the docs.

parallelization.num_threads: 4

models:
  default_chat_model:
    type: openai_chat # or azure_openai_chat
    api_base: http://localhost:11435/v1/
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY} # set this in the generated .env file
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: llama3.1:8b
    # deployment_name: <azure_model_deployment_name>
    encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 5 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    request_timeout: 1200
    retry_strategy: native
    max_retries: 10
    tokens_per_minute: auto # set to null to disable rate limiting
    requests_per_minute: auto # set to null to disable rate limiting
  default_embedding_model:
    type: openai_embedding # or azure_openai_embedding
    api_base: http://localhost:11435/v1/
    # api_version: 2024-05-01-preview
    auth_type: api_key # or azure_managed_identity
    api_key: ${GRAPHRAG_API_KEY}
    # audience: "https://cognitiveservices.azure.com/.default"
    # organization: <organization_id>
    model: nomic-embed-text:latest
    # deployment_name: <azure_model_deployment_name>
    encoding_model: cl100k_base # automatically set by tiktoken if left undefined
    model_supports_json: true # recommended if this is available for your model.
    concurrent_requests: 25 # max number of simultaneous LLM requests allowed
    async_mode: threaded # or asyncio
    retry_strategy: native
    max_retries: 10
    tokens_per_minute: null # set to null to disable rate limiting or auto for dynamic
    requests_per_minute: null # set to null to disable rate limiting or auto for dynamic

### Input settings ###

input:
  storage:
    type: file # or blob
    base_dir: "input"
  file_type: text # [csv, text, json]

chunks:
  size: 500
  overlap: 30
  group_by_columns: [id]

### Output/storage settings ###
## If blob storage is specified in the following four sections,
## connection_string and container_name must be provided

output:
  type: file # [file, blob, cosmosdb]
  base_dir: "output"

cache:
  type: file # [file, blob, cosmosdb]
  base_dir: "cache"

reporting:
  type: file # [file, blob]
  base_dir: "logs"

vector_store:
  default_vector_store:
    type: lancedb
    db_uri: output/lancedb
    container_name: default
    overwrite: True

### Workflow settings ###

embed_text:
  model_id: default_embedding_model
  vector_store_id: default_vector_store

extract_graph:
  model_id: default_chat_model
  prompt: "prompts/extract_graph.txt"
  entity_types: [protein, protein name, GO term function, function, organism]
  max_gleanings: 1

summarize_descriptions:
  model_id: default_chat_model
  prompt: "prompts/summarize_descriptions.txt"
  max_length: 500

extract_graph_nlp:
  text_analyzer:
    extractor_type: regex_english # [regex_english, syntactic_parser, cfg]
  async_mode: threaded # or asyncio

cluster_graph:
  max_cluster_size: 10

extract_claims:
  enabled: false
  model_id: default_chat_model
  prompt: "prompts/extract_claims.txt"
  description: "Any claims or facts that could be relevant to information discovery."
  max_gleanings: 1

community_reports:
  enabled: false
  model_id: default_chat_model
  graph_prompt: "prompts/community_report_graph.txt"
  text_prompt: "prompts/community_report_text.txt"
  max_length: 2000
  max_input_length: 8000

embed_graph:
  enabled: false # if true, will generate node2vec embeddings for nodes

umap:
  enabled: false # if true, will generate UMAP embeddings for nodes (embed_graph must also be enabled)

snapshots:
  graphml: false
  embeddings: false

### Query settings ###
## The prompt locations are required here, but each search method has a number of optional knobs that can be tuned.
## See the config docs: https://microsoft.github.io/graphrag/config/yaml/#query

local_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/local_search_system_prompt.txt"

global_search:
  chat_model_id: default_chat_model
  map_prompt: "prompts/global_search_map_system_prompt.txt"
  reduce_prompt: "prompts/global_search_reduce_system_prompt.txt"
  knowledge_prompt: "prompts/global_search_knowledge_system_prompt.txt"

drift_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/drift_search_system_prompt.txt"
  reduce_prompt: "prompts/drift_search_reduce_prompt.txt"

basic_search:
  chat_model_id: default_chat_model
  embedding_model_id: default_embedding_model
  prompt: "prompts/basic_search_system_prompt.txt"
```

@rick-github commented on GitHub (Oct 8, 2025):

Run `ollama ps` and paste the result.


@StevenD07 commented on GitHub (Oct 8, 2025):

I cannot see any output; might this be the key to my problem?

![Image](https://github.com/user-attachments/assets/fc3a70b3-db6e-4f88-84b5-f1d8c0c99e72)

@rick-github commented on GitHub (Oct 8, 2025):

You cut off the bit of the screenshot that would be useful. Run `ollama ps` and paste the output into this thread. A text paste is preferable to a screenshot. If you paste text, wrap it in a markdown code block (``` before and after) to preserve formatting.
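
For example, something like this captures the output as text (the filename is arbitrary):

```shell
# Save the ollama ps output so it can be pasted as text into the thread
ollama ps | tee ollama_ps.txt
```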


@rick-github commented on GitHub (Oct 8, 2025):

[Server logs](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may also help in debugging.
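
On Linux that usually means one of the following, depending on how the server was started (a sketch; the service name assumes a standard install):

```shell
# If Ollama runs as a systemd service:
journalctl -u ollama --no-pager -n 200 > ollama_server.log

# If it was started manually (e.g. inside tmux), capture stderr when launching:
ollama serve 2> ollama_server.log
```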


@rick-github commented on GitHub (Oct 8, 2025):

Why are you running three ollama servers? Paste the `ollama ps` output for the ollama server that is loading the llama3.1:8b model:

OLLAMA_HOST=:11435 ollama -v
OLLAMA_HOST=:11435 ollama ps

@StevenD07 commented on GitHub (Oct 9, 2025):

  1. I am running on port 11436; here is the `ollama ps` output:
     (ollama) dyding@mayura:~/graphrag$ OLLAMA_HOST=:11436 ollama ps
     NAME ID SIZE PROCESSOR UNTIL
     llama3.1:8b 46e0c10c039e 8.6 GB 100% GPU Stopping...

  2. (ollama) dyding@mayura:~/graphrag$ ps aux | grep ollama | grep -v grep
     ppunuru 1785566 0.0 0.0 12483204 195708 ? Sl Oct06 1:32 ollama serve
     dyding 2498699 0.4 0.0 26100 3072 ? Ss Oct07 7:11 tmux new -s ollama_11434
     dyding 2886325 0.2 0.3 21403484 797436 pts/37 Sl+ 00:49 3:05 ollama serve
     dyding 3430812 0.7 0.0 20661924 165548 pts/72 Sl+ 16:41 2:31 ollama serve
     dyding 3444432 96.5 0.4 52493036 1185520 pts/72 Sl+ 16:57 307:36 /net/kihara/home/dyding/ollama/bin/ollama runner --model /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 --ctx-size 16384 --batch-size 512 --n-gpu-layers 33 --threads 24 --parallel 4 --port 34021

  3. (ollama) dyding@mayura:~/graphrag$ ps aux | grep ollama | grep -v grep
     ppunuru 1785566 0.0 0.0 12483204 195708 ? Sl Oct06 1:32 ollama serve
     dyding 2498699 0.4 0.0 26100 4148 ? Ss Oct07 7:11 tmux new -s ollama_11434
     dyding 3430812 0.7 0.0 20661924 165548 pts/72 Sl+ 16:41 2:31 ollama serve
     dyding 3444432 96.5 0.4 52493036 1187824 pts/72 Sl+ 16:57 308:31 /net/kihara/home/dyding/ollama/bin/ollama runner --model /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 --ctx-size 16384 --batch-size 512 --n-gpu-layers 33 --threads 24 --parallel 4 --port 34021

I run the server in tmux so that it won't be disconnected.
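For reference, a detached session for a dedicated instance can be started with something like the following (the session name and port are just examples):

```shell
# Run a separate Ollama server on port 11436 in a detached tmux session
tmux new -d -s ollama_11436 'OLLAMA_HOST=127.0.0.1:11436 ollama serve'
```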


@optivisionlab commented on GitHub (Oct 9, 2025):

I had the same problem with Ollama on a GPU: although the system reported that it recognized the GPU, during execution the CPU was working at almost 100% while the GPU was doing nothing. Does anyone have a solution to this problem?

docker-compose.yml:

version: "3.9"
services:
  ollama:
    container_name: ollama_gpu
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - OLLAMA_LLM_LIBRARY=cuda
      - NVIDIA_VISIBLE_DEVICES=0
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - CUDA_VISIBLE_DEVICES=0
      - LOG_LEVEL=debug
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              # count: all
              device_ids: ['1']
    volumes:
      - ./ollama:/root/.ollama
      - ./models:/models
    ports:
      - "11434:11434"
    logging:
      driver: json-file
      options:
        max-size: "5m"
        max-file: "2"
    restart: unless-stopped
![Image](https://github.com/user-attachments/assets/641f57c4-4312-43fa-883c-043ebdde7416)

Docker log:

time=2025-10-09T06:32:27.298Z level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY:cuda OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-10-09T06:32:27.299Z level=INFO source=images.go:518 msg="total blobs: 6"
time=2025-10-09T06:32:27.299Z level=INFO source=images.go:525 msg="total unused blobs removed: 0"
time=2025-10-09T06:32:27.300Z level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.3)"
time=2025-10-09T06:32:27.300Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-10-09T06:32:28.059Z level=INFO source=types.go:131 msg="inference compute" id=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 library=cuda variant=v12 compute=6.1 driver=11.4 name="Quadro P5000" total="15.9 GiB" available="15.8 GiB"
time=2025-10-09T06:32:28.059Z level=INFO source=routes.go:1569 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB"

Test request with curl:

curl --location 'http://172.18.72.72:11434/api/generate' \
--header 'Content-Type: application/json' \
--data '{ "model": "llama3.2:3b", "prompt": "describe the cat", "stream": false }'

CPU (htop):

![Image](https://github.com/user-attachments/assets/fcff4b59-9db8-4bc6-9bb8-70e3b29fe5a3)

GPU (nvidia-smi -l):

![Image](https://github.com/user-attachments/assets/e66183e6-d0b8-42aa-8f1e-5b052f40cc5e)

CUDA version:

![Image](https://github.com/user-attachments/assets/b75f9e71-a2b4-45ec-a396-f8f388014841)
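
A few quick checks (hypothetical commands, using the container name from the compose file above) to confirm the container can see the GPU and where the model actually ran:

```shell
docker exec -it ollama_gpu nvidia-smi                        # GPU visible inside the container?
docker exec -it ollama_gpu ollama ps                         # PROCESSOR column shows GPU vs CPU placement
docker logs ollama_gpu 2>&1 | grep -i "inference compute"    # what the server detected at startup
```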

@rick-github commented on GitHub (Oct 9, 2025):

Don't set `OLLAMA_LLM_LIBRARY`. A full [server log](https://github.com/ollama/ollama/blob/main/docs/troubleshooting.md#how-to-troubleshoot-issues) may also help in debugging.


@StevenD07 commented on GitHub (Oct 9, 2025):

Here is the log from running `ollama serve`:

ollama serve
time=2025-10-08T22:44:21.869-04:00 level=INFO source=routes.go:1234 msg="server config" env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:INFO OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://127.0.0.1:11436 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/net/kihara/home/dyding/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:4 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2025-10-08T22:44:22.041-04:00 level=INFO source=images.go:479 msg="total blobs: 30"
time=2025-10-08T22:44:22.055-04:00 level=INFO source=images.go:486 msg="total unused blobs removed: 0"
time=2025-10-08T22:44:22.074-04:00 level=INFO source=routes.go:1287 msg="Listening on 127.0.0.1:11436 (version 0.9.0)"
time=2025-10-08T22:44:22.101-04:00 level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
time=2025-10-08T22:44:22.797-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-e78c0530-8c90-da02-5fc0-4ea5f469dcb1 library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA RTX A6000" total="47.5 GiB" available="4.5 GiB"
time=2025-10-08T22:44:22.797-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-4cdedc48-70c5-91c9-6493-51e3f262a283 library=cuda variant=v12 compute=8.6 driver=12.2 name="NVIDIA RTX A6000" total="47.5 GiB" available="3.3 GiB"
time=2025-10-08T22:44:22.797-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-b4c32ffe-4492-3901-98d2-a063fc3f5047 library=cuda variant=v12 compute=7.5 driver=12.2 name="Quadro RTX 8000" total="47.5 GiB" available="47.2 GiB"
time=2025-10-08T22:44:22.798-04:00 level=INFO source=types.go:130 msg="inference compute" id=GPU-ee9d5906-deef-f6ea-0703-3ca87b5983a5 library=cuda variant=v12 compute=7.5 driver=12.2 name="Quadro RTX 8000" total="47.5 GiB" available="19.6 GiB"
time=2025-10-08T22:49:20.601-04:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 gpu=GPU-b4c32ffe-4492-3901-98d2-a063fc3f5047 parallel=4 available=50648973312 required="8.0 GiB"
time=2025-10-08T22:49:21.298-04:00 level=INFO source=server.go:135 msg="system memory" total="250.5 GiB" free="148.7 GiB" free_swap="3.4 GiB"
time=2025-10-08T22:49:21.299-04:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[47.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.0 GiB" memory.required.partial="8.0 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[8.0 GiB]" memory.weights.total="4.3 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.58 GiB (4.89 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-10-08T22:49:21.730-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/net/kihara/home/dyding/ollama/bin/ollama runner --model /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 --ctx-size 16384 --batch-size 512 --n-gpu-layers 33 --threads 24 --parallel 4 --port 37437"
time=2025-10-08T22:49:21.732-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-10-08T22:49:21.732-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-10-08T22:49:21.734-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-10-08T22:49:21.761-04:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /net/kihara/home/dyding/ollama/lib/ollama/libggml-cpu-skylakex.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 8000, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from /net/kihara/home/dyding/ollama/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-10-08T22:49:26.408-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-10-08T22:49:26.409-04:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:37437"
time=2025-10-08T22:49:26.500-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (Quadro RTX 8000) - 48302 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.58 GiB (4.89 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CUDA0 model buffer size = 4403.49 MiB
load_tensors: CPU_Mapped model buffer size = 281.81 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 2.02 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified: CUDA0 KV buffer size = 2048.00 MiB
llama_kv_cache_unified: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: CUDA0 compute buffer size = 1088.00 MiB
llama_context: CUDA_Host compute buffer size = 40.01 MiB
llama_context: graph nodes = 1094
llama_context: graph splits = 2
time=2025-10-08T22:50:50.573-04:00 level=INFO source=server.go:630 msg="llama runner started in 88.84 seconds"
[GIN] 2025/10/08 - 22:50:53 | 200 | 1m35s | 127.0.0.1 | POST "/v1/chat/completions"
time=2025-10-08T22:50:54.620-04:00 level=INFO source=sched.go:548 msg="updated VRAM based on existing loaded models" gpu=GPU-e78c0530-8c90-da02-5fc0-4ea5f469dcb1 library=cuda total="47.5 GiB" available="4.8 GiB"
time=2025-10-08T22:50:54.620-04:00 level=INFO source=sched.go:548 msg="updated VRAM based on existing loaded models" gpu=GPU-4cdedc48-70c5-91c9-6493-51e3f262a283 library=cuda total="47.5 GiB" available="3.3 GiB"
time=2025-10-08T22:50:54.620-04:00 level=INFO source=sched.go:548 msg="updated VRAM based on existing loaded models" gpu=GPU-b4c32ffe-4492-3901-98d2-a063fc3f5047 library=cuda total="47.5 GiB" available="39.5 GiB"
time=2025-10-08T22:50:54.620-04:00 level=INFO source=sched.go:548 msg="updated VRAM based on existing loaded models" gpu=GPU-ee9d5906-deef-f6ea-0703-3ca87b5983a5 library=cuda total="47.5 GiB" available="25.0 GiB"
time=2025-10-08T22:50:54.620-04:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/net/kihara/home/dyding/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 gpu=GPU-b4c32ffe-4492-3901-98d2-a063fc3f5047 parallel=1 available=42362869760 required="809.9 MiB"
time=2025-10-08T22:50:55.723-04:00 level=INFO source=server.go:135 msg="system memory" total="250.5 GiB" free="149.0 GiB" free_swap="3.4 GiB"
time=2025-10-08T22:50:55.724-04:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=13 layers.offload=13 layers.split="" memory.available="[39.5 GiB]" memory.gpu_overhead="0 B" memory.required.full="809.9 MiB" memory.required.partial="809.9 MiB" memory.required.kv="24.0 MiB" memory.required.allocations="[809.9 MiB]" memory.weights.total="260.9 MiB" memory.weights.repeating="216.1 MiB" memory.weights.nonrepeating="44.7 MiB" memory.graph.full="48.0 MiB" memory.graph.partial="48.0 MiB"
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /net/kihara/home/dyding/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 1
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 23: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type f16: 61 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 260.86 MiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
llama_model_load: vocab only - skipping tensors
time=2025-10-08T22:50:55.798-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/net/kihara/home/dyding/ollama/bin/ollama runner --model /net/kihara/home/dyding/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 --ctx-size 8192 --batch-size 512 --n-gpu-layers 13 --threads 24 --parallel 1 --port 40079"
time=2025-10-08T22:50:55.799-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=2
time=2025-10-08T22:50:55.799-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-10-08T22:50:55.799-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-10-08T22:50:55.835-04:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /net/kihara/home/dyding/ollama/lib/ollama/libggml-cpu-skylakex.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 8000, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from /net/kihara/home/dyding/ollama/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-10-08T22:50:55.985-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-10-08T22:50:55.986-04:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:40079"
time=2025-10-08T22:50:56.052-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (Quadro RTX 8000) - 40566 MiB free
llama_model_loader: loaded meta data with 24 key-value pairs and 112 tensors from /net/kihara/home/dyding/.ollama/models/blobs/sha256-970aa74c0a90ef7482477cf803618e776e173c007bf957f635f1015bfcfef0e6 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = nomic-bert
llama_model_loader: - kv 1: general.name str = nomic-embed-text-v1.5
llama_model_loader: - kv 2: nomic-bert.block_count u32 = 12
llama_model_loader: - kv 3: nomic-bert.context_length u32 = 2048
llama_model_loader: - kv 4: nomic-bert.embedding_length u32 = 768
llama_model_loader: - kv 5: nomic-bert.feed_forward_length u32 = 3072
llama_model_loader: - kv 6: nomic-bert.attention.head_count u32 = 12
llama_model_loader: - kv 7: nomic-bert.attention.layer_norm_epsilon f32 = 0.000000
llama_model_loader: - kv 8: general.file_type u32 = 1
llama_model_loader: - kv 9: nomic-bert.attention.causal bool = false
llama_model_loader: - kv 10: nomic-bert.pooling_type u32 = 1
llama_model_loader: - kv 11: nomic-bert.rope.freq_base f32 = 1000.000000
llama_model_loader: - kv 12: tokenizer.ggml.token_type_count u32 = 2
llama_model_loader: - kv 13: tokenizer.ggml.bos_token_id u32 = 101
llama_model_loader: - kv 14: tokenizer.ggml.eos_token_id u32 = 102
llama_model_loader: - kv 15: tokenizer.ggml.model str = bert
llama_model_loader: - kv 16: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "...
llama_model_loader: - kv 17: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.unknown_token_id u32 = 100
llama_model_loader: - kv 20: tokenizer.ggml.seperator_token_id u32 = 102
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 22: tokenizer.ggml.cls_token_id u32 = 101
llama_model_loader: - kv 23: tokenizer.ggml.mask_token_id u32 = 103
llama_model_loader: - type f32: 51 tensors
llama_model_loader: - type f16: 61 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = F16
print_info: file size = 260.86 MiB (16.00 BPW)
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 5
load: token to piece cache size = 0.2032 MB
print_info: arch = nomic-bert
print_info: vocab_only = 0
print_info: n_ctx_train = 2048
print_info: n_embd = 768
print_info: n_layer = 12
print_info: n_head = 12
print_info: n_head_kv = 12
print_info: n_rot = 64
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 64
print_info: n_embd_head_v = 64
print_info: n_gqa = 1
print_info: n_embd_k_gqa = 768
print_info: n_embd_v_gqa = 768
print_info: f_norm_eps = 1.0e-12
print_info: f_norm_rms_eps = 0.0e+00
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 3072
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 0
print_info: pooling type = 1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 1000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 2048
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 137M
print_info: model params = 136.73 M
print_info: general.name = nomic-embed-text-v1.5
print_info: vocab type = WPM
print_info: n_vocab = 30522
print_info: n_merges = 0
print_info: BOS token = 101 '[CLS]'
print_info: EOS token = 102 '[SEP]'
print_info: UNK token = 100 '[UNK]'
print_info: SEP token = 102 '[SEP]'
print_info: PAD token = 0 '[PAD]'
print_info: MASK token = 103 '[MASK]'
print_info: LF token = 0 '[PAD]'
print_info: EOG token = 102 '[SEP]'
print_info: max token length = 21
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 12 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 13/13 layers to GPU
load_tensors: CUDA0 model buffer size = 216.14 MiB
load_tensors: CPU_Mapped model buffer size = 44.72 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 8192
llama_context: n_ctx_per_seq = 8192
llama_context: n_batch = 512
llama_context: n_ubatch = 512
llama_context: causal_attn = 0
llama_context: flash_attn = 0
llama_context: freq_base = 1000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (8192) > n_ctx_train (2048) -- possible training context overflow
llama_context: CUDA_Host output buffer size = 0.00 MiB
time=2025-10-08T22:50:59.817-04:00 level=INFO source=server.go:630 msg="llama runner started in 4.02 seconds"
decode: cannot decode batches with this context (use llama_encode() instead)
[GIN] 2025/10/08 - 22:51:00 | 200 | 6.799472166s | 127.0.0.1 | POST "/v1/embeddings"
[GIN] 2025/10/08 - 22:51:06 | 200 | 1.340028957s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 22:52:06 | 200 | 1.379434213s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 22:53:06 | 200 | 1.389828266s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 22:54:07 | 200 | 1.36709898s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 22:55:07 | 200 | 1.314655571s | 127.0.0.1 | POST "/v1/chat/completions"
time=2025-10-08T23:01:07.835-04:00 level=INFO source=sched.go:788 msg="new model will fit in available VRAM in single GPU, loading" model=/net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 gpu=GPU-b4c32ffe-4492-3901-98d2-a063fc3f5047 parallel=4 available=50648973312 required="8.0 GiB"
time=2025-10-08T23:01:08.593-04:00 level=INFO source=server.go:135 msg="system memory" total="250.5 GiB" free="142.1 GiB" free_swap="3.4 GiB"
time=2025-10-08T23:01:08.594-04:00 level=INFO source=server.go:168 msg=offload library=cuda layers.requested=-1 layers.model=33 layers.offload=33 layers.split="" memory.available="[47.2 GiB]" memory.gpu_overhead="0 B" memory.required.full="8.0 GiB" memory.required.partial="8.0 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[8.0 GiB]" memory.weights.total="4.3 GiB" memory.weights.repeating="3.9 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="1.1 GiB" memory.graph.partial="1.1 GiB"
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.58 GiB (4.89 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 1
print_info: model type = ?B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
llama_model_load: vocab only - skipping tensors
time=2025-10-08T23:01:09.033-04:00 level=INFO source=server.go:431 msg="starting llama server" cmd="/net/kihara/home/dyding/ollama/bin/ollama runner --model /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 --ctx-size 16384 --batch-size 512 --n-gpu-layers 33 --threads 24 --parallel 4 --port 46347"
time=2025-10-08T23:01:09.172-04:00 level=INFO source=sched.go:483 msg="loaded runners" count=1
time=2025-10-08T23:01:09.189-04:00 level=INFO source=server.go:591 msg="waiting for llama runner to start responding"
time=2025-10-08T23:01:09.311-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server not responding"
time=2025-10-08T23:01:09.716-04:00 level=INFO source=runner.go:815 msg="starting go runner"
load_backend: loaded CPU backend from /net/kihara/home/dyding/ollama/lib/ollama/libggml-cpu-skylakex.so
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Quadro RTX 8000, compute capability 7.5, VMM: yes
load_backend: loaded CUDA backend from /net/kihara/home/dyding/ollama/lib/ollama/cuda_v12/libggml-cuda.so
time=2025-10-08T23:01:12.469-04:00 level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
time=2025-10-08T23:01:12.472-04:00 level=INFO source=runner.go:874 msg="Server listening on 127.0.0.1:46347"
time=2025-10-08T23:01:12.574-04:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server loading model"
llama_model_load_from_file_impl: using device CUDA0 (Quadro RTX 8000) - 48302 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from /net/kihara/home/dyding/.ollama/models/blobs/sha256-667b0c1932bc6ffc593ed1d03f895bf2dc8dc6df21db3042284a6f4416b06a29 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: general.file_type u32 = 15
llama_model_loader: - kv 18: llama.vocab_size u32 = 128256
llama_model_loader: - kv 19: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 20: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 21: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 22: tokenizer.ggml.tokens arr[str,128256] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 23: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 24: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 25: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 27: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 28: general.quantization_version u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_K: 193 tensors
llama_model_loader: - type q6_K: 33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 4.58 GiB (4.89 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: CUDA0 model buffer size = 4403.49 MiB
load_tensors: CPU_Mapped model buffer size = 281.81 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 2.02 MiB
llama_kv_cache_unified: kv_size = 16384, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1, padding = 32
llama_kv_cache_unified: CUDA0 KV buffer size = 2048.00 MiB
llama_kv_cache_unified: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: CUDA0 compute buffer size = 1088.00 MiB
llama_context: CUDA_Host compute buffer size = 40.01 MiB
llama_context: graph nodes = 1094
llama_context: graph splits = 2
time=2025-10-08T23:02:29.389-04:00 level=INFO source=server.go:630 msg="llama runner started in 80.21 seconds"
[GIN] 2025/10/08 - 23:06:23 | 200 | 4m17s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:07:27 | 200 | 3m21s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:08:05 | 200 | 4m59s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:08:20 | 200 | 56.089380907s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:10:05 | 200 | 4m59s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:12:30 | 200 | 6.228748253s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:13:22 | 200 | 1m57s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:14:46 | 200 | 4m22s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:17:43 | 200 | 3m18s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:18:26 | 200 | 2m1s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:18:32 | 200 | 7.920812776s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:18:54 | 200 | 3m30s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:21:06 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:24:44 | 200 | 3m20s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:25:27 | 200 | 2m2s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:25:46 | 200 | 1m21s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:27:15 | 200 | 6m9s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:27:51 | 200 | 1m26s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:31:31 | 200 | 4m6s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:31:37 | 200 | 12.156950789s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:31:57 | 200 | 3m32s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:34:10 | 200 | 3m45s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:36:48 | 200 | 3m23s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:38:06 | 200 | 40.664919093s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:39:37 | 200 | 4m11s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:40:16 | 200 | 1m51s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:40:24 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:42:59 | 200 | 2m34s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:45:07 | 200 | 3m41s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:47:38 | 200 | 3m13s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:49:24 | 200 | 1m58s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:51:26 | 200 | 3m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:53:45 | 200 | 2m19s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:55:36 | 200 | 5m10s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:56:11 | 200 | 2m45s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/08 - 23:56:42 | 200 | 5.903962148s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:00:25 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:00:54 | 200 | 3m17s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:01:05 | 200 | 6m39s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:02:23 | 200 | 1m57s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:03:25 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:04:27 | 200 | 2m33s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:04:38 | 200 | 6m1s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:05:01 | 200 | 7.50353454s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:07:18 | 200 | 1m23s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:07:27 | 200 | 4m1s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:10:45 | 200 | 1m51s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:11:00 | 200 | 4m6s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:15:05 | 200 | 10.885801527s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:15:53 | 200 | 2m59s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:20:01 | 200 | 6.3710891s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:22:06 | 200 | 1m11s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:23:14 | 200 | 5m19s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:26:20 | 200 | 4m25s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:26:26 | 200 | 1m31s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:28:47 | 200 | 1m27s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:31:54 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:32:57 | 200 | 4m36s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:34:33 | 200 | 2m38s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:35:35 | 200 | 1m38s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:36:23 | 200 | 7m2s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:36:54 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:38:28 | 200 | 1m33s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:38:49 | 200 | 2m51s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:38:59 | 200 | 4m2s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:39:50 | 200 | 1m52s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:42:55 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:44:30 | 200 | 1m32s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:44:36 | 200 | 5m39s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:45:49 | 200 | 2m54s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:46:11 | 200 | 2m14s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:48:39 | 200 | 6m41s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:49:01 | 200 | 3.990042407s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:49:06 | 200 | 2m8s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:51:26 | 200 | 29.180774341s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:51:59 | 200 | 2m1s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:55:51 | 200 | 2m53s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:56:02 | 200 | 4.046557167s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:58:52 | 200 | 3m54s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 00:59:42 | 200 | 1m44s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:02:15 | 200 | 1m17s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:04:04 | 200 | 1m5s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:05:02 | 200 | 3.783215871s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:07:43 | 200 | 1m45s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:09:56 | 200 | 2m58s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:10:35 | 200 | 2m37s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:12:33 | 200 | 3m35s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:12:58 | 200 | 1m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:15:05 | 200 | 1m7s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:17:49 | 200 | 1m50s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:18:37 | 200 | 38.816895579s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:19:58 | 500 | 20m0s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:22:43 | 200 | 3m44s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:23:45 | 200 | 3m46s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:26:19 | 200 | 6m21s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:26:53 | 200 | 54.208551356s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:27:11 | 200 | 4m12s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:27:44 | 200 | 5m45s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:28:28 | 200 | 1m29s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:32:06 | 200 | 6.866033285s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:33:18 | 200 | 4m19s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:34:43 | 200 | 1m44s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:34:43 | 200 | 44.475846784s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:35:03 | 200 | 4.275677369s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:38:13 | 200 | 2m14s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:41:21 | 200 | 21.784501563s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:42:09 | 200 | 2m10s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:42:19 | 200 | 3m19s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:43:44 | 200 | 1m44s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:45:42 | 200 | 1m42s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:48:59 | 200 | 1m59s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:50:11 | 200 | 1m11s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:55:31 | 200 | 3m31s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:56:14 | 200 | 3m14s | 127.0.0.1 | POST "/v1/chat/completions"
[GIN] 2025/10/09 - 01:56:37 | 200 | 2m37s | 127.0.0.1 | POST "/v1/chat/completions"

POST "/v1/chat/completions" [GIN] 2025/10/09 - 01:48:59 | 200 | 1m59s | 127.0.0.1 | POST "/v1/chat/completions" [GIN] 2025/10/09 - 01:50:11 | 200 | 1m11s | 127.0.0.1 | POST "/v1/chat/completions" [GIN] 2025/10/09 - 01:55:31 | 200 | 3m31s | 127.0.0.1 | POST "/v1/chat/completions" [GIN] 2025/10/09 - 01:56:14 | 200 | 3m14s | 127.0.0.1 | POST "/v1/chat/completions" [GIN] 2025/10/09 - 01:56:37 | 200 | 2m37s | 127.0.0.1 | POST "/v1/chat/completions"
Author
Owner

@optivisionlab commented on GitHub (Oct 14, 2025):

@rick-github here are the logs from the following curl request:

curl --location '<domain>/api/generate' \
--header 'Content-Type: application/json' \
--data '{ "model": "gemma3:4b", "prompt": "describe the cat", "stream": false }'

Logs:

Attaching to ollama_gpu
ollama_gpu | time=2025-10-14T02:49:33.777Z level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
ollama_gpu | time=2025-10-14T02:49:33.778Z level=INFO source=images.go:518 msg="total blobs: 16"
ollama_gpu | time=2025-10-14T02:49:33.779Z level=INFO source=images.go:525 msg="total unused blobs removed: 0"
ollama_gpu | time=2025-10-14T02:49:33.779Z level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.3)"
ollama_gpu | time=2025-10-14T02:49:33.779Z level=DEBUG source=sched.go:121 msg="starting llm scheduler"
ollama_gpu | time=2025-10-14T02:49:33.779Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs"
ollama_gpu | time=2025-10-14T02:49:33.787Z level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA"
ollama_gpu | time=2025-10-14T02:49:33.787Z level=DEBUG source=gpu.go:520 msg="Searching for GPU library" name=libcuda.so*
ollama_gpu | time=2025-10-14T02:49:33.787Z level=DEBUG source=gpu.go:544 msg="gpu library search" globs="[/usr/lib/ollama/libcuda.so* /usr/local/nvidia/lib/libcuda.so* /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]"
ollama_gpu | time=2025-10-14T02:49:33.788Z level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03]
ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03
ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0
ollama_gpu | dlsym: cuDriverGetVersion - 0x7f6b76a399a0
ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940
ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970
ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0
ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0
ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910
ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520
ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90
ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60
ollama_gpu | calling cuInit
ollama_gpu | calling cuDriverGetVersion
ollama_gpu | raw version 0x2b20
ollama_gpu | CUDA driver version: 11.4
ollama_gpu | calling cuDeviceGetCount
ollama_gpu | device count 1
ollama_gpu | time=2025-10-14T02:49:33.809Z level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03
ollama_gpu | [GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232] CUDA totalMem 16278mb
ollama_gpu | [GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232] CUDA freeMem 16173mb
ollama_gpu | [GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232] Compute Capability 6.1
ollama_gpu | time=2025-10-14T02:49:33.932Z level=DEBUG source=amd_linux.go:423 msg="amdgpu driver not detected /sys/module/amdgpu"
ollama_gpu | releasing cuda driver library
ollama_gpu | time=2025-10-14T02:49:33.932Z level=INFO source=types.go:131 msg="inference compute" id=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 library=cuda variant=v12 compute=6.1 driver=11.4 name="Quadro P5000" total="15.9 GiB" available="15.8 GiB"
ollama_gpu | time=2025-10-14T02:49:33.932Z level=INFO source=routes.go:1569 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB"
ollama_gpu | time=2025-10-14T02:49:39.736Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="62.8 GiB" before.free="59.7 GiB" before.free_swap="976.0 MiB" now.total="62.8 GiB" now.free="59.7 GiB" now.free_swap="976.0 MiB"
ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03
ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0
ollama_gpu | dlsym: cuDriverGetVersion - 0x7f6b76a399a0
ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940
ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970
ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0
ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0
ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910
ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520
ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90
ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60
ollama_gpu | calling cuInit
ollama_gpu | calling cuDriverGetVersion
ollama_gpu | raw version 0x2b20
ollama_gpu | CUDA driver version: 11.4
ollama_gpu | calling cuDeviceGetCount
ollama_gpu | device count 1
ollama_gpu | time=2025-10-14T02:49:39.861Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 name="Quadro P5000" overhead="0 B" before.total="15.9 GiB" before.free="15.8 GiB" now.total="15.9 GiB" now.free="15.8 GiB" now.used="105.1 MiB"
ollama_gpu | releasing cuda driver library
ollama_gpu | time=2025-10-14T02:49:39.861Z level=DEBUG source=sched.go:188 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1
ollama_gpu | time=2025-10-14T02:49:39.915Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
ollama_gpu | time=2025-10-14T02:49:39.917Z level=DEBUG source=sched.go:208 msg="loading first model" model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25
ollama_gpu | time=2025-10-14T02:49:40.165Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
ollama_gpu | time=2025-10-14T02:49:40.167Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0
ollama_gpu | time=2025-10-14T02:49:40.167Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
ollama_gpu | time=2025-10-14T02:49:40.167Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000
ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06
ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256
ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="62.8 GiB" before.free="59.7 GiB" before.free_swap="976.0 MiB" now.total="62.8 GiB" now.free="59.7 GiB" now.free_swap="976.0 MiB"
ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03
ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0
ollama_gpu | dlsym: cuDriverGetVersion - 0x7f6b76a399a0
ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940
ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970
ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0
ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0
ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910
ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520
ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90
ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60
ollama_gpu | calling cuInit
ollama_gpu | calling cuDriverGetVersion
ollama_gpu | raw version 0x2b20
ollama_gpu | CUDA driver version: 11.4
ollama_gpu | calling cuDeviceGetCount
ollama_gpu | device count 1
ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 name="Quadro P5000" overhead="0 B" before.total="15.9 GiB" before.free="15.8 GiB" now.total="15.9 GiB" now.free="15.8 GiB" now.used="105.1 MiB"
ollama_gpu | releasing cuda driver library
ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=server.go:324 msg="adding gpu library" path=/usr/lib/ollama/cuda_v12
ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=server.go:332 msg="adding gpu dependency paths" paths=[/usr/lib/ollama/cuda_v12]
ollama_gpu | time=2025-10-14T02:49:40.270Z level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --port 41975"
ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=server.go:400 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CUDA_VISIBLE_DEVICES=0 OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/usr/lib/ollama/cuda_v12:/usr/lib/ollama/cuda_v12:/usr/lib/ollama:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/ollama OLLAMA_HOST=0.0.0.0:11434 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12
ollama_gpu | time=2025-10-14T02:49:40.271Z level=INFO source=server.go:672 msg="loading model" "model layers"=35 requested=-1
ollama_gpu | time=2025-10-14T02:49:40.271Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="62.8 GiB" before.free="59.7 GiB" before.free_swap="976.0 MiB" now.total="62.8 GiB" now.free="59.6 GiB" now.free_swap="976.0 MiB"
ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03
ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0
ollama_gpu | dlsym: cuDriverGetVersion - 0x7f6b76a399a0
ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940
ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970
ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0
ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0
ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910
ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520
ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90
ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60
ollama_gpu | calling cuInit
ollama_gpu | calling cuDriverGetVersion
ollama_gpu | raw version 0x2b20
ollama_gpu | CUDA driver version: 11.4
ollama_gpu | calling cuDeviceGetCount
ollama_gpu | device count 1
ollama_gpu | time=2025-10-14T02:49:40.296Z level=INFO source=runner.go:1252 msg="starting ollama engine"
ollama_gpu | time=2025-10-14T02:49:40.296Z level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:41975"
ollama_gpu | time=2025-10-14T02:49:40.363Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 name="Quadro P5000" overhead="0 B" before.total="15.9 GiB" before.free="15.8 GiB" now.total="15.9 GiB" now.free="15.8 GiB" now.used="105.1 MiB"
ollama_gpu | releasing cuda driver library
ollama_gpu | time=2025-10-14T02:49:40.363Z level=INFO source=server.go:678 msg="system memory" total="62.8 GiB" free="59.6 GiB" free_swap="976.0 MiB"
ollama_gpu | time=2025-10-14T02:49:40.363Z level=INFO source=server.go:686 msg="gpu memory" id=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 available="15.3 GiB" free="15.8 GiB" minimum="457.0 MiB" overhead="0 B"
ollama_gpu | time=2025-10-14T02:49:40.364Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:35[ID:GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 Layers:35(0..34)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_gpu | time=2025-10-14T02:49:40.492Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
ollama_gpu | time=2025-10-14T02:49:40.494Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.name default=""
ollama_gpu | time=2025-10-14T02:49:40.494Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.description default=""
ollama_gpu | time=2025-10-14T02:49:40.494Z level=INFO source=ggml.go:131 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36
ollama_gpu | time=2025-10-14T02:49:40.494Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama
ollama_gpu | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so
ollama_gpu | time=2025-10-14T02:49:40.522Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v12
ollama_gpu | ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version
ollama_gpu | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
ollama_gpu | time=2025-10-14T02:49:40.592Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc)
ollama_gpu | time=2025-10-14T02:49:40.601Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0
ollama_gpu | time=2025-10-14T02:49:40.601Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
ollama_gpu | time=2025-10-14T02:49:40.601Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000
ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06
ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256
ollama_gpu | time=2025-10-14T02:49:40.955Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=972 splits=1
ollama_gpu | time=2025-10-14T02:49:40.957Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1471 splits=1
ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB"
ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB"
ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB"
ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:342 msg="total memory" size="4.9 GiB"
ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=550502400U required.CPU.Weights="[60561408U 60561408U 60561408U 60561408U 53803008U 53803008U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 59885568U 60561408U 59885568U 59885568U 60561408U 1390946752U]" required.CPU.Cache="[6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 0U]" required.CPU.Graph=1174011904U
ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=server.go:969 msg="insufficient VRAM to load any model layers"
ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=server.go:728 msg="new layout created" layers=[]
ollama_gpu | time=2025-10-14T02:49:40.959Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_gpu | time=2025-10-14T02:49:41.076Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
ollama_gpu | time=2025-10-14T02:49:41.093Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0
ollama_gpu | time=2025-10-14T02:49:41.093Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
ollama_gpu | time=2025-10-14T02:49:41.093Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
ollama_gpu | time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
ollama_gpu | time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000
ollama_gpu | time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06
ollama_gpu | time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256
ollama_gpu | time=2025-10-14T02:49:41.388Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=972 splits=1
ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1471 splits=1
ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB"
ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB"
ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB"
ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=backend.go:342 msg="total memory" size="4.9 GiB"
ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=550502400U required.CPU.Weights="[60561408U 60561408U 60561408U 60561408U 53803008U 53803008U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 59885568U 60561408U 59885568U 59885568U 60561408U 1390946752U]" required.CPU.Cache="[6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 0U]" required.CPU.Graph=1174011904U
ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=server.go:969 msg="insufficient VRAM to load any model layers"
ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=server.go:728 msg="new layout created" layers=[]
ollama_gpu | time=2025-10-14T02:49:41.392Z level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_gpu | time=2025-10-14T02:49:41.494Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32
ollama_gpu | time=2025-10-14T02:49:41.505Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0
ollama_gpu | time=2025-10-14T02:49:41.505Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106
ollama_gpu | time=2025-10-14T02:49:41.505Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}"
ollama_gpu | time=2025-10-14T02:49:41.518Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07
ollama_gpu | time=2025-10-14T02:49:41.518Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000
ollama_gpu | time=2025-10-14T02:49:41.519Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06
ollama_gpu | time=2025-10-14T02:49:41.519Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256
ollama_gpu | time=2025-10-14T02:49:41.797Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=972 splits=1
ollama_gpu | time=2025-10-14T02:49:41.933Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1471 splits=1
ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB"
ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB"
ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB"
ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:342 msg="total memory" size="4.9 GiB"
ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=550502400A required.CPU.Weights="[60561408A 60561408A 60561408A 60561408A 53803008A 53803008A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 59885568A 60561408A 59885568A 59885568A 60561408A 1390946752A]" required.CPU.Cache="[6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 0U]" required.CPU.Graph=1174011904A
ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=server.go:969 msg="insufficient VRAM to load any model layers"
ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=server.go:728 msg="new layout created" layers=[]
ollama_gpu | time=2025-10-14T02:49:41.935Z level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=ggml.go:491 msg="offloading output layer to CPU"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=ggml.go:498 msg="offloaded 0/35 layers to GPU"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:342 msg="total memory" size="4.9 GiB"
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=sched.go:470 msg="loaded runners" count=1
ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=server.go:1251 msg="waiting for llama runner to start responding"
ollama_gpu | time=2025-10-14T02:49:41.937Z level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model"
ollama_gpu | time=2025-10-14T02:49:41.937Z level=DEBUG source=server.go:1295 msg="model load progress 0.00"
ollama_gpu | time=2025-10-14T02:49:42.188Z level=DEBUG source=server.go:1295 msg="model load progress 0.85"
ollama_gpu | time=2025-10-14T02:49:42.439Z level=DEBUG source=server.go:1295 msg="model load progress 0.91"
ollama_gpu | time=2025-10-14T02:49:42.690Z level=DEBUG source=server.go:1295 msg="model load progress 0.98"
ollama_gpu | time=2025-10-14T02:49:42.779Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0
ollama_gpu | time=2025-10-14T02:49:42.941Z level=INFO source=server.go:1289 msg="llama runner started in 2.67 seconds"
ollama_gpu | time=2025-10-14T02:49:42.941Z level=DEBUG source=sched.go:482 msg="finished setting up" runner.name=registry.ollama.ai/library/gemma3:4b runner.inference=cuda runner.devices=1 runner.size="4.9 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=34 runner.model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 runner.num_ctx=4096
ollama_gpu | time=2025-10-14T02:49:42.941Z level=DEBUG source=server.go:1388 msg="completion request" images=0 prompt=71 format=""
ollama_gpu | time=2025-10-14T02:49:42.983Z level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=2
ollama_gpu | time=2025-10-14T02:49:42.983Z level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=12 used=0 remaining=12
ollama_gpu | time=2025-10-14T02:49:58.199Z level=DEBUG source=sched.go:490 msg="context for request finished"
ollama_gpu | [GIN] 2025/10/14 - 02:49:58 | 200 |  18.75975421s |  192.168.140.54 | POST     "/api/generate"
ollama_gpu | time=2025-10-14T02:49:58.199Z level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gemma3:4b runner.inference=cuda runner.devices=1 runner.size="4.9 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=34 runner.model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 runner.num_ctx=4096 duration=5m0s
ollama_gpu | time=2025-10-14T02:49:58.199Z level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gemma3:4b runner.inference=cuda runner.devices=1 runner.size="4.9 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=34 runner.model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 runner.num_ctx=4096 refCount=0
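The important lines above are `ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version` and `offloaded 0/35 layers to GPU`: the bundled `cuda_v12` backend cannot initialize against the 470.182.03 driver (CUDA 11.4), so every layer falls back to the CPU, which would explain high CPU usage with an idle GPU. As a quick cross-check (a minimal sketch, assuming the server is reachable on the published port 11434), the running-models endpoint should report how much of the loaded model is resident in VRAM:

```shell
# If size_vram is 0 in the response (or `ollama ps` shows "100% CPU"),
# nothing was offloaded and inference is running entirely on the CPU.
curl http://localhost:11434/api/ps
```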
<!-- gh-comment-id:3399918557 --> @optivisionlab commented on GitHub (Oct 14, 2025): @rick-github here is message logs from request curl ``` curl --location '<domain>/api/generate' \ --header 'Content-Type: application/json' \ --data '{ "model": "gemma3:4b", "prompt": "describe the cat", "stream": false }' ``` Logs: ``` Attaching to ollama_gpu ollama_gpu | time=2025-10-14T02:49:33.777Z level=INFO source=routes.go:1475 msg="server config" env="map[CUDA_VISIBLE_DEVICES:0 GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION: HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_CONTEXT_LENGTH:4096 OLLAMA_DEBUG:DEBUG OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_KV_CACHE_TYPE: OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NEW_ENGINE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:1 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://* vscode-webview://* vscode-file://*] OLLAMA_REMOTES:[ollama.com] OLLAMA_SCHED_SPREAD:false ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]" ollama_gpu | time=2025-10-14T02:49:33.778Z level=INFO source=images.go:518 msg="total blobs: 16" ollama_gpu | time=2025-10-14T02:49:33.779Z level=INFO source=images.go:525 msg="total unused blobs removed: 0" ollama_gpu | time=2025-10-14T02:49:33.779Z level=INFO source=routes.go:1528 msg="Listening on [::]:11434 (version 0.12.3)" ollama_gpu | time=2025-10-14T02:49:33.779Z level=DEBUG source=sched.go:121 msg="starting llm scheduler" ollama_gpu | time=2025-10-14T02:49:33.779Z level=INFO source=gpu.go:217 msg="looking for compatible GPUs" ollama_gpu | time=2025-10-14T02:49:33.787Z level=DEBUG source=gpu.go:98 msg="searching for GPU discovery libraries for NVIDIA" ollama_gpu | time=2025-10-14T02:49:33.787Z level=DEBUG source=gpu.go:520 msg="Searching for GPU library" name=libcuda.so* ollama_gpu | time=2025-10-14T02:49:33.787Z level=DEBUG source=gpu.go:544 msg="gpu library search" globs="[/usr/lib/ollama/libcuda.so* /usr/local/nvidia/lib/libcuda.so* /usr/local/nvidia/lib64/libcuda.so* /usr/local/cuda*/targets/*/lib/libcuda.so* /usr/lib/*-linux-gnu/nvidia/current/libcuda.so* /usr/lib/*-linux-gnu/libcuda.so* /usr/lib/wsl/lib/libcuda.so* /usr/lib/wsl/drivers/*/libcuda.so* /opt/cuda/lib*/libcuda.so* /usr/local/cuda/lib*/libcuda.so* /usr/lib*/libcuda.so* /usr/local/lib*/libcuda.so*]" ollama_gpu | time=2025-10-14T02:49:33.788Z level=DEBUG source=gpu.go:577 msg="discovered GPU libraries" paths=[/usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03] ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03 ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0 ollama_gpu | dlsym: cuDriverGetVersion - 0x7f6b76a399a0 ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940 ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970 ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0 ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0 ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910 ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520 ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90 ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60 ollama_gpu | calling cuInit ollama_gpu | calling cuDriverGetVersion 
ollama_gpu | raw version 0x2b20 ollama_gpu | CUDA driver version: 11.4 ollama_gpu | calling cuDeviceGetCount ollama_gpu | device count 1 ollama_gpu | time=2025-10-14T02:49:33.809Z level=DEBUG source=gpu.go:125 msg="detected GPUs" count=1 library=/usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03 ollama_gpu | [GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232] CUDA totalMem 16278mb ollama_gpu | [GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232] CUDA freeMem 16173mb ollama_gpu | [GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232] Compute Capability 6.1 ollama_gpu | time=2025-10-14T02:49:33.932Z level=DEBUG source=amd_linux.go:423 msg="amdgpu driver not detected /sys/module/amdgpu" ollama_gpu | releasing cuda driver library ollama_gpu | time=2025-10-14T02:49:33.932Z level=INFO source=types.go:131 msg="inference compute" id=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 library=cuda variant=v12 compute=6.1 driver=11.4 name="Quadro P5000" total="15.9 GiB" available="15.8 GiB" ollama_gpu | time=2025-10-14T02:49:33.932Z level=INFO source=routes.go:1569 msg="entering low vram mode" "total vram"="15.9 GiB" threshold="20.0 GiB" ollama_gpu | time=2025-10-14T02:49:39.736Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="62.8 GiB" before.free="59.7 GiB" before.free_swap="976.0 MiB" now.total="62.8 GiB" now.free="59.7 GiB" now.free_swap="976.0 MiB" ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03 ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0 ollama_gpu | dlsym: cuDriverGetVersion - 0x7f6b76a399a0 ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940 ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970 ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0 ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0 ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910 ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520 ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90 ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60 ollama_gpu | calling cuInit ollama_gpu | calling cuDriverGetVersion ollama_gpu | raw version 0x2b20 ollama_gpu | CUDA driver version: 11.4 ollama_gpu | calling cuDeviceGetCount ollama_gpu | device count 1 ollama_gpu | time=2025-10-14T02:49:39.861Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 name="Quadro P5000" overhead="0 B" before.total="15.9 GiB" before.free="15.8 GiB" now.total="15.9 GiB" now.free="15.8 GiB" now.used="105.1 MiB" ollama_gpu | releasing cuda driver library ollama_gpu | time=2025-10-14T02:49:39.861Z level=DEBUG source=sched.go:188 msg="updating default concurrency" OLLAMA_MAX_LOADED_MODELS=3 gpu_count=1 ollama_gpu | time=2025-10-14T02:49:39.915Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 ollama_gpu | time=2025-10-14T02:49:39.917Z level=DEBUG source=sched.go:208 msg="loading first model" model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 ollama_gpu | time=2025-10-14T02:49:40.165Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 ollama_gpu | time=2025-10-14T02:49:40.167Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0 ollama_gpu | time=2025-10-14T02:49:40.167Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106 ollama_gpu | time=2025-10-14T02:49:40.167Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids 
default="&{size:0 values:[]}" ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000 ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06 ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256 ollama_gpu | time=2025-10-14T02:49:40.174Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="62.8 GiB" before.free="59.7 GiB" before.free_swap="976.0 MiB" now.total="62.8 GiB" now.free="59.7 GiB" now.free_swap="976.0 MiB" ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03 ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0 ollama_gpu | dlsym: cuDriverGetVersion - 0x7f6b76a399a0 ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940 ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970 ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0 ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0 ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910 ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520 ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90 ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60 ollama_gpu | calling cuInit ollama_gpu | calling cuDriverGetVersion ollama_gpu | raw version 0x2b20 ollama_gpu | CUDA driver version: 11.4 ollama_gpu | calling cuDeviceGetCount ollama_gpu | device count 1 ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 name="Quadro P5000" overhead="0 B" before.total="15.9 GiB" before.free="15.8 GiB" now.total="15.9 GiB" now.free="15.8 GiB" now.used="105.1 MiB" ollama_gpu | releasing cuda driver library ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=server.go:324 msg="adding gpu library" path=/usr/lib/ollama/cuda_v12 ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=server.go:332 msg="adding gpu dependency paths" paths=[/usr/lib/ollama/cuda_v12] ollama_gpu | time=2025-10-14T02:49:40.270Z level=INFO source=server.go:399 msg="starting runner" cmd="/usr/bin/ollama runner --ollama-engine --model /root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 --port 41975" ollama_gpu | time=2025-10-14T02:49:40.270Z level=DEBUG source=server.go:400 msg=subprocess PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin CUDA_VISIBLE_DEVICES=0 OLLAMA_DEBUG=1 LD_LIBRARY_PATH=/usr/lib/ollama/cuda_v12:/usr/lib/ollama/cuda_v12:/usr/lib/ollama:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/lib/ollama OLLAMA_HOST=0.0.0.0:11434 OLLAMA_MAX_LOADED_MODELS=3 OLLAMA_LIBRARY_PATH=/usr/lib/ollama:/usr/lib/ollama/cuda_v12 ollama_gpu | time=2025-10-14T02:49:40.271Z level=INFO source=server.go:672 msg="loading model" "model layers"=35 requested=-1 ollama_gpu | time=2025-10-14T02:49:40.271Z level=DEBUG source=gpu.go:410 msg="updating system memory data" before.total="62.8 GiB" before.free="59.7 GiB" before.free_swap="976.0 MiB" now.total="62.8 GiB" now.free="59.6 GiB" now.free_swap="976.0 MiB" ollama_gpu | initializing /usr/lib/x86_64-linux-gnu/libcuda.so.470.182.03 ollama_gpu | dlsym: cuInit - 0x7f6b76a399d0 ollama_gpu | dlsym: 
cuDriverGetVersion - 0x7f6b76a399a0 ollama_gpu | dlsym: cuDeviceGetCount - 0x7f6b76a39940 ollama_gpu | dlsym: cuDeviceGet - 0x7f6b76a39970 ollama_gpu | dlsym: cuDeviceGetAttribute - 0x7f6b76a397f0 ollama_gpu | dlsym: cuDeviceGetUuid - 0x7f6b76a398e0 ollama_gpu | dlsym: cuDeviceGetName - 0x7f6b76a39910 ollama_gpu | dlsym: cuCtxCreate_v3 - 0x7f6b76a39520 ollama_gpu | dlsym: cuMemGetInfo_v2 - 0x7f6b76a38e90 ollama_gpu | dlsym: cuCtxDestroy - 0x7f6b76a5db60 ollama_gpu | calling cuInit ollama_gpu | calling cuDriverGetVersion ollama_gpu | raw version 0x2b20 ollama_gpu | CUDA driver version: 11.4 ollama_gpu | calling cuDeviceGetCount ollama_gpu | device count 1 ollama_gpu | time=2025-10-14T02:49:40.296Z level=INFO source=runner.go:1252 msg="starting ollama engine" ollama_gpu | time=2025-10-14T02:49:40.296Z level=INFO source=runner.go:1287 msg="Server listening on 127.0.0.1:41975" ollama_gpu | time=2025-10-14T02:49:40.363Z level=DEBUG source=gpu.go:460 msg="updating cuda memory data" gpu=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 name="Quadro P5000" overhead="0 B" before.total="15.9 GiB" before.free="15.8 GiB" now.total="15.9 GiB" now.free="15.8 GiB" now.used="105.1 MiB" ollama_gpu | releasing cuda driver library ollama_gpu | time=2025-10-14T02:49:40.363Z level=INFO source=server.go:678 msg="system memory" total="62.8 GiB" free="59.6 GiB" free_swap="976.0 MiB" ollama_gpu | time=2025-10-14T02:49:40.363Z level=INFO source=server.go:686 msg="gpu memory" id=GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 available="15.3 GiB" free="15.8 GiB" minimum="457.0 MiB" overhead="0 B" ollama_gpu | time=2025-10-14T02:49:40.364Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:35[ID:GPU-0a2dd432-b6b7-4f79-f9c6-01489d497232 Layers:35(0..34)] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" ollama_gpu | time=2025-10-14T02:49:40.492Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 ollama_gpu | time=2025-10-14T02:49:40.494Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.name default="" ollama_gpu | time=2025-10-14T02:49:40.494Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.description default="" ollama_gpu | time=2025-10-14T02:49:40.494Z level=INFO source=ggml.go:131 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=36 ollama_gpu | time=2025-10-14T02:49:40.494Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama ollama_gpu | load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-haswell.so ollama_gpu | time=2025-10-14T02:49:40.522Z level=DEBUG source=ggml.go:94 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v12 ollama_gpu | ggml_cuda_init: failed to initialize CUDA: CUDA driver version is insufficient for CUDA runtime version ollama_gpu | load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so ollama_gpu | time=2025-10-14T02:49:40.592Z level=INFO source=ggml.go:104 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.BMI2=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 compiler=cgo(gcc) ollama_gpu | time=2025-10-14T02:49:40.601Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0 ollama_gpu | time=2025-10-14T02:49:40.601Z level=DEBUG source=ggml.go:276 msg="key with type not found" 
key=tokenizer.ggml.eot_token_id default=106 ollama_gpu | time=2025-10-14T02:49:40.601Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000 ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06 ollama_gpu | time=2025-10-14T02:49:40.607Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256 ollama_gpu | time=2025-10-14T02:49:40.955Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=972 splits=1 ollama_gpu | time=2025-10-14T02:49:40.957Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1471 splits=1 ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB" ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB" ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB" ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=backend.go:342 msg="total memory" size="4.9 GiB" ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=550502400U required.CPU.Weights="[60561408U 60561408U 60561408U 60561408U 53803008U 53803008U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 59885568U 60561408U 59885568U 59885568U 60561408U 1390946752U]" required.CPU.Cache="[6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 0U]" required.CPU.Graph=1174011904U ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=server.go:969 msg="insufficient VRAM to load any model layers" ollama_gpu | time=2025-10-14T02:49:40.958Z level=DEBUG source=server.go:728 msg="new layout created" layers=[] ollama_gpu | time=2025-10-14T02:49:40.959Z level=INFO source=runner.go:1171 msg=load request="{Operation:fit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" ollama_gpu | time=2025-10-14T02:49:41.076Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 ollama_gpu | time=2025-10-14T02:49:41.093Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0 ollama_gpu | time=2025-10-14T02:49:41.093Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106 ollama_gpu | time=2025-10-14T02:49:41.093Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" ollama_gpu | 
time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 ollama_gpu | time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000 ollama_gpu | time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06 ollama_gpu | time=2025-10-14T02:49:41.100Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256 ollama_gpu | time=2025-10-14T02:49:41.388Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=972 splits=1 ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1471 splits=1 ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB" ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB" ollama_gpu | time=2025-10-14T02:49:41.391Z level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB" ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=backend.go:342 msg="total memory" size="4.9 GiB" ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=550502400U required.CPU.Weights="[60561408U 60561408U 60561408U 60561408U 53803008U 53803008U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 53127168U 60561408U 53127168U 59885568U 60561408U 59885568U 59885568U 60561408U 1390946752U]" required.CPU.Cache="[6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 6291456U 16777216U 6291456U 6291456U 6291456U 6291456U 0U]" required.CPU.Graph=1174011904U ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=server.go:969 msg="insufficient VRAM to load any model layers" ollama_gpu | time=2025-10-14T02:49:41.392Z level=DEBUG source=server.go:728 msg="new layout created" layers=[] ollama_gpu | time=2025-10-14T02:49:41.392Z level=INFO source=runner.go:1171 msg=load request="{Operation:alloc LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" ollama_gpu | time=2025-10-14T02:49:41.494Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=general.alignment default=32 ollama_gpu | time=2025-10-14T02:49:41.505Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0 ollama_gpu | time=2025-10-14T02:49:41.505Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eot_token_id default=106 ollama_gpu | time=2025-10-14T02:49:41.505Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=tokenizer.ggml.eos_token_ids default="&{size:0 values:[]}" ollama_gpu | time=2025-10-14T02:49:41.518Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.attention.layer_norm_rms_epsilon default=9.999999974752427e-07 ollama_gpu | time=2025-10-14T02:49:41.518Z level=DEBUG 
source=ggml.go:276 msg="key with type not found" key=gemma3.rope.local.freq_base default=10000 ollama_gpu | time=2025-10-14T02:49:41.519Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.rope.global.freq_base default=1e+06 ollama_gpu | time=2025-10-14T02:49:41.519Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.mm_tokens_per_image default=256 ollama_gpu | time=2025-10-14T02:49:41.797Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=972 splits=1 ollama_gpu | time=2025-10-14T02:49:41.933Z level=DEBUG source=ggml.go:794 msg="compute graph" nodes=1471 splits=1 ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB" ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB" ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB" ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=backend.go:342 msg="total memory" size="4.9 GiB" ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=server.go:717 msg=memory success=true required.InputWeights=550502400A required.CPU.Weights="[60561408A 60561408A 60561408A 60561408A 53803008A 53803008A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 53127168A 60561408A 53127168A 59885568A 60561408A 59885568A 59885568A 60561408A 1390946752A]" required.CPU.Cache="[6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 6291456A 16777216A 6291456A 6291456A 6291456A 6291456A 0U]" required.CPU.Graph=1174011904A ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=server.go:969 msg="insufficient VRAM to load any model layers" ollama_gpu | time=2025-10-14T02:49:41.935Z level=DEBUG source=server.go:728 msg="new layout created" layers=[] ollama_gpu | time=2025-10-14T02:49:41.935Z level=INFO source=runner.go:1171 msg=load request="{Operation:commit LoraPath:[] Parallel:1 BatchSize:512 FlashAttention:false KvSize:4096 KvCacheType: NumThreads:20 GPULayers:[] MultiUserCache:false ProjectorPath: MainGPU:0 UseMmap:false}" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=ggml.go:487 msg="offloading 0 repeating layers to GPU" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=ggml.go:491 msg="offloading output layer to CPU" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=ggml.go:498 msg="offloaded 0/35 layers to GPU" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:315 msg="model weights" device=CPU size="3.6 GiB" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:326 msg="kv cache" device=CPU size="254.0 MiB" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:337 msg="compute graph" device=CPU size="1.1 GiB" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=backend.go:342 msg="total memory" size="4.9 GiB" ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=sched.go:470 msg="loaded runners" count=1 ollama_gpu | time=2025-10-14T02:49:41.936Z level=INFO source=server.go:1251 msg="waiting for llama runner to start responding" ollama_gpu | 
time=2025-10-14T02:49:41.937Z level=INFO source=server.go:1285 msg="waiting for server to become available" status="llm server loading model" ollama_gpu | time=2025-10-14T02:49:41.937Z level=DEBUG source=server.go:1295 msg="model load progress 0.00" ollama_gpu | time=2025-10-14T02:49:42.188Z level=DEBUG source=server.go:1295 msg="model load progress 0.85" ollama_gpu | time=2025-10-14T02:49:42.439Z level=DEBUG source=server.go:1295 msg="model load progress 0.91" ollama_gpu | time=2025-10-14T02:49:42.690Z level=DEBUG source=server.go:1295 msg="model load progress 0.98" ollama_gpu | time=2025-10-14T02:49:42.779Z level=DEBUG source=ggml.go:276 msg="key with type not found" key=gemma3.pooling_type default=0 ollama_gpu | time=2025-10-14T02:49:42.941Z level=INFO source=server.go:1289 msg="llama runner started in 2.67 seconds" ollama_gpu | time=2025-10-14T02:49:42.941Z level=DEBUG source=sched.go:482 msg="finished setting up" runner.name=registry.ollama.ai/library/gemma3:4b runner.inference=cuda runner.devices=1 runner.size="4.9 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=34 runner.model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 runner.num_ctx=4096 ollama_gpu | time=2025-10-14T02:49:42.941Z level=DEBUG source=server.go:1388 msg="completion request" images=0 prompt=71 format="" ollama_gpu | time=2025-10-14T02:49:42.983Z level=DEBUG source=vocabulary.go:52 msg="adding bos token to prompt" id=2 ollama_gpu | time=2025-10-14T02:49:42.983Z level=DEBUG source=cache.go:142 msg="loading cache slot" id=0 cache=0 prompt=12 used=0 remaining=12 ollama_gpu | time=2025-10-14T02:49:58.199Z level=DEBUG source=sched.go:490 msg="context for request finished" ollama_gpu | [GIN] 2025/10/14 - 02:49:58 | 200 | 18.75975421s | 192.168.140.54 | POST "/api/generate" ollama_gpu | time=2025-10-14T02:49:58.199Z level=DEBUG source=sched.go:286 msg="runner with non-zero duration has gone idle, adding timer" runner.name=registry.ollama.ai/library/gemma3:4b runner.inference=cuda runner.devices=1 runner.size="4.9 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=34 runner.model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 runner.num_ctx=4096 duration=5m0s ollama_gpu | time=2025-10-14T02:49:58.199Z level=DEBUG source=sched.go:304 msg="after processing request finished event" runner.name=registry.ollama.ai/library/gemma3:4b runner.inference=cuda runner.devices=1 runner.size="4.9 GiB" runner.vram="0 B" runner.parallel=1 runner.pid=34 runner.model=/root/.ollama/models/blobs/sha256-aeda25e63ebd698fab8638ffb778e68bed908b960d39d0becc650fa981609d25 runner.num_ctx=4096 refCount=0 ```
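The decisive lines in the log output above are `insufficient VRAM to load any model layers` and `offloaded 0/35 layers to GPU`: the runner fell back entirely to the CPU, which matches the high-CPU / idle-GPU symptom. As a minimal sketch (assuming the container is named `ollama_gpu`, as in the compose file below), those lines can be filtered straight out of the container logs:

```shell
# Hypothetical filter: pull the layer-offload and VRAM messages out of the
# ollama_gpu container's logs (container name assumed from the compose file below)
docker logs ollama_gpu 2>&1 | grep -E "offloaded|insufficient VRAM"
```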
Author
Owner

@optivisionlab commented on GitHub (Oct 14, 2025):

docker-compose.yml

version: "3.9"
services:
  ollama:
    container_name: ollama_gpu
    image: ollama/ollama:latest
    runtime: nvidia
    environment:
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - CUDA_VISIBLE_DEVICES=0
      - OLLAMA_DEBUG=1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              # count: all
              device_ids: ['1']
    volumes:
      - ./ollama:/root/.ollama
      - ./models:/models
    ports:
      - "11434:11434"
    logging:
      driver: json-file
      options:
        max-size: "5m"
        max-file: "2"
    restart: unless-stopped
<!-- gh-comment-id:3399919178 --> @optivisionlab commented on GitHub (Oct 14, 2025): docker-compose.yml ``` version: "3.9" services: ollama: container_name: ollama_gpu image: ollama/ollama:latest runtime: nvidia environment: - NVIDIA_DRIVER_CAPABILITIES=compute,utility - CUDA_VISIBLE_DEVICES=0 - OLLAMA_DEBUG=1 deploy: resources: reservations: devices: - driver: nvidia capabilities: [gpu] # count: all device_ids: ['1'] volumes: - ./ollama:/root/.ollama - ./models:/models ports: - "11434:11434" logging: driver: json-file options: max-size: "5m" max-file: "2" restart: unless-stopped ```
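With this compose file, a quick way to confirm that the container actually sees the GPU and that Ollama is offloading layers is to run `nvidia-smi` and `ollama ps` inside it. A minimal sketch, assuming the container name `ollama_gpu` from the file above:

```shell
# Check that the NVIDIA driver and the selected GPU are visible inside the container
docker exec ollama_gpu nvidia-smi

# Ask Ollama where the loaded model is running; the PROCESSOR column
# should show GPU rather than CPU once offloading works
docker exec ollama_gpu ollama ps
```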
Author
Owner

@optivisionlab commented on GitHub (Oct 14, 2025):

I think my problem might be the CUDA driver version.

Image
<!-- gh-comment-id:3399930746 --> @optivisionlab commented on GitHub (Oct 14, 2025): I think my problem might be Cuda Driver Version <img width="761" height="422" alt="Image" src="https://github.com/user-attachments/assets/fbe18150-42e1-42ed-82d3-9baee108ab0f" />
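For reference, the driver version the container sees can be queried directly with `nvidia-smi` (a sketch, again assuming the `ollama_gpu` container from the compose file above):

```shell
# Print only the NVIDIA driver version reported inside the container
docker exec ollama_gpu nvidia-smi --query-gpu=driver_version --format=csv,noheader
```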
Author
Owner

@optivisionlab commented on GitHub (Oct 14, 2025):

I upgraded the CUDA driver from 11.4 to 12.0, and Ollama now works with the GPU :D

Image
<!-- gh-comment-id:3400022526 --> @optivisionlab commented on GitHub (Oct 14, 2025): I upgrade cuda driver version 11.4 -> 12.0, ollam it is work with GPU :D <img width="615" height="405" alt="Image" src="https://github.com/user-attachments/assets/d60ba9fd-8821-4437-aa94-128178d0ed4e" />
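Worth noting: after upgrading the host driver, the container generally has to be recreated so the NVIDIA runtime re-mounts the new driver libraries. A sketch, assuming the compose file above (service name `ollama`):

```shell
# Recreate the Ollama container so it picks up the upgraded driver,
# then confirm the CUDA version the driver now reports inside it
docker compose up -d --force-recreate ollama
docker exec ollama_gpu nvidia-smi
```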
Reference: github-starred/ollama#70382