[GH-ISSUE #9691] Ollama 0.6.0 with gemma3 can't load models from mounted Cloud Storage bucket on Cloud Run #32086

Closed
opened 2026-04-22 13:00:25 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @steren on GitHub (Mar 12, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9691

What is the issue?

Ollama is able to write the downloaded model to my mounted GCS bucket, but it fails to read it back when a new instance starts.

This only happens with gemma3; llama3 works.

As you can see in the configuration below, I mount a GCS bucket at /root/.ollama/ and use Direct VPC with Egress=all.

On the first run, Ollama properly pulls the model from the internet and stores it in the bucket. I can see it in the GCS bucket UI.

On the second run, Ollama attempts to load the model from GCS but never succeeds.

I get:

$ OLLAMA_HOST=https://ollama-gemma-250756049697.us-central1.run.app ollama run gemma3 --verbose
Error: unmarshal: invalid character 'u' looking for beginning of value

Here's my Cloud Run YAML:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: ollama-gemma
  namespace: '250756049697'
  selfLink: /apis/serving.knative.dev/v1/namespaces/250756049697/services/ollama-gemma
  uid: 0adc720d-749b-4645-bc0f-fc0d5917867b
  resourceVersion: AAYwIxodEac
  generation: 7
  creationTimestamp: '2025-03-11T14:06:14.692930Z'
  labels:
    cloud.googleapis.com/location: us-central1
  annotations:
    serving.knative.dev/creator: steren@google.com
    serving.knative.dev/lastModifier: steren@google.com
    run.googleapis.com/client-name: cloud-console
    run.googleapis.com/launch-stage: BETA
    run.googleapis.com/operation-id: 42b2cd7d-a01e-49fc-b540-0705667e9761
    run.googleapis.com/ingress: all
    run.googleapis.com/ingress-status: all
    run.googleapis.com/urls: >-
      ["https://ollama-gemma-250756049697.us-central1.run.app","https://ollama-gemma-ikoam4ya6a-uc.a.run.app"]
spec:
  template:
    metadata:
      labels:
        client.knative.dev/nonce: 18b59871-a0a2-489c-89b2-bae2d39d3172
        run.googleapis.com/startupProbeType: Default
      annotations:
        autoscaling.knative.dev/maxScale: '200'
        run.googleapis.com/client-name: cloud-console
        run.googleapis.com/vpc-access-egress: all-traffic
        run.googleapis.com/network-interfaces: '[{"network":"default","subnetwork":"default"}]'
        run.googleapis.com/cpu-throttling: 'false'
        run.googleapis.com/startup-cpu-boost: 'true'
    spec:
      containerConcurrency: 4
      timeoutSeconds: 300
      serviceAccountName: 250756049697-compute@developer.gserviceaccount.com
      containers:
      - name: ollama-1
        image: 'ollama/ollama:0.6.0'
        ports:
        - name: http1
          containerPort: 11434
        resources:
          limits:
            cpu: 8000m
            nvidia.com/gpu: '1'
            memory: 32Gi
        volumeMounts:
        - name: gcs-1
          mountPath: /root/.ollama/
        startupProbe:
          timeoutSeconds: 240
          periodSeconds: 240
          failureThreshold: 1
          tcpSocket:
            port: 11434
      volumes:
      - name: gcs-1
        csi:
          driver: gcsfuse.run.googleapis.com
          volumeAttributes:
            bucketName: cloud-run-gpu-demo-models
      nodeSelector:
        run.googleapis.com/accelerator: nvidia-l4
  traffic:
  - percent: 100
    latestRevision: true

2025-03-12 08:17:32.918
time=2025-03-12T15:17:32.917Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server error"

2025-03-12 08:17:32.933
time=2025-03-12T15:17:32.931Z level=INFO source=runner.go:882 msg="starting ollama engine"
2025-03-12 08:17:32.933
time=2025-03-12T15:17:32.932Z level=INFO source=runner.go:938 msg="Server listening on 127.0.0.1:41619"
2025-03-12 08:17:33.075
time=2025-03-12T15:17:33.074Z level=WARN source=ggml.go:149 msg="key not found" key=general.name default=""
2025-03-12 08:17:33.075
time=2025-03-12T15:17:33.074Z level=WARN source=ggml.go:149 msg="key not found" key=general.description default=""
2025-03-12 08:17:33.075
time=2025-03-12T15:17:33.074Z level=INFO source=ggml.go:67 msg="" architecture=gemma3 file_type=Q4_K_M name="" description="" num_tensors=883 num_key_values=35
2025-03-12 08:17:33.075
time=2025-03-12T15:17:33.074Z level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/lib/ollama/cuda_v12
2025-03-12 08:17:33.150
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
2025-03-12 08:17:33.150
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
2025-03-12 08:17:33.150
ggml_cuda_init: found 1 CUDA devices:
2025-03-12 08:17:33.150
  Device 0: NVIDIA L4, compute capability 8.9, VMM: yes
2025-03-12 08:17:33.150
load_backend: loaded CUDA backend from /usr/lib/ollama/cuda_v12/libggml-cuda.so
2025-03-12 08:17:33.151
time=2025-03-12T15:17:33.150Z level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib
2025-03-12 08:17:33.151
time=2025-03-12T15:17:33.150Z level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib64
2025-03-12 08:17:33.151
time=2025-03-12T15:17:33.150Z level=DEBUG source=ggml.go:93 msg="skipping path which is not part of ollama" path=/usr/local/nvidia/lib64
2025-03-12 08:17:33.151
time=2025-03-12T15:17:33.150Z level=DEBUG source=ggml.go:99 msg="ggml backend load all from path" path=/usr/lib/ollama
2025-03-12 08:17:33.153
ggml_backend_load_best: /usr/lib/ollama/libggml-cpu-alderlake.so score: 0
2025-03-12 08:17:33.153
ggml_backend_load_best: /usr/lib/ollama/libggml-cpu-haswell.so score: 55
2025-03-12 08:17:33.153
ggml_backend_load_best: /usr/lib/ollama/libggml-cpu-icelake.so score: 0
2025-03-12 08:17:33.154
ggml_backend_load_best: /usr/lib/ollama/libggml-cpu-sandybridge.so score: 20
2025-03-12 08:17:33.154
ggml_backend_load_best: /usr/lib/ollama/libggml-cpu-skylakex.so score: 183
2025-03-12 08:17:33.156
load_backend: loaded CPU backend from /usr/lib/ollama/libggml-cpu-skylakex.so
2025-03-12 08:17:33.156
time=2025-03-12T15:17:33.155Z level=INFO source=ggml.go:109 msg=system CPU.0.SSE3=1 CPU.0.SSSE3=1 CPU.0.AVX=1 CPU.0.AVX2=1 CPU.0.F16C=1 CPU.0.FMA=1 CPU.0.AVX512=1 CPU.0.LLAMAFILE=1 CPU.1.LLAMAFILE=1 CUDA.0.ARCHS=500,600,610,700,750,800,860,870,890,900,1200 CUDA.0.USE_GRAPHS=1 CUDA.0.PEER_MAX_BATCH_SIZE=128 compiler=cgo(gcc)
2025-03-12 08:17:33.169
time=2025-03-12T15:17:33.168Z level=INFO source=server.go:619 msg="waiting for server to become available" status="llm server loading model"
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.209Z level=DEBUG source=ggml.go:220 msg="created tensor" name=mm.mm_input_projection.weight shape="[2560 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=mm.mm_soft_emb_norm.weight shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=output_norm.weight shape=[2560] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=token_embd.weight shape="[2560 262144]" dtype=14 buffer_type=CPU
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=output.weight shape="[2560 262144]" dtype=14 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_k.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_k.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_output.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_output.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_q.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_q.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_v.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.attn_v.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.layer_norm1.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.layer_norm1.weight shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.layer_norm2.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.layer_norm2.weight shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.mlp.fc1.bias shape=[4304] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.mlp.fc1.weight shape="[1152 4304]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.mlp.fc2.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.0.mlp.fc2.weight shape="[4304 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_k.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_k.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_output.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_output.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_q.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_q.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_v.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.attn_v.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.layer_norm1.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.layer_norm1.weight shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.layer_norm2.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.210Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.layer_norm2.weight shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.mlp.fc1.bias shape=[4304] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.mlp.fc1.weight shape="[1152 4304]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.mlp.fc2.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.1.mlp.fc2.weight shape="[4304 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_k.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_k.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.211
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_output.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_output.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_q.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_q.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_v.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.attn_v.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.layer_norm1.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.layer_norm1.weight shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.layer_norm2.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.layer_norm2.weight shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.mlp.fc1.bias shape=[4304] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.mlp.fc1.weight shape="[1152 4304]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.mlp.fc2.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.10.mlp.fc2.weight shape="[4304 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.11.attn_k.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.11.attn_k.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.11.attn_output.bias shape=[1152] dtype=0 buffer_type=CUDA0
2025-03-12 08:17:33.212
time=2025-03-12T15:17:33.211Z level=DEBUG source=ggml.go:220 msg="created tensor" name=v.blk.11.attn_output.weight shape="[1152 1152]" dtype=1 buffer_type=CUDA0

Relevant log output


OS

No response

GPU

No response

CPU

No response

Ollama version

No response

GiteaMirror added the bug label 2026-04-22 13:00:25 -05:00
Author
Owner

@steren commented on GitHub (Mar 28, 2025):

It seems the newest versions of Ollama work.

<!-- gh-comment-id:2762683642 -->
Author
Owner

@steren commented on GitHub (Mar 31, 2025):

Actually, re-opening. A Googler was struggling to load Gemma from GCS.

<!-- gh-comment-id:2767345085 -->

Reference: github-starred/ollama#32086