[GH-ISSUE #4971] How to disallow the use of both GPU and CPU #65178

Open
opened 2026-05-03 19:56:27 -05:00 by GiteaMirror · 4 comments

Originally created by @xiaohanglei on GitHub (Jun 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4971

When a model is split across both the GPU and the CPU, the output becomes garbled, so I want a way to prohibit this scenario.

![image](https://github.com/ollama/ollama/assets/32543872/8dec48d9-12aa-4ccc-b2a6-6243ec1f6b27)
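Until the underlying bug is fixed, one way to avoid the split entirely is to pin inference to a single device. A minimal sketch, assuming a default local Ollama server on localhost:11434: the generate API accepts a num_gpu option, and setting it to 0 keeps all layers on the CPU (a large value such as 999 would instead request full GPU offload).

```python
import json
import urllib.request

# A minimal sketch, assuming a default Ollama server on localhost:11434.
# num_gpu 0 asks Ollama to keep every layer on the CPU, avoiding the
# partial CPU/GPU split that produces the garbled output above.
payload = {
    "model": "qwen:1.8b",
    "prompt": "Hello",
    "stream": False,
    "options": {"num_gpu": 0},  # 0 = offload no layers to the GPU
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```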

GiteaMirror added the feature request label 2026-05-03 19:56:27 -05:00

@jmorganca commented on GitHub (Jun 11, 2024):

Hi @xiaohanglei, what model is this? It definitely shouldn't be happening - sorry about that.


@xiaohanglei commented on GitHub (Jun 12, 2024):

Hello, the test model is based on qwen:1.8b, with some parameter values modified. The details are as follows:

```
FROM qwen:1.8b
TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant"
SYSTEM You are a helpful assistant.
PARAMETER top_p 0.7
PARAMETER num_ctx 4096
PARAMETER repeat_last_n -1
PARAMETER repeat_penalty 1.05
PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>
PARAMETER temperature 0.3
PARAMETER top_k 20
```
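For reference, a hedged sketch (not part of the original report) of how this Modelfile could be built and exercised from Python, assuming the ollama CLI is on PATH and the Modelfile is saved as ./Modelfile; the model name "qwen-test" is hypothetical:

```python
import subprocess

# Build the model from the Modelfile above, then send it a single prompt.
# Assumes the `ollama` CLI is installed and a ./Modelfile exists;
# "qwen-test" is a hypothetical model name chosen for this sketch.
subprocess.run(["ollama", "create", "qwen-test", "-f", "Modelfile"], check=True)
result = subprocess.run(
    ["ollama", "run", "qwen-test", "Hello, please introduce yourself."],
    check=True,
    capture_output=True,
    text=True,
)
print(result.stdout)
```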

The results of using the model are as follows:

![image](https://github.com/ollama/ollama/assets/32543872/1be042e6-9160-48fe-ade1-dc7bfc32f2d2)

The following are the output logs from Ollama when the issue occurs:

File: [ollama_output.txt](https://github.com/user-attachments/files/15797734/ollama_output.txt)

Testing Conclusion:

Testing shows that this issue is tied to the num_ctx parameter value: with num_ctx set to 2048 the issue does not occur, while at 4096 it is highly likely to reproduce.
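A small sketch to double-check that correlation: send the same prompt once with num_ctx 2048 and once with 4096 and compare the outputs for garbling. This assumes a local Ollama server on the default port; "qwen-test" is the hypothetical model name used in the build sketch above.

```python
import json
import urllib.request

# Compare output quality at two context lengths by overriding num_ctx
# per request. Assumes a default Ollama server on localhost:11434;
# "qwen-test" is a hypothetical model name.
def generate(num_ctx: int) -> str:
    payload = {
        "model": "qwen-test",
        "prompt": "Briefly explain what a large language model is.",
        "stream": False,
        "options": {"num_ctx": num_ctx},  # per-request context-length override
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

for ctx in (2048, 4096):
    print(f"--- num_ctx={ctx} ---")
    print(generate(ctx))
```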

Test Environment:

OS: Windows 10
GPU: NVIDIA GeForce GTX 1050 Ti
CPU: Intel Core i5-12490F
Ollama version: 0.1.41

Test Scenario:

  1. Use testing tools to raise the GPU memory load above 95%, so that the model is split between the CPU and GPU when it loads. Example images are shown below:

![image](https://github.com/ollama/ollama/assets/32543872/e516aafb-11dc-466b-887b-5bfd335d0a81)
![image](https://github.com/ollama/ollama/assets/32543872/74e65f53-1902-42a5-b031-52c606c6dd6a)

The following code is what I used to increase the GPU memory load for testing; it is provided for reference only:


```python
import torch

# Check whether a GPU is available
if torch.cuda.is_available():
    # Get the default GPU device
    device = torch.device('cuda')
    print(f'Using GPU: {torch.cuda.get_device_name(device)}')

    # Get the GPU's total memory capacity
    total_memory = torch.cuda.get_device_properties(device).total_memory
    print(f'Total GPU memory: {total_memory / (1024 ** 3):.2f} GB')

    # Compute the number of elements to allocate so that roughly 85% of
    # the GPU memory is occupied (each float32 element takes 4 bytes)
    target_memory = int(total_memory * 0.85)
    num_elements = target_memory // 4

    # Create one large tensor and allocate it on the GPU
    tensor = torch.zeros(num_elements, dtype=torch.float32, device=device)
    print(f'Allocated {target_memory / (1024 ** 3):.2f} GB of GPU memory, '
          f'which is 85% of the total GPU memory.')

    # Create two smaller tensors for matrix multiplication
    size = 1024  # adjust this size as needed
    tensor_a = torch.randn(size, size, device=device)
    tensor_b = torch.randn(size, size, device=device)

    # Keep the program running and do heavy work to raise GPU utilization
    try:
        while True:
            # Run many matrix multiplications to drive up GPU utilization
            for _ in range(800):  # adjust the loop count to control the load
                result = torch.matmul(tensor_a, tensor_b)
    except KeyboardInterrupt:
        print('Program terminated by user.')
else:
    print('No GPU available.')
```


@xiaohanglei commented on GitHub (Jun 12, 2024):

I suspect that this issue is similar to #4977


@xiaohanglei commented on GitHub (Jun 14, 2024):

> Hi @xiaohanglei, what model is this? It definitely shouldn't be happening - sorry about that.

@jmorganca A test scenario has been provided above.


Reference: github-starred/ollama#65178