[GH-ISSUE #4342] echo 0 > /proc/sys/kernel/hung_task_timeout_secs, then the GPU hung, and in the end the server had to be rebooted #28464

Closed
opened 2026-04-22 06:39:24 -05:00 by GiteaMirror · 2 comments
Owner

Originally created by @YangFW on GitHub (May 11, 2024).
Original GitHub issue: https://github.com/ollama/ollama/issues/4342

What is the issue?

May 11 10:54:31 localhost kernel: [<ffffffff85fc539a>] ? system_call_fastpath+0x25/0x2a
May 11 10:54:31 localhost kernel: [<ffffffff85a71c21>] ? SyS_ioctl+0x81/0xa0
May 11 10:54:31 localhost kernel: [<ffffffff85a61d0e>] ? SYSC_newstat+0x2e/0x60
May 11 10:54:31 localhost kernel: [<ffffffff85a71988>] ? do_vfs_ioctl+0x3a8/0x5c0
May 11 10:54:31 localhost kernel: [<ffffffffc074c130>] ? nvidia_unlocked_ioctl+0x20/0x30 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc074bf28>] ? nvidia_ioctl.isra.22+0x728/0x910 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc1237e38>] ? rm_ioctl+0x58/0xb0 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc1230abb>] ? _nv000731rm+0x16b/0xeb0 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc07f8f9d>] ? _nv000577rm+0x5d/0x70 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc07eb1bc>] ? _nv045215rm+0x5c/0x90 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc07ead7f>] ? _nv045214rm+0x16f/0x320 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc1062745>] ? _nv047020rm+0x175/0x3a0 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc07e9614>] ? _nv047087rm+0x54/0xd0 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc0801f2c>] ? _nv045252rm+0x26c/0x2f0 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc10610c0>] _nv043989rm+0x10/0x40 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffffc0758132>] os_acquire_rwlock_write+0x42/0x50 [nvidia]
May 11 10:54:31 localhost kernel: [<ffffffff85fb716d>] down_write+0x2d/0x41
May 11 10:54:31 localhost kernel: [<ffffffff85bae557>] call_rwsem_down_write_failed+0x17/0x30
May 11 10:54:31 localhost kernel: [<ffffffff85a7519d>] ? __d_instantiate+0x2d/0xf0
May 11 10:54:31 localhost kernel: [<ffffffff85fb9455>] rwsem_down_write_failed+0x215/0x3c0
May 11 10:54:31 localhost kernel: [<ffffffff85fb7ca9>] schedule+0x29/0x70
May 11 10:54:31 localhost kernel: Call Trace:
May 11 10:54:31 localhost kernel: ollama_llama_se D ffff9ac9571205e0 0 1236 1183 0x00000080
May 11 10:54:31 localhost kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 11 10:54:31 localhost kernel: INFO: task ollama_llama_se:1236 blocked for more than 120 seconds.
May 11 10:52:39 localhost kernel: TCP: request_sock_TCP: Possible SYN flooding on port 42852. Sending cookies. Check SNMP counters.
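
The excerpt above was pasted newest-first, which makes the trace awkward to read. As a rough aid, the sketch below restores chronological order and pulls the task name, PID, and blocked duration out of the kernel's hung-task message. The helper names (`chronological`, `extract_hung_tasks`) are hypothetical and the regular expression is only matched against the message format visible in this log, so treat it as an illustration rather than a general parser.

```python
import re

# Matches the hung-task line as it appears in this log, e.g.
# "INFO: task ollama_llama_se:1236 blocked for more than 120 seconds."
HUNG_TASK = re.compile(
    r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds"
)


def chronological(lines):
    """Return the log lines oldest-first (the paste was newest-first)."""
    return list(reversed(lines))


def extract_hung_tasks(lines):
    """Return (task_name, pid, seconds) for every hung-task message found."""
    tasks = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            name, pid, seconds = match.groups()
            tasks.append((name, int(pid), int(seconds)))
    return tasks
```

Feeding the pasted lines through `chronological` first and then `extract_hung_tasks` gives the order in which the kernel actually reported events, with the hung task identified up front.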

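The server has 128 GB of RAM, 80 threads, and 2x RTX 4090; NVIDIA driver: 550, CUDA version: 12.4.
Ollama was running codegemma and worked normally, but after about a day it hung and became unusable.
The log above is in reverse chronological order. After that, the ollama service and the Python service both became completely unresponsive and could not be killed.
The contents of /etc/sysctl.conf have already been changed to the following: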
fs.inotify.max_user_watches=524288
vm.overcommit_memory = 1
net.core.somaxconn = 1024
vm.dirty_background_ratio=5
vm.dirty_ratio=10
net.ipv4.ip_forward=1
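In the end, the only way to resolve this was to reboot the server. Is there a better way to handle it? Is this an Ollama problem? Everything that needs to be configured on the server has already been configured.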
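
The `D` in the `ollama_llama_se D ...` log line is the kernel's uninterruptible-sleep state, which is generally why `kill` has no effect on such processes. A minimal sketch for listing every process currently in that state by reading `/proc/<pid>/stat` follows. The function names are hypothetical, and it assumes a Linux host with `/proc` mounted; it only reports stuck tasks and does not recover them.

```python
import os


def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process whose state is 'D'."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return []  # no procfs here (e.g. not Linux)
    found = []
    for entry in entries:
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Running this while the hang is in progress shows which processes are stuck, which can be attached to a bug report alongside the kernel trace.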

OS

Linux

GPU

Nvidia

CPU

No response

Ollama version

0.1.33

GiteaMirror added the bug label 2026-04-22 06:39:24 -05:00

@mchiang0610 commented on GitHub (May 11, 2024):

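Sorry about that. Could you try the latest Ollama release, 0.1.36? This issue should already be fixed.

If you still run into this problem, could you reopen this issue for us? Thanks.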


@woxiangbo commented on GitHub (Aug 16, 2024):

@mchiang0610 What luck to run into an expert who understands Chinese. Could you please take a look at this issue? I can't figure it out: https://github.com/ollama/ollama/issues/4131
Thanks!

Reference: github-starred/ollama#28464