[GH-ISSUE #9369] amdgpu: Queue memory allocated to wrong device #68173

Closed
opened 2026-05-04 12:44:25 -05:00 by GiteaMirror · 3 comments

Originally created by @pshirshov on GitHub (Feb 26, 2025).
Original GitHub issue: https://github.com/ollama/ollama/issues/9369

What is the issue?

I'm trying to run Ollama on a dual-GPU system with a W7900 and a 7900 XTX. Both cards have the same GFX version. Regardless of the value of OLLAMA_SCHED_SPREAD, I always get a kernel oops:

[ 7933.858592] amdgpu: Queue memory allocated to wrong device
[ 7933.858600] BUG: unable to handle page fault for address: 0000000200000142
[ 7933.858602] #PF: supervisor read access in kernel mode
[ 7933.858603] #PF: error_code(0x0000) - not-present page
[ 7933.858604] PGD 23b668067 P4D 23b668067 PUD 0
[ 7933.858606] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[ 7933.858609] CPU: 13 UID: 61547 PID: 320580 Comm: .ollama-wrapped Tainted: P           O       6.12.13 #1-NixOS
[ 7933.858611] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE
[ 7933.858612] Hardware name: ASUS System Product Name/PRIME X870-P, BIOS 1001 01/11/2025
[ 7933.858613] RIP: 0010:amdgpu_amdkfd_free_gtt_mem+0x15/0xa0 [amdgpu]
[ 7933.858754] Code: 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48 8b 2e 48 89 f3 31 f6 <48> 8b bd 40 01 00 00 4c 8b a5 a8 01 00 00 e8 b8 16 95 fb 83 f8 fc
[ 7933.858756] RSP: 0018:ffffa2ed3631bc10 EFLAGS: 00010246
[ 7933.858757] RAX: ffffa0082eefb600 RBX: ffff9ffe0919ac00 RCX: 0000000000000000
[ 7933.858758] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ffab0880000
[ 7933.858759] RBP: 0000000200000002 R08: 0000000000000000 R09: 0000000000000000
[ 7933.858759] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 7933.858760] R13: ffff9ffa973b4200 R14: 00000000ffffffea R15: ffff9ffa9b794c00
[ 7933.858761] FS:  00007f6f7f7fe6c0(0000) GS:ffffa009bd880000(0000) knlGS:0000000000000000
[ 7933.858761] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7933.858762] CR2: 0000000200000142 CR3: 0000000200e86000 CR4: 0000000000f50ef0
[ 7933.858763] PKRU: 55555554
[ 7933.858763] Call Trace:
[ 7933.858765]  <TASK>
[ 7933.858768]  ? __die_body.cold+0x19/0x2d
[ 7933.858771]  ? page_fault_oops+0x174/0x2f0
[ 7933.858773]  ? exc_page_fault+0x71/0x160
[ 7933.858775]  ? asm_exc_page_fault+0x26/0x30
[ 7933.858777]  ? amdgpu_amdkfd_free_gtt_mem+0x15/0xa0 [amdgpu]
[ 7933.858877]  init_user_queue.isra.0.cold+0x57/0x59 [amdgpu]
[ 7933.859012]  pqm_create_queue+0x1d6/0x530 [amdgpu]
[ 7933.859117]  kfd_ioctl_create_queue+0x236/0x630 [amdgpu]
[ 7933.859204]  kfd_ioctl+0x2dd/0x4b0 [amdgpu]
[ 7933.859282]  ? __pfx_kfd_ioctl_create_queue+0x10/0x10 [amdgpu]
[ 7933.859357]  __x64_sys_ioctl+0x99/0xe0
[ 7933.859359]  do_syscall_64+0xb7/0x210
[ 7933.859361]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[ 7933.859363] RIP: 0033:0x7f6fcc22384f
[ 7933.859378] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 28 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[ 7933.859379] RSP: 002b:00007f6f7f7fc430 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 7933.859381] RAX: ffffffffffffffda RBX: 00007f6f7f7fc4e0 RCX: 00007f6fcc22384f
[ 7933.859381] RDX: 00007f6f7f7fc4e0 RSI: 00000000c0584b02 RDI: 0000000000000003
[ 7933.859382] RBP: 0000000000000003 R08: 0000000000000001 R09: 0000000002bea000
[ 7933.859382] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f6f7f7fc890
[ 7933.859383] R13: 00007f6f84006000 R14: 00000000c0584b02 R15: 0000000000000000
[ 7933.859384]  </TASK>
[ 7933.859384] Modules linked in: sd_mod xt_MASQUERADE xt_mark nft_chain_nat nf_nat rfcomm snd_seq_dummy snd_hrtimer snd_seq qrtr r8153_ecm cdc_ether usbnet bnep ccm algif_aead crypto_null des3_ede_x86_64 cbc des_generic libdes algif_skcipher cmac md4 algif_hash af_alg msr nls_iso8859_1 nls_cp437 vfat fat snd_hda_codec_realtek snd_hda_codec_generic snd_hda_scodec_component cfg80211 xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 snd_ctl_led ip6t_rpfilter ipt_rpfilter xt_pkttype xt_LOG edac_mce_amd snd_hda_codec_hdmi nf_log_syslog edac_core xt_tcpudp btusb amd_atl nft_compat intel_rapl_msr btrtl snd_hda_intel intel_rapl_common btintel crct10dif_pclmul crc32_pclmul btbcm snd_intel_dspcfg polyval_clmulni btmtk polyval_generic snd_intel_sdw_acpi nf_tables snd_usb_audio r8125(O) ghash_clmulni_intel snd_hda_codec spd5118 sha512_ssse3 libcrc32c crc32c_generic bluetooth sha256_ssse3 crc32c_intel snd_usbmidi_lib sha1_ssse3 snd_hda_core r8152 sp5100_tco eeepc_wmi snd_ump aesni_intel asus_wmi snd_rawmidi watchdog gf128mul
[ 7933.859411]  snd_hwdep snd_seq_device crypto_simd mii platform_profile cryptd mc battery snd_pcm libphy i8042 input_leds hid_jabra sch_fq_codel wmi_bmof i2c_piix4 sparse_keymap rapl joydev mousedev led_class snd_timer i2c_smbus rfkill snd k10temp soundcore ucsi_acpi typec_ucsi rtc_cmos typec roles uinput gpio_amdpt atkbd gpio_generic tiny_power_button button libps2 serio evdev vivaldi_fmap mac_hid loop cpufreq_powersave tun tap macvlan vfio_pci vfio_pci_core vfio_iommu_type1 vfio iommufd kvm_amd ccp kvm fuse efi_pstore configfs nfnetlink zram 842_decompress 842_compress lz4hc_compress lz4_compress dmi_sysfs ip_tables x_tables bridge stp llc hid_generic dm_mod dax af_packet usbhid hid amdgpu ahci libahci crc16 amdxcp i2c_algo_bit libata drm_ttm_helper ttm drm_exec gpu_sched drm_suballoc_helper drm_buddy thunderbolt zfs(PO) xhci_pci xhci_hcd drm_display_helper nvme scsi_mod nvme_core cec tpm_crb scsi_common nvme_auth video tpm_tis tpm_tis_core wmi spl(O) efivarfs tpm rng_core libaescfb ecdh_generic ecc autofs4
[ 7933.859447] CR2: 0000000200000142
[ 7933.859448] ---[ end trace 0000000000000000 ]---
[ 7933.967953] RIP: 0010:amdgpu_amdkfd_free_gtt_mem+0x15/0xa0 [amdgpu]
[ 7933.968088] Code: 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 41 54 55 53 48 8b 2e 48 89 f3 31 f6 <48> 8b bd 40 01 00 00 4c 8b a5 a8 01 00 00 e8 b8 16 95 fb 83 f8 fc
[ 7933.968089] RSP: 0018:ffffa2ed3631bc10 EFLAGS: 00010246
[ 7933.968091] RAX: ffffa0082eefb600 RBX: ffff9ffe0919ac00 RCX: 0000000000000000
[ 7933.968092] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9ffab0880000
[ 7933.968093] RBP: 0000000200000002 R08: 0000000000000000 R09: 0000000000000000
[ 7933.968093] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 7933.968094] R13: ffff9ffa973b4200 R14: 00000000ffffffea R15: ffff9ffa9b794c00
[ 7933.968094] FS:  00007f6f7f7fe6c0(0000) GS:ffffa009bd880000(0000) knlGS:0000000000000000
[ 7933.968095] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7933.968096] CR2: 0000000200000142 CR3: 0000000200e86000 CR4: 0000000000f50ef0
[ 7933.968096] PKRU: 55555554
[ 7933.968097] note: .ollama-wrapped[320580] exited with irqs disabled

My kernel version is 6.12.13 and my ROCm version is 6.0.2.
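
As a reading aid for the trace above, here is a minimal Python sketch that checks where the faulting address comes from. It assumes the marked instruction bytes `<48> 8b bd 40 01 00 00` decode to a 64-bit load from `[rbp + 0x140]`; that decoding is an interpretation of the dump, not something the log states. The helper name `parse_registers` is made up for illustration.

```python
import re

# Register lines copied verbatim from the oops above.
OOPS = """\
RBP: 0000000200000002 R08: 0000000000000000 R09: 0000000000000000
CR2: 0000000200000142 CR3: 0000000200e86000 CR4: 0000000000f50ef0
"""

def parse_registers(text):
    """Return {name: int} for every 'NAME: <16 hex digits>' pair in the text."""
    pairs = re.findall(r"\b([A-Z0-9]+): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}

regs = parse_registers(OOPS)

# Assumption: the four bytes after '48 8b bd' are a little-endian 32-bit
# displacement applied to RBP.
displacement = int.from_bytes(bytes.fromhex("40010000"), "little")

fault_address = regs["RBP"] + displacement
print(hex(fault_address), fault_address == regs["CR2"])
```

If that decoding is right, the page fault is a read through RBP, and RBP (0x200000002) does not look like a kernel pointer when compared with the other ffff... values in the register dump. That would be consistent with amdgpu_amdkfd_free_gtt_mem being handed an invalid object on the error path, but the trace alone does not prove it.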

Relevant log output


OS

Linux

GPU

AMD

CPU

AMD

Ollama version

0.5.12

GiteaMirror added the bug label 2026-05-04 12:44:25 -05:00

@pshirshov commented on GitHub (Feb 26, 2025):

Another message I see a lot when I try to use the dual-GPU setup is: [ 569.997757] amdgpu: Queue memory allocated to wrong device
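
To see how often this message shows up, a small Python sketch along these lines could scan saved kernel log text. The message string is taken from the log in this issue; the function name `find_wrong_device_events` and the sample input are illustrative only.

```python
import re

# Message text as it appears in the kernel log quoted in this issue.
MESSAGE = "amdgpu: Queue memory allocated to wrong device"
# Matches the default dmesg layout: "[ seconds.micros] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

def find_wrong_device_events(log_text):
    """Return the kernel timestamps (seconds since boot) of each matching line."""
    timestamps = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(2) == MESSAGE:
            timestamps.append(float(match.group(1)))
    return timestamps

sample = """\
[  569.997757] amdgpu: Queue memory allocated to wrong device
[ 7933.858592] amdgpu: Queue memory allocated to wrong device
[ 7933.858600] BUG: unable to handle page fault for address: 0000000200000142
"""
print(find_wrong_device_events(sample))
```

In practice the input would be the output of `dmesg` (reading it may require elevated privileges on some systems), which makes it easy to tell whether the message appears on every model load or only right before the oops.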


@pshirshov commented on GitHub (Feb 26, 2025):

This is likely caused by my slightly outdated ROCm version or some NixOS-specific issue. I ran Ollama in Docker on the same machine and it worked well, scheduling the model across both GPUs.


@pshirshov commented on GitHub (Mar 7, 2025):

Works with ROCm 6.3.3.

Reference: github-starred/ollama#68173