Bug ID 1228869
Summary amdgpu: "kernel NULL pointer dereference" after resume from Suspend To RAM, coming from RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
Classification openSUSE
Product openSUSE Tumbleweed
Version Current
Hardware x86-64
OS openSUSE Tumbleweed
Status NEW
Severity Major
Priority P5 - None
Component Kernel
Assignee kernel-bugs@opensuse.org
Reporter mail@holad.de
QA Contact qa-bugs@suse.de
Target Milestone ---
Found By ---
Blocker ---

Hi All!

I'm seeing sporadic 'BUG: kernel NULL pointer dereference' errors coming from
the kernel module 'amdgpu' after resuming from Suspend To RAM, which renders
the entire device unusable. The problem also occurs with 6.9.x.

es-ws:/home/had # lsb_release -a
LSB Version:    n/a
Distributor ID: openSUSE
Description:    openSUSE Tumbleweed
Release:        20240801
Codename:       n/a

es-ws:/home/had # sudo lshw -c cpu | grep product
       product: AMD Ryzen 9 5900X 12-Core Processor

es-ws:/home/had # lspci -nn | grep "VGA\|Display"
09:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/570X/580/580X/590] [1002:67df] (rev ef)

es-ws:/home/had # sudo dmidecode -t BIOS | grep Version 
        Version: 5201

es-ws:/home/had # uname -a
Linux es-ws 6.10.2-1-default #1 SMP PREEMPT_DYNAMIC Mon Jul 29 08:51:47 UTC 2024 (65a34e2) x86_64 x86_64 x86_64 GNU/Linux


Error:

[170947.891041] [ T995603] BUG: kernel NULL pointer dereference, address:
0000000000000008
[170947.891045] [ T995603] #PF: supervisor read access in kernel mode
[170947.891046] [ T995603] #PF: error_code(0x0000) - not-present page
[170947.891048] [ T995603] PGD 0 P4D 0 
[170947.891050] [ T995603] Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
[170947.891052] [ T995603] CPU: 0 PID: 995603 Comm: kscreenloc:cs0 Tainted: G  
        O       6.10.2-1-default #1 openSUSE Tumbleweed
b3f1bf5ce4f399ce3dea9a7ecea1c9a6e505f1e8
[170947.891055] [ T995603] Hardware name: System manufacturer System Product
Name/ROG STRIX B450-F GAMING, BIOS 5201 08/10/2023
[170947.891057] [ T995603] RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
[170947.891064] [ T995603] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00
00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 61 37 00 00 48 8b 45
10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
[170947.891065] [ T995603] RSP: 0018:ffff9ffe16f97a28 EFLAGS: 00010206
[170947.891067] [ T995603] RAX: 0000000000000000 RBX: ffff911d64d78000 RCX:
ffff91179c4826d0
[170947.891069] [ T995603] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff911bb871f838
[170947.891070] [ T995603] RBP: ffff911bb871f810 R08: ffff91175b733b28 R09:
ffff9ffe16f97898
[170947.891071] [ T995603] R10: ffff9ffe16f97890 R11: 0000000000000003 R12:
0000000000000001
[170947.891072] [ T995603] R13: ffff9ffe16f97aa8 R14: 0000000000000000 R15:
ffff9117497fb000
[170947.891073] [ T995603] FS:  00007f68396006c0(0000)
GS:ffff91263e600000(0000) knlGS:0000000000000000
[170947.891074] [ T995603] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[170947.891075] [ T995603] CR2: 0000000000000008 CR3: 0000000d2c836000 CR4:
0000000000750ef0
[170947.891076] [ T995603] PKRU: 55555554
[170947.891077] [ T995603] Call Trace:
[170947.891079] [ T995603]  <TASK>
[170947.891081] [ T995603]  ? __die_body.cold+0x14/0x24
[170947.891086] [ T995603]  ? page_fault_oops+0x134/0x2c0
[170947.891088] [ T995603]  ? prb_read_valid+0x1b/0x30
[170947.891091] [ T995603]  ? console_unlock+0x6a/0x110
[170947.891094] [ T995603]  ? exc_page_fault+0x73/0x170
[170947.891096] [ T995603]  ? asm_exc_page_fault+0x26/0x30
[170947.891100] [ T995603]  ? drm_sched_job_arm+0x23/0x60 [gpu_sched
6729bac8a830f3cdc041e097e9ebbdee5be285e0]
[170947.891103] [ T995603]  ? drm_sched_job_arm+0x1f/0x60 [gpu_sched
6729bac8a830f3cdc041e097e9ebbdee5be285e0]
[170947.891106] [ T995603]  amdgpu_cs_ioctl+0x144e/0x1a00 [amdgpu
90bf0372d613afc3edf1507ca6f0e3c0c94513e4]
[170947.891348] [ T995603]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu
90bf0372d613afc3edf1507ca6f0e3c0c94513e4]
[170947.891546] [ T995603]  drm_ioctl_kernel+0xaa/0x100
[170947.891552] [ T995603]  drm_ioctl+0x25d/0x4c0
[170947.891555] [ T995603]  ? __pfx_amdgpu_cs_ioctl+0x10/0x10 [amdgpu
90bf0372d613afc3edf1507ca6f0e3c0c94513e4]
[170947.891736] [ T995603]  ? try_to_wake_up+0x1fc/0x640
[170947.891741] [ T995603]  amdgpu_drm_ioctl+0x4e/0x90 [amdgpu
90bf0372d613afc3edf1507ca6f0e3c0c94513e4]
[170947.891913] [ T995603]  __x64_sys_ioctl+0x94/0xd0
[170947.891917] [ T995603]  do_syscall_64+0x82/0x160
[170947.891921] [ T995603]  ? syscall_exit_to_user_mode+0x72/0x220
[170947.891923] [ T995603]  ? do_syscall_64+0x8e/0x160
[170947.891925] [ T995603]  ? amdgpu_drm_ioctl+0x71/0x90 [amdgpu
90bf0372d613afc3edf1507ca6f0e3c0c94513e4]
[170947.892096] [ T995603]  ? syscall_exit_to_user_mode+0x72/0x220
[170947.892099] [ T995603]  ? do_syscall_64+0x8e/0x160
[170947.892100] [ T995603]  ? do_syscall_64+0x8e/0x160
[170947.892101] [ T995603]  ? __sysvec_apic_timer_interrupt+0x55/0x100
[170947.892104] [ T995603]  ? __irq_exit_rcu+0x38/0xb0
[170947.892107] [ T995603]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[170947.892109] [ T995603] RIP: 0033:0x7f684590f3df
[170947.892142] [ T995603] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04
24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f
05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[170947.892143] [ T995603] RSP: 002b:00007f68395ff720 EFLAGS: 00000246
ORIG_RAX: 0000000000000010
[170947.892145] [ T995603] RAX: ffffffffffffffda RBX: 00007f68395ff8f8 RCX:
00007f684590f3df
[170947.892147] [ T995603] RDX: 00007f68395ff7f0 RSI: 00000000c0186444 RDI:
000000000000000e
[170947.892148] [ T995603] RBP: 00007f68395ff7f0 R08: 00007f68395ff970 R09:
00007f68395ff7c0
[170947.892149] [ T995603] R10: 000055e5fad95930 R11: 0000000000000246 R12:
00000000c0186444
[170947.892150] [ T995603] R13: 000000000000000e R14: 00007f68395ff8f8 R15:
000055e5fad96150
[170947.892152] [ T995603]  </TASK>
[170947.892153] [ T995603] Modules linked in: ftdi_sio usbserial sctp
ip6_udp_tunnel udp_tunnel uinput rfcomm snd_seq_dummy snd_hrtimer snd_seq ccm
af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet
nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables vboxnetadp(O)
vboxnetflt(O) qrtr vboxdrv(O) cmac algif_hash algif_skcipher af_alg bnep
nls_iso8859_1 nls_cp437 vfat fat iwlmvm mac80211 snd_hda_codec_realtek libarc4
snd_hda_codec_generic snd_hda_scodec_component snd_hda_codec_hdmi uvcvideo
intel_rapl_msr amd_atl snd_usb_audio intel_rapl_common btusb videobuf2_vmalloc
btrtl uvc videobuf2_memops snd_hda_intel btintel videobuf2_v4l2 snd_usbmidi_lib
edac_mce_amd snd_intel_dspcfg snd_ump btbcm snd_intel_sdw_acpi iwlwifi videodev
snd_rawmidi btmtk eeepc_wmi snd_hda_codec kvm_amd asus_wmi videobuf2_common
snd_seq_device battery bluetooth joydev mc asus_wmi_sensors cfg80211
snd_hda_core kvm snd_hwdep platform_profile sparse_keymap snd_pcm rfkill igb
[170947.892191] [ T995603]  snd_timer mxm_wmi wmi_bmof snd pcspkr dca k10temp
acpi_cpufreq i2c_piix4 soundcore gpio_amdpt gpio_generic tiny_power_button
button tcp_bbr sch_fq configfs nvme_fabrics fuse loop efi_pstore nfnetlink
dmi_sysfs ip_tables x_tables dm_crypt essiv authenc trusted asn1_encoder tee
hid_logitech_hidpp hid_logitech_dj hid_plantronics hid_generic usbhid amdgpu
crct10dif_pclmul ahci crc32_pclmul libahci polyval_clmulni polyval_generic
gf128mul libata ghash_clmulni_intel sha512_ssse3 video sha256_ssse3 amdxcp
xhci_pci i2c_algo_bit sha1_ssse3 xhci_pci_renesas drm_ttm_helper ttm sd_mod
drm_exec scsi_dh_emc xhci_hcd nvme scsi_dh_rdac gpu_sched scsi_dh_alua
drm_suballoc_helper sg drm_buddy aesni_intel drm_display_helper crypto_simd
nvme_core scsi_mod usbcore cryptd ccp cec nvme_auth sp5100_tco rc_core
scsi_common t10_pi wmi btrfs blake2b_generic libcrc32c crc32c_intel xor
raid6_pq dm_mod msr i2c_dev efivarfs
[170947.892231] [ T995603] CR2: 0000000000000008
[170947.892233] [ T995603] ---[ end trace 0000000000000000 ]---
[170947.892235] [ T995603] RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
[170947.892239] [ T995603] Code: 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00
00 55 53 48 8b 6f 60 48 85 ed 74 3f 48 89 fb 48 89 ef e8 61 37 00 00 48 8b 45
10 <48> 8b 50 08 48 89 53 18 8b 45 24 89 43 5c b8 01 00 00 00 f0 48 0f
[170947.892240] [ T995603] RSP: 0018:ffff9ffe16f97a28 EFLAGS: 00010206
[170947.892241] [ T995603] RAX: 0000000000000000 RBX: ffff911d64d78000 RCX:
ffff91179c4826d0
[170947.892242] [ T995603] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
ffff911bb871f838
[170947.892243] [ T995603] RBP: ffff911bb871f810 R08: ffff91175b733b28 R09:
ffff9ffe16f97898
[170947.892244] [ T995603] R10: ffff9ffe16f97890 R11: 0000000000000003 R12:
0000000000000001
[170947.892245] [ T995603] R13: ffff9ffe16f97aa8 R14: 0000000000000000 R15:
ffff9117497fb000
[170947.892246] [ T995603] FS:  00007f68396006c0(0000)
GS:ffff91263e600000(0000) knlGS:0000000000000000
[170947.892248] [ T995603] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[170947.892249] [ T995603] CR2: 0000000000000008 CR3: 0000000d2c836000 CR4:
0000000000750ef0
[170947.892250] [ T995603] PKRU: 55555554
[170947.892250] [ T995603] note: kscreenloc:cs0[995603] exited with irqs
disabled
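
For anyone triaging this, here is a minimal Python sketch that pulls the key facts out of a saved excerpt of the oops above. The helper name `summarize_oops` and the dictionary keys are hypothetical and not part of any kernel tooling; the sample lines are copied from the log in this report. It only assumes the mail-wrapped dmesg format shown here.

```python
import re

# Excerpt copied from the oops in this report (including the mail line wrapping).
SAMPLE = """
[170947.891061] [ T995603] RIP: 0010:drm_sched_job_arm+0x23/0x60 [gpu_sched]
[170947.891067] [ T995603] RAX: 0000000000000000 RBX: ffff911d64d78000 RCX:
ffff91179c4826d0
[170947.891075] [ T995603] CR2: 0000000000000008 CR3: 0000000d2c836000 CR4:
0000000000750ef0
"""


def summarize_oops(log_text):
    """Hypothetical helper: extract faulting symbol, module, CR2 and RAX."""
    # Collapse all whitespace so values wrapped onto the next line are
    # rejoined with the register name that precedes them.
    flat = " ".join(log_text.split())

    # Kernel-mode RIP line: "RIP: 0010:<symbol>+<offset>/<size> [<module>]".
    rip = re.search(
        r"RIP: 0010:([\w.]+)\+(0x[0-9a-f]+)/(0x[0-9a-f]+)(?: \[(\w+)\])?", flat
    )
    cr2 = re.search(r"CR2: ([0-9a-f]{16})", flat)
    rax = re.search(r"RAX: ([0-9a-f]{16})", flat)

    return {
        "function": rip.group(1) if rip else None,
        "offset": int(rip.group(2), 16) if rip else None,
        "module": rip.group(4) if rip else None,
        "fault_address": int(cr2.group(1), 16) if cr2 else None,
        "rax": int(rax.group(1), 16) if rax else None,
    }


if __name__ == "__main__":
    info = summarize_oops(SAMPLE)
    print(info)
    # A small fault address together with RAX == 0 is consistent with
    # reading a struct member through a NULL pointer.
    if info["rax"] == 0 and info["fault_address"] is not None:
        print("likely member offset: %#x" % info["fault_address"])
```

In the Code: line, the marked bytes `<48> 8b 50 08` decode to `mov rdx, [rax+0x8]`. With RAX being 0 that matches the faulting address 0x8 reported in CR2, so the crash looks like a read of a member at offset 8 through a NULL pointer inside drm_sched_job_arm. Which pointer that is in the source is not something I have verified.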

For comparison, the amdgpu messages from the startup of a working system:

[    3.580412] [    T661] [drm] amdgpu kernel modesetting enabled.
[    3.580488] [    T661] amdgpu: Virtual CRAT table created for CPU
[    3.580500] [    T661] amdgpu: Topology: Add CPU node
[    3.580878] [    T661] amdgpu 0000:09:00.0: No more image in the PCI ROM
[    3.580895] [    T661] amdgpu 0000:09:00.0: amdgpu: Fetched VBIOS from ROM
BAR
[    3.580896] [    T661] amdgpu: ATOM BIOS: 113-5E353BU-O6G
[    3.643676] [    T661] amdgpu 0000:09:00.0: vgaarb: deactivate vga console
[    3.643679] [    T661] amdgpu 0000:09:00.0: amdgpu: Trusted Memory Zone
(TMZ) feature not supported
[    3.644529] [    T661] amdgpu 0000:09:00.0: amdgpu: VRAM: 8192M
0x000000F400000000 - 0x000000F5FFFFFFFF (8192M used)
[    3.644531] [    T661] amdgpu 0000:09:00.0: amdgpu: GART: 256M
0x000000FF00000000 - 0x000000FF0FFFFFFF
[    3.644597] [    T661] [drm] amdgpu: 8192M of VRAM memory ready
[    3.644598] [    T661] [drm] amdgpu: 32108M of GTT memory ready.
[    3.652965] [    T661] amdgpu: hwmgr_sw_init smu backed is polaris10_smu
[    4.401624] [    T661] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    4.401630] [    T661] kfd kfd: amdgpu: Total number of KFD nodes to be
created: 1
[    4.401696] [    T661] amdgpu: Virtual CRAT table created for GPU
[    4.401744] [    T661] amdgpu: Topology: Add dGPU node [0x67df:0x1002]
[    4.401746] [    T661] kfd kfd: amdgpu: added device 1002:67df
[    4.401762] [    T661] amdgpu 0000:09:00.0: amdgpu: SE 4, SH per SE 1, CU
per SH 9, active_cu_number 32
[    4.404858] [    T661] amdgpu 0000:09:00.0: amdgpu: Using BACO for runtime
pm
[    4.405143] [    T661] [drm] Initialized amdgpu 3.57.0 20150101 for
0000:09:00.0 on minor 1
[    4.423609] [    T661] fbcon: amdgpudrmfb (fb0) is primary device
[    4.558249] [    T661] amdgpu 0000:09:00.0: [drm] fb0: amdgpudrmfb frame
buffer device
[   15.488832] [   T1172] snd_hda_intel 0000:09:00.1: bound 0000:09:00.0 (ops
amdgpu_dm_audio_component_bind_ops [amdgpu])


Best Regards,
Holger

