5.15.160 kernel breaks amdgpu driver
by fsLeg from LinuxQuestions.org on (#6NCSQ)
I have a laptop with AMD integrated graphics and an Nvidia discrete GPU that runs Slackware 15. I use the amdgpu driver for graphics and Nvidia for 3D stuff like games.
A few days ago Pat released the 5.15.160 kernel, which fixed a whole bunch of vulnerabilities, including CVE-2024-1086 (the netfilter one) that everyone was talking about, so today I finally upgraded the kernel. But when I rebooted as usual (after creating the initrd and copying it and the new generic kernel to the EFI partition), I was greeted with a black screen without even a blinking cursor. The system seemed unresponsive: neither Ctrl+Alt+Delete nor REISUB worked. SSH still worked, however, so I was able to reinstall the 5.15.145 kernel and boot with it if needed. At first I thought that the Nvidia GPU was somehow being used as the primary one, but blacklisting it didn't change anything. After I added the nomodeset kernel parameter, I was able to log in to the system (no graphical session, of course) and inspect the dmesg output. It turned out the amdgpu driver was acting up:
Code:
...
amdgpu: HMM registered 2048MB device memory
[ 13.181724] amdgpu: Topology: Add APU node [0x15d8:0x1002]
[ 13.181732] kfd kfd: amdgpu: added device 1002:15d8
[ 13.181755] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
[ 13.181777] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_init failed
[ 13.181794] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
...
And then there were a bunch of errors and some call traces.
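For anyone checking whether they are hitting the same failure, here is a minimal sketch that narrows the kernel log down to amdgpu/kfd lines reporting a problem. The function name `filter_amdgpu_errors` is hypothetical, and the keyword list (fail, fatal, error) is just an assumption based on the messages quoted above; other failures may be worded differently.

```shell
#!/bin/sh
# Keep only kernel log lines that mention amdgpu or kfd AND contain
# a failure keyword. Reads the log on stdin, prints matches on stdout.
filter_amdgpu_errors() {
    grep -E 'amdgpu|kfd' | grep -iE 'fail|fatal|error'
}

# Usage on a live system (reading the kernel log may require root):
#   dmesg | filter_amdgpu_errors
```

If this prints the "Failed to resume IOMMU" line followed by "Fatal error during GPU init", it is likely the same regression described here.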
I managed to find the issue as well as a solution: https://lists.freedesktop.org/archiv...ne/109478.html
Basically, one of the patches that wasn't supposed to be in 5.15 was accidentally backported anyway, so the solution is to revert it using this diff (apply it with patch -R):
Code:
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 2023-12-23 12:42:00.000000000 +0300
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 2024-05-25 17:20:19.000000000 +0300
@@ -2486,10 +2487,6 @@
 	if (r)
 		goto init_failed;
 
-	r = amdgpu_amdkfd_resume_iommu(adev);
-	if (r)
-		goto init_failed;
-
 	r = amdgpu_device_ip_hw_init_phase1(adev);
 	if (r)
 		goto init_failed;
@@ -2528,6 +2525,10 @@
 	if (!adev->gmc.xgmi.pending_reset)
 		amdgpu_amdkfd_device_init(adev);
 
+	r = amdgpu_amdkfd_resume_iommu(adev);
+	if (r)
+		goto init_failed;
+
 	amdgpu_fru_get_product_info(adev);
 
 init_failed:
I did just that, recompiled the module using this command (so I wouldn't have to recompile everything, which takes ages):
Code:
make modules SUBDIRS=drivers/gpu/drm/amd/amdgpu
I then moved the resulting amdgpu.ko to its proper place and rebooted, and everything works again, so I don't have to downgrade back to the 5.15.145 kernel.
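The whole revert, rebuild and install sequence can be sketched as a small script. This is only an illustration under stated assumptions, not the exact procedure used above: the paths in `KSRC`, `PATCH_FILE` and `MODDIR` are hypothetical defaults, the `run` and `rebuild_amdgpu` helpers are made up for the example, and the build step uses kbuild's single-target form (`make path/to/module.ko`, as listed by `make help`) instead of the SUBDIRS variable. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Assumed locations; adjust all three to match your system.
KSRC=${KSRC:-/usr/src/linux}                 # configured kernel source tree
PATCH_FILE=${PATCH_FILE:-/tmp/amdgpu-iommu.diff}  # the diff saved to a file
MODDIR=${MODDIR:-/lib/modules/5.15.160/kernel/drivers/gpu/drm/amd/amdgpu}

# With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rebuild_amdgpu() {
    run cd "$KSRC" || return 1
    # -R applies the diff in reverse, i.e. it undoes the backported change.
    run patch -R -p1 -i "$PATCH_FILE" || return 1
    # Build just this one module rather than the whole module tree.
    run make drivers/gpu/drm/amd/amdgpu/amdgpu.ko || return 1
    # Overwrite the installed module with the rebuilt one.
    run cp drivers/gpu/drm/amd/amdgpu/amdgpu.ko "$MODDIR/"
}

# Usage: review the printed commands first, then run them for real as root:
#   rebuild_amdgpu
#   DRY_RUN=0 rebuild_amdgpu
```

Back up the original amdgpu.ko before overwriting it, and if your initrd includes amdgpu, it presumably needs to be regenerated as well.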
Just thought I'd share in case I'm not the only one. Hopefully, the next 5.15 kernel fixes the issue.