5.15.160 kernel breaks amdgpu driver

fsLeg

from LinuxQuestions.org on 2024-06-08 15:46 (#6NCSQ)

I have a laptop with AMD integrated graphics and an Nvidia discrete GPU that runs Slackware 15. I use the amdgpu driver for graphics and Nvidia for 3D stuff like games.

A few days ago Pat released the 5.15.160 kernel, which fixed a whole bunch of vulnerabilities, including CVE-2024-1086 (the netfilter one) everyone was talking about, so today I finally upgraded the kernel. But when I rebooted as usual (after creating an initrd and copying it and the new generic kernel to the EFI partition) I was greeted with a black screen without even a blinking cursor. The system seemed unresponsive: neither Ctrl+Alt+Delete nor REISUB worked. SSH worked, however, so I was able to reinstall the 5.15.145 kernel and boot with it if needed. At first I thought that the Nvidia GPU was somehow being used as the primary one, but blacklisting it didn't do anything. After I added the nomodeset kernel parameter I was able to log in to the system (no graphical session, of course) and inspect the dmesg output. It turned out the amdgpu driver was acting up:

Code:
...
amdgpu: HMM registered 2048MB device memory
[ 13.181724] amdgpu: Topology: Add APU node [0x15d8:0x1002]
[ 13.181732] kfd kfd: amdgpu: added device 1002:15d8
[ 13.181755] kfd kfd: amdgpu: Failed to resume IOMMU for device 1002:15d8
[ 13.181777] amdgpu 0000:05:00.0: amdgpu: amdgpu_device_ip_init failed
[ 13.181794] amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
...

And then there were a bunch of errors and some call traces.
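If you want to check whether you are hitting the same failure, something like the following might help. It is only a rough sketch: the function name amdgpu_init_errors is made up for illustration, and the match patterns are simply based on the messages quoted above, so other failures may be worded differently.

```shell
#!/bin/sh
# Hypothetical helper (name is an assumption): keep only amdgpu/kfd lines
# that mention a failure. Reads kernel log text on stdin, so it works on
# live dmesg output or on a saved log file.
amdgpu_init_errors() {
    grep -E 'amdgpu|kfd' | grep -iE 'fail|fatal|error'
}

# Usage (dmesg may need root):
#   dmesg | amdgpu_init_errors
```

On the excerpt above this should leave just the "Failed to resume IOMMU", "amdgpu_device_ip_init failed" and "Fatal error during GPU init" lines.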

I managed to find the issue as well as a solution: https://lists.freedesktop.org/archiv...ne/109478.html

Basically, one of the patches that wasn't supposed to be in 5.15 was accidentally backported anyway, so the solution is to revert it. The diff below is the offending change, so apply it in reverse with patch -R:

Code:
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 2023-12-23 12:42:00.000000000 +0300
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 2024-05-25 17:20:19.000000000 +0300
@@ -2486,10 +2487,6 @@
 	if (r)
 		goto init_failed;
 
-	r = amdgpu_amdkfd_resume_iommu(adev);
-	if (r)
-		goto init_failed;
-
 	r = amdgpu_device_ip_hw_init_phase1(adev);
 	if (r)
 		goto init_failed;
@@ -2528,6 +2525,10 @@
 	if (!adev->gmc.xgmi.pending_reset)
 		amdgpu_amdkfd_device_init(adev);
 
+	r = amdgpu_amdkfd_resume_iommu(adev);
+	if (r)
+		goto init_failed;
+
 	amdgpu_fru_get_product_info(adev);
 
 init_failed:

I did just that, recompiled the module using this command (so I wouldn't have to recompile everything which takes ages):

Code:
make modules SUBDIRS=drivers/gpu/drm/amd/amdgpu

Then I moved the resulting amdgpu.ko to its proper place and rebooted, and everything works again, so I don't have to downgrade back to the 5.15.145 kernel.
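For anyone who wants the whole sequence in one place, here is a rough sketch of the steps above as a script. Everything path-related is an assumption rather than something from my actual setup: the source tree location (KSRC), the file the diff was saved to (PATCH_FILE), the version string (KVER) and the module install path under /lib/modules. The depmod call is an extra step I'm adding here as an assumption, and the run and revert_and_rebuild names are just illustrative. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# Sketch of the revert-and-rebuild steps. All paths are assumptions.
KSRC="${KSRC:-/usr/src/linux-5.15.160}"             # assumed kernel source tree
PATCH_FILE="${PATCH_FILE:-/tmp/amdgpu-iommu.diff}"  # assumed: the diff above saved to a file
KVER="${KVER:-5.15.160}"                            # assumed output of uname -r
MODDIR="drivers/gpu/drm/amd/amdgpu"
DRY_RUN="${DRY_RUN:-1}"                             # 1 = only print the commands

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

revert_and_rebuild() {
    run cd "$KSRC" || return 1
    # -R applies the diff in reverse, undoing the backported change.
    run patch -R -p1 -i "$PATCH_FILE" || return 1
    # Build command exactly as given in the post.
    run make modules SUBDIRS="$MODDIR" || return 1
    # Assumed install location for the rebuilt module.
    run cp "$MODDIR/amdgpu.ko" "/lib/modules/$KVER/kernel/$MODDIR/amdgpu.ko" || return 1
    # Refresh module dependency data (an added step, not from the post).
    run depmod -a "$KVER"
}
```

Calling revert_and_rebuild with the defaults prints five lines starting with "+", one per command, so you can check the paths before running anything for real. As far as I know, newer kbuild versions use M= rather than SUBDIRS= to limit a build to one directory, so treat the make line as something to verify against your tree. If amdgpu is included in your initrd, the initrd probably needs regenerating as well.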

Just thought I'd share in case I'm not the only one. Hopefully, the next 5.15 kernel fixes the issue.
