Graphics card suddenly causes boot crash with mce error
by MirceaKitsune from LinuxQuestions.org on (#5SNJA)
Something strange and unsettling happened to me today. I woke up to my screen no longer powering back on after moving the mouse, which is not an entirely unique occurrence. I restarted and was surprised to see that right before the login screen the monitor would power itself off, and this time I was unable to do a clean shutdown by pressing the power button. It soon became apparent that the computer would stay frozen for roughly a minute, then restart itself and repeat the cycle. After one restart I was able to catch the following error message in the console:
https://i.imgur.com/zNK01Vs.jpg
I realized it must be hardware related, since I hadn't installed any updates or changed the system configuration for over a week, and this didn't happen yesterday on the exact same system. To confirm it I booted a live image and got the exact same behavior there. I pulled out the memory modules and tried them in sets, disconnected all hard drives, tried two different screens (HDMI and DisplayPort cables), booted two kernels (5.14 and 5.15), tried radeon vs amdgpu, and reset the CMOS via the pins. In the end the only thing that worked was removing my video card and plugging in an older one.
What makes this extremely bizarre is that I get an image right up until Linux boots: I can enter the BIOS just fine and see GRUB, and there are no GPU freezes or graphical corruption. This seems to be entirely Linux detecting an error and freaking out over it. All the error messages are prefixed with "mce" and, oddly enough, reference a CPU issue. The rest of my hardware works just fine, so it's not the processor, thank god.
Does anyone know what could break in a video card that would make Linux do this? I saw a reference to a `mcelog` command for these errors, but like I said, the machine becomes completely inoperable after the message is printed, so I can't issue any commands. If you can suggest further tests I'll take a look, but please mention everything I could test up front, as I don't feel comfortable plugging and pulling the video card on my motherboard so often and risking breaking things (I've done it twice today). If this is a hardware issue that can't be solved from the kernel side, I have no choice but to spend a large sum of money I didn't want to spend. I figured I'd ask for help here first so I know I've tried everything else.