How to recover data from MDADM RAID HDDs experiencing Buffer I/O errors and Target errors
by EvanRC from LinuxQuestions.org (#5KTVM)
Not a Linux newbie, but new to these forums.
From March up until this July 2nd, the 4x 3.0 TB RAID5 array I had set up for my workplace was functioning fine. It ran on MDADM with four Seagate Barracuda drives (sdb, sdc, sdd and sde) under Ubuntu Server 20.04 Focal Fossa, kept well updated.
Recently, someone accidentally unplugged the power that it and the controlling server were hooked up to, despite the UPS being there specifically for that purpose (small company). I got the server up and running fine, but the MDADM RAID didn't fare well - it started simply with /dev/md0 not appearing.
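For context, the array was originally built along these lines (reconstructed from memory; chunk size and other options were left at mdadm's defaults):
Code:
# original creation, roughly - whole-disk members, no partition table
# (shown for context only; NOT something to re-run on the current disks)
sudo mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde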
When I ran mdadm --assemble --scan I got:
SDD and SDC returned four errors each (the sector numbers were very close together; their last digits are shown separated with slashes):
Code:
blk_update_request: critical target error, dev sd*, sector 25879390758(400/608/629/628) op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
SDB returned that for just sectors 258790758(400/608).
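For reference, this is roughly the sequence I was running to try to bring the array back up and check its state (device names as on my box):
Code:
# try to assemble the array from whatever superblocks mdadm can find on the members
sudo mdadm --assemble --scan
# check whether md0 (or anything else) actually appeared
cat /proc/mdstat
# if it did assemble, dump the array details
sudo mdadm --detail /dev/md0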
I had assumed, "oh, maybe we just have to reboot and check them with smartctl!" However, the same errors came back after the reboot. smartctl gave very limited information for /dev/sdb, which appeared partially broken compared with all the others. Notice that /dev/sde had no sector errors earlier? It came up with errors later (all of them did; see the bottom of this post), but I digress. For the other drives, smartctl gave full SMART Attribute/Test/Event tables as well as more Feature and Device information than it did for /dev/sdb. I have the dumps available if needed, but for ATA Security, SDB was 'Disabled, frozen' while the others were 'Disabled, NOT FROZEN'.
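For anyone who wants to compare against theirs, the dumps came from commands along these lines (read-only queries; on my smartmontools build the extended output is where the ATA Security line shows up):
Code:
# SMART attributes, self-test log and error log for each member
sudo smartctl -a /dev/sdb
sudo smartctl -a /dev/sdc
# extended output, which also prints the ATA Security state on my build
sudo smartctl -x /dev/sdb
# cross-check the frozen/not-frozen state straight from the drive identify data
sudo hdparm -I /dev/sdb | grep -A8 Security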
I continued trying to figure out the problem by sifting through dmesg. Whenever I ran the assemble scan, each sector error was accompanied by the following:
Code:
sd 4:0:0:3: [sd*] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 4:0:0:3: [sd*] tag#0 Sense Key : Illegal Request [current]
sd 4:0:0:3: [sd*] tag#0 Add. Sense: Logical block address out of range
sd 4:0:0:3: [sd*] tag#0 CDB: Read(16) 88 00 00 00 E6 E6 E6 E6 00 00 00 00 0* 00 00
I'm not versed in this well enough to decipher that, but it looks bad.
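The one part I think I can sanity-check is the 'Logical block address out of range' line: the failing sector numbers look far bigger than anything a 3.0 TB drive can hold, which a quick read-only query seems to confirm (the sector count below is approximate for these Barracudas):
Code:
# total number of 512-byte sectors the drive reports (read-only query)
sudo blockdev --getsz /dev/sdb
# a 3.0 TB drive reports roughly 5,860,533,168 sectors, so a request for
# sector 2,589,390,758,400 is several hundred times past the end of the disk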
Anyway, when I ran sudo debugfs /dev/sd* I got very similar messages, which were basically identical across the drives.
Code:
debugfs: Bad magic number in super-block while trying to open /dev/sd*
blk_update_request: critical target error, dev sd*, sector 2589390758400 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
blk_update_request: critical target error, dev sd*, sector 2589390758400 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev sd*, logical block 126939695379200, async page read
blk_update_request: critical target error, dev sd*, sector 2589390758402 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Buffer I/O error on dev sd*, logical block 126939695379201, async page read
Buffer I/O error on dev sd*, logical block 126939695379202, async page read
Buffer I/O error on dev sd*, logical block 126939695379203, async page read
I looked around on the forum here about the I/O errors and on Ubuntu Forums about the superblock error. The first of the two said to look in smartctl for 'Commands leading to the command that caused the error,' but smartctl coughed up nothing of that sort. The latter first suggested running sudo fdisk -l /dev/sd*, but oddly that showed no Disk model for the drives. Since the whole group of disks was in use for the RAID, the only "partition" was virtual, which may explain some of the debugfs output. I couldn't use mdadm -E /dev/md0 since the virtual device no longer existed, and using it on any of the disks gave the four sector errors previously described, but now for all of them, including SDE and SDB.
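For completeness, these are the read-only checks I've been leaning on so far (mdadm --examine just reads the md superblock from each member; it doesn't write anything):
Code:
# how the kernel currently sees the disks and any partitions on them
lsblk -o NAME,SIZE,TYPE,FSTYPE,MOUNTPOINT
# any filesystem or raid-member signatures that still survive
sudo blkid /dev/sdb /dev/sdc /dev/sdd /dev/sde
# the md superblock on each member - this is where the sector errors come back
sudo mdadm --examine /dev/sdb
sudo mdadm --examine /dev/sdc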
Following a few other forums, I tried a few other miscellaneous things with similar results. So, this leads me to the big question(s) - can I recover anything from the RAID? Are the drives shot from the power loss, or do they just need some special repair tool? Or is it time to cut my losses, do a complete wipe of them (the RAID held some complex but unused code, as well as old backups), and start anew?
I'm already close to being in over my head here, despite having some confidence in my CLI and Disk Management abilities. Any help would be greatly appreciated.