Help chasing an I/O error, bad cable, bad controller, bad drive?

systemloc

from LinuxQuestions.org on 2024-03-02 16:21 (#6K1VN)

I'm running an x86 PC with Slackware 14.2 (4.19.139 huge/custom), recently updated to 15 (5.15.149 huge/custom). I have an LSI SAS2308 (rev 05), FW 16.00.00.00-IT, BIOS MPT2BIOS-7.31.00.00, and an IBM SAS expander 46M0997, previously on FW version 605, now 634A. I run fifteen 3-4 TB SATA drives, all HGST Ultrastar or Seagate IronWolf. The OS is on a separate drive connected to the built-in motherboard SATA controller.

My drives are configured in RAID 6, except the OS drive, which is kept separate from the RAID.

Problem:

I've been tracking a rare, intermittent but persistent fault where the RAID would lose 1-3 drives in short order after working well for days. Only the Seagate drives were affected, and no errors were recorded in their SMART logs. I could reproduce the fault by stopping the RAID and running a simple 'dd if=/dev/(drive) of=/dev/null' for a cycle or two. At that time, the IBM expander was running the old 605(?) FW. Four drives were connected directly to the LSI card, and the rest, including all of the Seagate drives, were connected to the IBM expander, which was in turn connected to the LSI card.

For diagnosis, I moved all the Seagate drives over to the LSI card directly, ran the dd test, and got no errors. Researching, I found that the old IBM FW tended to have drive incompatibilities, so I updated it to the current version. (Thanks Art Of Server dude!) I then hooked the Seagate drives back up to the IBM card, ran a few cycles of my dd test, and again got no errors.

I also noticed I was only using one uplink port on the IBM card, so I recabled to connect both uplinks to the LSI and put all the RAID disks on the IBM card. Note that all of the Seagate drives always remained on the same 'SFF-8087 to 4x SATA/SAS' splitter cable. As an aside, I did some speed testing with one vs two uplink cables and did not notice a difference. Are there two ports to allow redundant HBAs? Hmm..
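
The dd reproduction described above could also be approximated with a small script that records which offsets fail, so that failing locations can be compared between runs. This is only a sketch under assumptions: `scan_reads`, `chunk_size`, and `max_errors` are hypothetical names, the device path is a placeholder, and like plain dd it reads through the page cache rather than with direct I/O.

```python
import os


def scan_reads(path, chunk_size=1024 * 1024, max_errors=100):
    """Read `path` front to back; return (bytes_read, [(offset, errno), ...]).

    A failed chunk is recorded and skipped instead of aborting the scan.
    The scan stops early after `max_errors` failures, e.g. when the drive
    has dropped off the bus and every remaining read would fail.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # Seeking to the end reports the size for regular files and block devices.
        size = os.lseek(fd, 0, os.SEEK_END)
        offset, total, failed = 0, 0, []
        while offset < size and len(failed) < max_errors:
            want = min(chunk_size, size - offset)
            try:
                data = os.pread(fd, want, offset)
            except OSError as exc:
                failed.append((offset, exc.errno))
                offset += want  # skip the chunk that could not be read
                continue
            if not data:
                break  # unexpected end of device
            total += len(data)
            offset += len(data)
        return total, failed
    finally:
        os.close(fd)
```

Calling `scan_reads("/dev/sdX")` as root (with `sdX` replaced by the drive under test) over several cycles and comparing the returned offsets would show whether the same locations fail each time.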

At this point, I also updated to Slackware 15 and updated the kernel. This box had sat disused for a while, and 15 came out in the interim. :)

This is the error I was getting with the incompatible FW on the IBM expander:
Note that I was receiving similar errors from multiple Seagate drives, and none of them had any errors logged in SMART.

Code:
[ 1291.975114] blk_update_request: I/O error, dev sdi, sector 15006224 op 0x0:(READ) flags 0x80700 phys_seg 26 prio class 0
[ 1291.975141] blk_update_request: I/O error, dev sdi, sector 15002128 op 0x0:(READ) flags 0x80700 phys_seg 15 prio class 0
[ 1291.975151] blk_update_request: I/O error, dev sdi, sector 14999568 op 0x0:(READ) flags 0x84700 phys_seg 64 prio class 0
[ 1291.975250] blk_update_request: I/O error, dev sdi, sector 15003664 op 0x0:(READ) flags 0x84700 phys_seg 46 prio class 0
[ 1291.975340] blk_update_request: I/O error, dev sdi, sector 14999568 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 1291.975348] Buffer I/O error on dev sdi, logical block 1874946, async page read
[ 1291.975391] blk_update_request: I/O error, dev sdi, sector 14999568 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ 1291.975413] Buffer I/O error on dev sdi, logical block 1874946, async page read
[ 1292.267222] sd 0:0:8:0: [sdi] Synchronizing SCSI cache
[ 1292.267303] sd 0:0:8:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
[ 1292.270904] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x5005076028e30891)
[ 1292.270923] mpt2sas_cm0: removing handle(0x0012), sas_addr(0x5005076028e30891)
[ 1292.270927] mpt2sas_cm0: enclosure logical id(0x5005076028e30880), slot(255)
[ 1297.219020] mpt2sas_cm0: handle(0x12) sas_address(0x5005076028e30891) port_type(0x1)
[ 1298.237602] scsi 0:0:16:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 6
[ 1298.237636] scsi 0:0:16:0: SATA: handle(0x0012), sas_addr(0x5005076028e30891), phy(16), device_name(0x0000000000000000)
[ 1298.237639] scsi 0:0:16:0: enclosure logical id (0x5005076028e30880), slot(255)
[ 1298.237733] scsi 0:0:16:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[ 1298.237739] scsi 0:0:16:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[ 1298.243439] sd 0:0:16:0: Power-on or device reset occurred
[ 1298.243585] sd 0:0:16:0: Attached scsi generic sg8 type 0
[ 1298.244273] end_device-0:0:16: add: handle(0x0012), sas_addr(0x5005076028e30891)
[ 1298.247613] sd 0:0:16:0: [sdi] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[ 1298.247648] sd 0:0:16:0: [sdi] 4096-byte physical blocks
[ 1298.277722] sd 0:0:16:0: [sdi] Write Protect is off
[ 1298.277734] sd 0:0:16:0: [sdi] Mode Sense: 7f 00 10 08
[ 1298.279513] sd 0:0:16:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
[ 1298.391582] sd 0:0:16:0: [sdi] Attached SCSI disk
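
As a rough aid for reading the `flags` field in the blk_update_request lines above, the sketch below decodes it into flag names. It assumes that the printed value is the request's cmd_flags with the low 8 op bits masked off, and that the bit order matches `enum req_flag_bits` in include/linux/blk_types.h for a 5.15 kernel; both should be checked against the source of the running kernel. `decode_flags` and `parse_io_error` are hypothetical helper names.

```python
import re

REQ_OP_BITS = 8  # assumption: the low 8 bits of cmd_flags hold the op code

# Assumption: bit order of enum req_flag_bits in a 5.15 kernel,
# starting at bit REQ_OP_BITS.
FLAG_NAMES = [
    "FAILFAST_DEV", "FAILFAST_TRANSPORT", "FAILFAST_DRIVER", "SYNC",
    "META", "PRIO", "NOMERGE", "IDLE", "INTEGRITY", "FUA", "PREFLUSH",
    "RAHEAD", "BACKGROUND", "NOWAIT",
]

LINE_RE = re.compile(
    r"I/O error, dev (?P<dev>\w+), sector (?P<sector>\d+) "
    r"op (?P<op>0x[0-9a-f]+):\((?P<opname>\w+)\) flags (?P<flags>0x[0-9a-f]+)"
)


def decode_flags(flags):
    """Return the names of the flag bits set in `flags`, lowest bit first."""
    return [
        name
        for i, name in enumerate(FLAG_NAMES)
        if flags & (1 << (REQ_OP_BITS + i))
    ]


def parse_io_error(line):
    """Parse one blk_update_request line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "dev": m.group("dev"),
        "sector": int(m.group("sector")),
        "op": m.group("opname"),
        "flags": decode_flags(int(m.group("flags"), 16)),
    }
```

Under those assumptions, 0x80700 decodes to the three FAILFAST bits plus RAHEAD, 0x84700 adds NOMERGE, and 0x0 has no flags set. That would be consistent with readahead requests failing first without retries, followed by ordinary single-page reads of the same sector that also fail.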

I proceeded to dd my drives and then mdadm --repair the array through several cycles over days. Eventually the same kind of errors began to recur, now coming from only one of the Seagate drives. Again, though, no errors were logged in SMART. I ran a 'smartctl -t long' test on the drive, and it completed without error. Over several repeats of the dd test on the drive, the sectors failing the read were not consistent.

Here's the current error:
Code:
[37855.651421] blk_update_request: I/O error, dev sdi, sector 471946240 op 0x0:(READ) flags 0x80700 phys_seg 38 prio class 0
[37855.651549] blk_update_request: I/O error, dev sdi, sector 471944192 op 0x0:(READ) flags 0x84700 phys_seg 128 prio class 0
[37855.651589] blk_update_request: I/O error, dev sdi, sector 471946208 op 0x0:(READ) flags 0x80700 phys_seg 2 prio class 0
[37855.653050] blk_update_request: I/O error, dev sdi, sector 471944192 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[37855.653073] Buffer I/O error on dev sdi, logical block 58993024, async page read
[37855.653133] blk_update_request: I/O error, dev sdi, sector 471944192 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[37855.653138] Buffer I/O error on dev sdi, logical block 58993024, async page read
[37855.915477] sd 0:0:16:0: [sdi] Synchronizing SCSI cache
[37855.915627] sd 0:0:16:0: [sdi] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=DRIVER_OK
[37855.918520] mpt2sas_cm0: mpt3sas_transport_port_remove: removed: sas_addr(0x5005076028e30891)
[37855.918533] mpt2sas_cm0: removing handle(0x0012), sas_addr(0x5005076028e30891)
[37855.918538] mpt2sas_cm0: enclosure logical id(0x5005076028e30880), slot(255)
[37860.143948] mpt2sas_cm0: handle(0x12) sas_address(0x5005076028e30891) port_type(0x1)
[37861.162416] scsi 0:0:17:0: Direct-Access ATA ST4000VN008-2DR1 SC60 PQ: 0 ANSI: 6
[37861.162466] scsi 0:0:17:0: SATA: handle(0x0012), sas_addr(0x5005076028e30891), phy(16), device_name(0x0000000000000000)
[37861.162472] scsi 0:0:17:0: enclosure logical id (0x5005076028e30880), slot(255)
[37861.162657] scsi 0:0:17:0: atapi(n), ncq(y), asyn_notify(n), smart(y), fua(y), sw_preserve(y)
[37861.162695] scsi 0:0:17:0: qdepth(32), tagged(1), scsi_level(7), cmd_que(1)
[37861.168247] sd 0:0:17:0: Attached scsi generic sg8 type 0
[37861.168650] sd 0:0:17:0: Power-on or device reset occurred
[37861.169260] end_device-0:0:17: add: handle(0x0012), sas_addr(0x5005076028e30891)
[37861.173216] sd 0:0:17:0: [sdi] 7814037168 512-byte logical blocks: (4.00 TB/3.64 TiB)
[37861.173226] sd 0:0:17:0: [sdi] 4096-byte physical blocks
[37861.204349] sd 0:0:17:0: [sdi] Write Protect is off
[37861.204402] sd 0:0:17:0: [sdi] Mode Sense: 7f 00 10 08
[37861.206162] sd 0:0:17:0: [sdi] Write cache: enabled, read cache: enabled, supports DPO and FUA
[37861.318143] sd 0:0:17:0: [sdi] Attached SCSI disk
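
The sector numbers and the "logical block" numbers in these logs do appear to line up once the units are taken into account. A minimal sketch of the arithmetic, assuming the blk_update_request sector is counted in 512-byte units and the Buffer I/O error logical block in 4096-byte units (the 4096 figure is an assumption that happens to fit the numbers in both logs); the function names are hypothetical:

```python
SECTOR_SIZE = 512   # bytes per sector as reported by blk_update_request
BLOCK_SIZE = 4096   # assumption: buffer-layer block size for this device


def sector_to_block(sector, block_size=BLOCK_SIZE):
    """Logical block number that contains the given 512-byte sector."""
    return sector * SECTOR_SIZE // block_size


def block_to_sectors(block, block_size=BLOCK_SIZE):
    """Range of 512-byte sectors covered by one logical block."""
    per_block = block_size // SECTOR_SIZE
    first = block * per_block
    return range(first, first + per_block)


# Values taken from the two logs in this post.
print(sector_to_block(14999568))   # first log
print(sector_to_block(471944192))  # second log
```

With these assumptions, sector 14999568 maps to logical block 1874946 and sector 471944192 maps to logical block 58993024, which are the block numbers reported next to them.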

Here is the (abridged) smartctl output of the drive in question:

Code:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.149] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST4000VN008-2DR166
Serial Number: ZGY8WY6K
LU WWN Device Id: 5 000c50 0c8bcc814
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5980 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Sat Mar 2 11:03:23 2024 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       78372864
  3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       41
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   091   060   045    Pre-fail  Always       -       1197074970
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       5685 (148 154 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       30
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   073   068   040    Old_age   Always       -       27 (Min/Max 24/27)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       22
193 Load_Cycle_Count        0x0032   084   084   000    Old_age   Always       -       33686
194 Temperature_Celsius     0x0022   027   040   000    Old_age   Always       -       27 (0 18 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       4609 (67 78 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       14865683719
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       45013795835

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5675         -
# 2  Extended offline    Completed without error       00%      3703         -
# 3  Extended offline    Completed without error       00%       626         -
# 4  Extended offline    Completed without error       00%       105         -
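
To compare this output across the Seagate drives, the attribute table can be reduced to the handful of counters that usually move when a disk or its SATA link misbehaves. The sketch below is an assumption-laden example: the choice of attributes in `WATCH` is a common rule of thumb rather than anything from the smartctl documentation, the parser only handles the whitespace-separated layout shown above, and `parse_smart_attributes` and `nonzero_watch` are hypothetical names.

```python
# Assumption: attributes often watched for media or link trouble.
WATCH = {
    5: "Reallocated_Sector_Ct",
    187: "Reported_Uncorrect",
    188: "Command_Timeout",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def parse_smart_attributes(text):
    """Parse 'smartctl -A' style rows into {id: {name, value, worst, thresh, raw}}."""
    attrs = {}
    for line in text.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and carry a 0x.... flag column.
        if len(parts) < 10 or not parts[0].isdigit() or not parts[2].startswith("0x"):
            continue
        raw = parts[9]  # first token of RAW_VALUE; extras like "(Min/Max ...)" are ignored
        attrs[int(parts[0])] = {
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": int(raw) if raw.isdigit() else raw,
        }
    return attrs


def nonzero_watch(attrs):
    """Names of watched attributes whose raw value is present and not zero."""
    return [
        name
        for attr_id, name in WATCH.items()
        if attr_id in attrs and attrs[attr_id]["raw"] != 0
    ]
```

For the drive above, every watched counter is zero, including UDMA_CRC_Error_Count, which is the one commonly expected to rise when the SATA link between drive and controller corrupts data. That is suggestive rather than conclusive, since the drive can only count errors it actually sees.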

Troubleshooting:

I strongly suspect the drive is just fine, as SMART doesn't record any of these read faults. I suspect either an ongoing incompatibility with the IBM expander or a bad cable. Of note, a single cable connected all of the Seagate drives through all of this testing, so it is the one component common to every fault. I have now replaced that cable and am testing with the Seagate drives plugged into the expander card. If I get errors, I will try again with the Seagate drives on the LSI card directly.

Questions:

What does this error actually mean? What are the 'flags' in the error, and why do the sector numbers not appear to match the logical block number given?

What device is actually generating the error? The drive, the expander, the HBA, or the PC/kernel?

Could this error be caused by a bad cable, a bad expander/HBA, a firmware incompatibility, or a bad disk?

Does the fact that the errors don't correlate with any recorded fault in the SMART data make a bad disk unlikely?
