WD Red SA500 Issues with LSI 9300-8i controller?

I’m running Debian (Devuan) Linux with kernel 4.9 and ZFS 0.8.2. On Saturday I installed six SA500 2.5" 2TB drives in a raidz2. I have a SuperMicro chassis with a SAS3 backplane connected to an LSI 9300-8i controller.

Drives appear to disconnect randomly from the LSI 9300-8i SAS3 controller during normal operation. A disconnect can be forced quickly by running “zpool trim rpool”: the SA500 drives will disconnect from the backplane within minutes. Only by limiting the trim to 100MB/s with “zpool trim -r 104857600 rpool” will the trim finish.
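For reference, these are roughly the commands involved (pool name as above; the rate is in bytes per second, so 104857600 is 100MB/s):

# Unthrottled manual TRIM; this is what triggers the disconnects.
zpool trim rpool

# Rate-limited TRIM at 100MB/s; the only way it completes here.
zpool trim -r 104857600 rpool

# The -t flag adds per-vdev TRIM progress to the status output.
zpool status -t rpool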

So far in the last three days, the SA500 drives have disconnected more than two dozen times and have left my ZFS pool in a degraded state multiple times. I have to keep putting the drives back online after ZFS kicks them out due to errors.
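For anyone hitting the same thing, “putting the drive back online” is just the usual ZFS recovery dance; the device name below is a placeholder for whichever member got faulted:

# See which member is FAULTED or OFFLINE.
zpool status rpool

# Bring that device back into the pool (sdd is just an example name).
zpool online rpool sdd

# Clear the error counters once the pool has recovered.
zpool clear rpool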

Something is seriously wrong with the SA500 drives. I don’t have issues with SAS SSDs or other SATA SSDs such as the Crucial MX500, Samsung EVO SSDs, etc. Western Digital must have known NAS drives would be going into ZFS systems behind LSI controllers.

I don’t know if this comment will help you, or not.

We just purchased 2 x 500GB Red 2.5" SSDs and 1 x 1TB Red 2.5" SSD.

The 1TB SSD was initially wired to an old SiI3132 eSATA controller,
and during the first STARTUP I could tell the controller was not happy
with that WDC Red SSD. I knew that an older WDC Black HDD worked
AOK with that SiI3132 eSATA controller, so I swapped the cables.
Now, the new 1TB Red SSD is wired to a 3Gb/s SATA port integrated
on the motherboard, and it’s working fine. Fortunately.

We also had some problems configuring 2 x 500GB Red SSDs
in a RAID-0 array controlled by a Highpoint 2720SGL AIC controller card.
Even though the Option ROM did initialize both and created a RAID array,
after STARTUP the Operating System did not detect that array,
and the Highpoint GUI (Management Console) reported 2 error events.

After restarting, those error events did not repeat, and
I was able to format that RAID array with Windows Disk Management.
However, a different problem then arose when I tried to ALIGN
that RAID array, using Partition Wizard: that program appeared
to hang on that step.

So, I deleted that partition, recreated a new partition, and
without writing any data to that partition, I ran Partition Wizard
again to ALIGN that partition. Fortunately, that sequence
worked AOK, and after that I was able to load our database
onto that RAID-0 array. After another day of routine testing,
e.g. with a custom Batch file that launches CHKDSK
in Windows Command Prompt, those 2 x Red SSDs
appear to have settled down.

The good news is that on the same PCs, we have had
exactly ZERO problems enabling similar configurations
using WDC Blue 2.5" SSDs:
those SSDs have been trouble-free, even after
“migrating OS” to those SSDs using that feature
in the licensed version of Partition Wizard.

So, consider returning your Red SSDs for the Blue SSDs.

I upgraded the LSI 9300-8i BIOS and firmware to the latest version, 16.00.10.00, which was released on October 18th, 2019, and the drives were not kicked out and I saw no signs of errors. The Broadcom package contains firmware version 16.00.10.00 and BIOS version 8.37.00.00. The drives became pretty stable at that point. There was still an occasional issue with a drive being kicked out once a month. Upgrading the kernel to the 5.x series solved that, and the drives have been completely stable ever since. So the six SA500s are doing their job.
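In case it helps anyone else, the flash itself was done with Broadcom’s sas3flash utility, roughly as below; the exact file names depend on the package you download, so treat these as examples and verify them against the archive:

# Show the adapter and its current firmware/BIOS versions.
sas3flash -list

# Flash the IT-mode firmware and BIOS from the 16.00.10.00 package
# (file names are examples; check the names inside the download).
sas3flash -o -f SAS9300_8i_IT.bin -b mptsas3.rom

# Re-list to confirm FW 16.00.10.00 / BIOS 8.37.00.00 took effect.
sas3flash -list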

I don’t think the Blue drives offer the DRAT and RZAT support that the LSI controller needs in order for TRIM to work. DRAT and RZAT are mostly found only in enterprise drives; very few consumer drives have these features.
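If you want to check a given drive, hdparm will show whether it advertises them; /dev/sdX is a placeholder and the output wording varies a bit between drives:

# Look for the TRIM-related capability lines in the ATA identify data.
hdparm -I /dev/sdX | grep -i trim

# A drive with DRAT/RZAT reports lines along the lines of:
#   Data Set Management TRIM supported (limit ...)
#   Deterministic read ZEROs after TRIM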

Thanks for the detailed reply. Hopefully it will help someone.

Many thanks for that clarification. I completely forgot about TRIM,
and that surely is an important feature.

I got lulled by our prior success with the WDC 2.5" Blue SSDs,
and ours is certainly not a 24/7 network.

a drive being kicked out once a month

I would still consider that UNacceptable.

I agree that any drive being kicked out that has not failed is unacceptable. Finding that the new 5.x kernels solved that problem was a relief. Drives have been rock solid since then. Not a single drop out and performance is excellent.

Drives have been rock solid since then.
Not a single drop out and performance is excellent.

Great news! Thanks.

Thinking back, I remember helping Users who configured RAID arrays
with WD Black HDDs, and certain controllers were dropping one or more
of those HDDs. Western Digital got involved and explained that
the Black series of HDDs did not support TLER (time-limited error recovery),
whereas other RAID-compatible WD HDDs did support TLER.

At a minimum, that information helped lots of WD customers
realize that there was a rational explanation for the dropped drives,
not a manufacturing defect. Some controllers would “poll”
WD’s Black HDDs while those drives were doing normal
error checking, and the drives failed to ACK those polls.
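For what it’s worth, that setting is exposed these days as SCT Error Recovery Control, and it can be queried (and, on drives that allow it, set) with smartctl; the 7-second values below are just the commonly cited RAID-friendly timeout, not a recommendation:

# Query the current SCT ERC (TLER) read/write timers.
smartctl -l scterc /dev/sdX

# Set both timers to 7.0 seconds (values are tenths of a second);
# drives without TLER support will simply reject this.
smartctl -l scterc,70,70 /dev/sdX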

In your situation, I can see how more modern and complex controllers
plus support for TRIM instructions and all modern RAID modes
combine to increase the chances of intermittent failures.

Can you point us to any detailed documentation that
explains what features make these Red SSDs more
durable for 24/7 uses? e.g. something more detailed
than an advertising brochure?

Thanks again!

I found this product brief at WD’s website:

Other than WD’s own docs and marketing materials, I’m afraid I don’t. The SA500 is marketed against Seagate’s NAS line, though the Seagates have much longer endurance and a price to go along with it. I would consider the SA500 an entry-level NAS SSD. It does support DRAT and RZAT, which is a must when using LSI SAS3 HBAs. Just hooking an SSD to SATA or NVMe works fine for ZFS; it’s when going through an HBA that you run into issues.
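A quick sanity check that TRIM is actually making it through the HBA path is to look at the discard parameters the kernel exposes; the device name is a placeholder:

# Non-zero DISC-GRAN / DISC-MAX means the kernel can issue discards
# (TRIM) to this device through whatever controller it sits behind.
lsblk --discard /dev/sdX

# ZFS 0.8+ can also trim automatically as blocks are freed.
zpool set autotrim=on rpool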

I think as long as the SSD supports DRAT and RZAT, and its endurance is good, it should work for most workloads. Having a larger SLC cache helps with larger writes. The benchmarks I saw on the SA500s showed almost no slowdown when filling the entire drive, while most consumer SSDs will go along very fast until they run out of SLC cache and then performance drops to the TLC/QLC write speed.
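The behaviour I mean shows up with any long sequential write that outruns the SLC cache; something like the following fio run (against a scratch file on the SSD under test, not a raw device) makes the drop-off obvious on most consumer drives. The path and size are placeholders:

# 1MiB sequential writes, sized well past a typical SLC cache; watch
# the bandwidth fall once the cache fills on a cache-limited drive.
fio --name=slc-fill --filename=/mnt/testssd/fill.bin --rw=write \
    --bs=1M --size=200G --direct=1 --ioengine=libaio --iodepth=8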

Speaking of HBAs, we had a lot of success recently experimenting
with the PCIe 3.0 slot in a refurbished HP Z220 MT workstation:
we buy these because they are so cheap, and after discounting
the retail cost of Windows 10 Pro x64, the hardware is almost free.

Because Highpoint finally released a bootable “4x4” add-in card,
we tested 1 x Highpoint SSD7103 and 4 x 250GB M.2 NVMe SSDs.

One detail we already knew about was the requirement to switch
the motherboard BIOS to UEFI mode, which only required 10 seconds.

In RAID-0 mode, CDM measured READs at 11,697 MB/second.

Everything worked perfectly the first time, including “migrate OS”
using a licensed copy of Partition Wizard for Windows 10 x64.

That SSD7103 should work just fine with 4 x WDC M.2 NVMe SSDs,
and I see that there are SSD7103 drivers for several versions of Linux.

Highpoint has recently announced PCIe Gen4 versions of these AICs.
The SSD7540 has 8 x M.2 NVMe PCIe 4.0 ports = 64TB total storage capacity
on a single HBA. Pretty amazing!

Their hardware is very reliable, but in the (distant) past their documentation
was poor, requiring too much trial-and-error. I don’t know if that
situation has improved, but I suspect it has improved in response
to customer feedback.