Webdock - One of our AMD Epyc hosts needs a reboot – Incident details

One of our AMD Epyc hosts needs a reboot

Resolved
Major outage
Started 10 days ago. Lasted about 6 hours.

Affected

Denmark: General Infrastructure

Partial outage from 8:49 AM to 2:51 PM

Updates
  • Resolved

    This incident has been resolved.

    Post-mortem: Dual NVMe Drive Failure on EPYC Host

    Here is our post-mortem for today's incident, which caused extended downtime for approximately 317 EPYC-based customer instances.


    Summary

    Earlier today, a single EPYC hypervisor experienced a storage failure following a planned administrative restart. The restart itself was routine and performed to address degraded disk I/O performance that had been observed over the preceding days.

    Following the reboot, the system failed to come back online due to the unexpected loss of two NVMe drives, which together formed a complete ZFS top-level mirror vdev. The simultaneous loss of both members of a mirror rendered the ZFS pool unimportable and resulted in extended downtime while recovery operations were performed.


    Timeline and Detection

    Prior to the restart, we performed standard pre-maintenance checks:

    • The ZFS storage pool reported as ONLINE

    • No critical ZFS alerts were present

    • No hardware warnings or failures were reported by Dell iDRAC / IPMI

    • There was a single historical ZFS write error recorded on one device, but this was not accompanied by device faulting, checksum storms, or pool degradation

    This type of isolated write error is something we occasionally observe across large fleets and, based on long operational experience, does not normally indicate imminent or catastrophic failure. The expectation was therefore to proceed with a controlled reboot, followed by a scrub if necessary.

    At the time of the restart, there were no predictive indicators from either the storage layer or the hardware management layer that suggested an elevated risk of failure.
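
    For context, the kind of checks listed above can be approximated with a short script along the following lines. This is a simplified sketch rather than our actual tooling, and the device paths are illustrative placeholders; it assumes a Linux host with the standard zpool and smartctl utilities installed.

        import subprocess

        def run(cmd):
            """Run a command, returning its exit code and trimmed stdout."""
            proc = subprocess.run(cmd, capture_output=True, text=True)
            return proc.returncode, proc.stdout.strip()

        def preflight_checks(nvme_devices=("/dev/nvme0", "/dev/nvme1")):
            # Pool health summary: "zpool status -x" prints
            # "all pools are healthy" when nothing is degraded or faulted.
            _, pool_health = run(["zpool", "status", "-x"])
            print("ZFS:", pool_health)

            # Per-device SMART health as reported by each NVMe controller.
            # The device paths above are illustrative placeholders.
            for dev in nvme_devices:
                code, out = run(["smartctl", "-H", dev])
                status = "PASSED" if code == 0 and "PASSED" in out else "NEEDS ATTENTION"
                print(f"{dev}: {status}")

        if __name__ == "__main__":
            preflight_checks()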


    Failure Event

    Upon reboot, the system failed to import its ZFS pool. Investigation revealed that two NVMe drives were no longer available to the system. These two drives together constituted an entire mirror vdev at the top level of the pool.

    In ZFS, the loss of a complete top-level vdev makes a pool unrecoverable by design, as data is striped across vdevs and cannot be reconstructed without at least one surviving replica.
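
    To make that rule concrete, here is a small illustrative model (plain Python, not ZFS code): a pool remains importable only while every top-level vdev still has at least one healthy member, so a mirror survives the loss of one disk but not both. The pool layout and device names below are hypothetical.

        # Illustrative model of the rule described above: a pool is importable
        # only if every top-level vdev still has at least one healthy member.
        # Device names and states are hypothetical.

        def pool_importable(vdevs):
            """vdevs: list of top-level vdevs, each a dict of {device: healthy?}."""
            return all(any(members.values()) for members in vdevs)

        # A hypothetical pool striped across two mirror vdevs:
        healthy_pool = [
            {"nvme0n1": True, "nvme1n1": True},    # mirror-0
            {"nvme2n1": True, "nvme3n1": True},    # mirror-1
        ]

        after_failure = [
            {"nvme0n1": True, "nvme1n1": True},    # mirror-0 intact
            {"nvme2n1": False, "nvme3n1": False},  # mirror-1: both members lost
        ]

        print(pool_importable(healthy_pool))   # True
        print(pool_importable(after_failure))  # False -> pool cannot be imported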

    This immediately escalated the incident from a routine maintenance task to a full host recovery operation.


    Hardware Investigation

    With assistance from on-site datacenter technicians, extensive hardware diagnostics were performed:

    • Drives were reseated and moved to known-good NVMe bays

    • Backplane, cabling, and PCIe connectivity were verified

    • BIOS and iDRAC inventory were reviewed

    • Power cycling and cold starts were attempted

    The results were conclusive:

    • One drive was no longer detected at all by the system

    • The second drive was detected but reported a capacity of 0 GB and failed initialization

    At this point, it was clear that both NVMe drives had suffered irreversible failure, likely at the controller or firmware level.
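
    As an illustration of the kind of check involved, the sketch below lists the NVMe namespaces the Linux kernel currently exposes along with their reported sizes; a drive that has dropped off the bus never appears at all, while one with a failed controller can show up with a capacity of zero. This is a simplified example, not our diagnostic tooling.

        from pathlib import Path

        def list_nvme_namespaces():
            """Report each NVMe namespace the kernel exposes and its reported size.

            A drive that is no longer detected never shows up here at all; a drive
            whose controller has failed may appear with a size of 0.
            """
            for dev in sorted(Path("/sys/block").glob("nvme*n*")):
                # /sys/block/<dev>/size is the device size in 512-byte sectors.
                sectors = int((dev / "size").read_text())
                size_gb = sectors * 512 / 1e9
                print(f"{dev.name}: {size_gb:.1f} GB")

        if __name__ == "__main__":
            list_nvme_namespaces()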


    Why This Was Exceptionally Unlikely

    This failure mode is statistically extreme:

    • Both drives were enterprise-grade NVMe devices

    • Both were members of a mirror specifically designed to tolerate single-device failure

    • There were no SMART, iDRAC, or ZFS indicators suggesting a pending fault

    • The failures occurred effectively simultaneously and only became fully visible after a reboot

    In many thousands of host-years of operation, we have not previously encountered a scenario where both members of a ZFS mirror failed in such close succession without advance warning.
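
    As a purely illustrative back-of-envelope estimate (the failure-rate figure below is an assumption made for the sake of the arithmetic, not measured fleet data), treating drive failures as independent events puts the chance of both members of one specific mirror failing within the same day in the one-in-billions range per mirror-day:

        # Back-of-envelope estimate only. The 0.5% annualized failure rate (AFR)
        # is an assumed, illustrative figure, not measured fleet data.
        AFR = 0.005                 # assumed annual failure rate per drive
        DAILY = AFR / 365           # rough chance a given drive fails on a given day

        # Chance that both members of one specific mirror fail, independently,
        # within the same one-day window:
        both_same_day = DAILY ** 2
        print(f"~1 in {1 / both_same_day:,.0f} mirror-days")   # roughly 1 in 5 billion

    In practice, correlated factors such as shared firmware, shared power events, or sustained identical workloads can make the real-world likelihood higher than an independence assumption suggests, which is part of why the preventive measures below look at firmware and power interactions despite the rarity of the event.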

    The absence of meaningful alerts meant that there was no operational signal that would normally justify preemptive action such as taking the host out of service prior to the reboot.


    Impact

    • Approximately 317 customer instances on the affected host experienced downtime

    • The host itself required full storage reinitialization

    • Customer instances were restored from backup snapshots via our Incus-based recovery infrastructure

    Because the incident occurred while the current daily backup cycle was still in progress, restore points varied:

    • Approximately 20% of instances were recovered from backups taken earlier the same morning

    • Approximately 80% of instances were recovered from the most recent completed weekly backup, taken the previous morning (CET)


    Recovery and Resolution

    Once it was clear that the local ZFS pool could not be recovered:

    • The affected storage pool was destroyed and recreated (see the sketch after this list)

    • The host was re-initialized cleanly

    • Customer instances were restored from the most recent available snapshots

    • All affected services were brought back online
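
    For readers curious what the "destroyed and recreated" step above amounts to at the storage layer, the general shape is sketched below. The pool name, device paths, and mirror layout are placeholders rather than the actual configuration of the affected host, and this is a simplified illustration rather than our provisioning automation.

        import subprocess

        # Simplified illustration of re-initializing storage as a stripe of mirror
        # vdevs after the old pool proved unimportable. Pool name, device paths,
        # and layout are placeholders, not the affected host's real configuration.
        POOL = "tank"
        MIRRORS = [("/dev/nvme0n1", "/dev/nvme1n1"),
                   ("/dev/nvme2n1", "/dev/nvme3n1")]

        def recreate_pool():
            # Equivalent to: zpool create -f tank mirror a b mirror c d
            # -f is needed because surviving devices may still carry labels
            # from the old, unimportable pool.
            cmd = ["zpool", "create", "-f", POOL]
            for a, b in MIRRORS:
                cmd += ["mirror", a, b]
            subprocess.run(cmd, check=True)

        if __name__ == "__main__":
            recreate_pool()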


    Lessons Learned and Preventive Measures

    Although this incident stemmed from an extremely improbable hardware failure, we are still taking concrete steps to reduce the blast radius of similar edge cases in the future:

    • More conservative handling and escalation of any ZFS device-level errors, even when isolated (see the monitoring sketch after this list)

    • Additional scrutiny around storage health prior to maintenance reboots on high-density hosts

    • Adjustments to maintenance timing relative to active backup windows

    • Review of power and firmware interactions specific to NVMe devices under sustained I/O load

    • Continued evaluation of pool layout and recovery strategies to further limit worst-case scenarios
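
    As an example of what the first measure above could translate into, the sketch below escalates on any non-zero ZFS device error counter, even when the pool as a whole still reports ONLINE. The parsing is deliberately simplified and the script is illustrative only; in practice this would feed an alerting pipeline rather than print to stdout.

        import subprocess

        def devices_with_errors():
            """Scan `zpool status` output and report any entry whose read/write/
            checksum error counters are non-zero, even if the pool is still ONLINE.
            Simplified parsing, illustrative only."""
            out = subprocess.run(["zpool", "status"],
                                 capture_output=True, text=True).stdout
            flagged = []
            for line in out.splitlines():
                fields = line.split()
                # Config lines look like: NAME  STATE  READ  WRITE  CKSUM
                if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
                    name, state = fields[0], fields[1]
                    read, write, cksum = (int(f) for f in fields[2:])
                    if read or write or cksum:
                        flagged.append((name, state, read, write, cksum))
            return flagged

        if __name__ == "__main__":
            for name, state, r, w, c in devices_with_errors():
                print(f"ESCALATE: {name} ({state}) read={r} write={w} cksum={c}")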


    Closing Notes

    This incident was not caused by a single mistake, misconfiguration, or ignored alert. It was the result of a rare and unfortunate convergence of hardware failures that only became fully apparent at reboot time.

    While ZFS behaved exactly as designed, refusing to import a pool whose integrity could not be proven, the lack of advance warning made the outcome both surprising and severe.

    We regret the disruption caused and appreciate the patience shown while recovery was underway. Incidents like this feed directly into improving our operational resilience and recovery procedures going forward.

  • Update

    Some bad news: unfortunately, our DC guys found a rare critical two-drive failure. We'll reload all the servers on the affected host with the latest snapshots we have. Our sincere apologies for this.

  • Monitoring

    There seems to be a problem with some cabling in the physical host. The DC guys are on it. Sorry for the inconvenience.

  • Identified

    The physical host needs a reboot. Instances running there will see 15-20 minutes of downtime.