Affected: partial outage from 8:49 AM to 2:51 PM
- Resolved
This incident has been resolved.
Post-mortem: Dual NVMe Drive Failure on EPYC Host
Here is our post-mortem for the incident today, which caused extended downtime for approximately 317 EPYC-based customer instances.
Summary
Earlier today, a single EPYC hypervisor experienced a storage failure following a planned administrative restart. The restart itself was routine and performed to address degraded disk I/O performance that had been observed over the preceding days.
Following the reboot, the system failed to come back online due to the unexpected loss of two NVMe drives, which together formed a complete ZFS top-level mirror vdev. The simultaneous loss of both members of a mirror rendered the ZFS pool unimportable and resulted in extended downtime while recovery operations were performed.
Timeline and Detection
Prior to the restart, we performed standard pre-maintenance checks:
- The ZFS storage pool reported as ONLINE
- No critical ZFS alerts were present
- No hardware warnings or failures were reported by Dell iDRAC / IPMI
- There was a single historical ZFS write error recorded on one device, but this was not accompanied by device faulting, checksum storms, or pool degradation
This type of isolated write error is something we occasionally observe across large fleets and, based on long operational experience, does not normally indicate imminent or catastrophic failure. The expectation was therefore to proceed with a controlled reboot, followed by a scrub if necessary.
At the time of the restart, there were no predictive indicators from either the storage layer or the hardware management layer that suggested an elevated risk of failure.
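For illustration, the sketch below shows the shape of such a pre-maintenance check. It assumes OpenZFS's `zpool` and smartmontools' `smartctl` are installed; the pool handling, device paths, and pass/fail logic are placeholders, not the exact tooling used on our fleet.

```python
#!/usr/bin/env python3
"""Minimal pre-maintenance storage health check (illustrative sketch only)."""
import subprocess
import sys

def run(cmd):
    """Run a command and return (exit_code, stdout) without raising."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout

def main():
    problems = []

    # `zpool status -x` prints "all pools are healthy" when no pool is degraded.
    rc, out = run(["zpool", "status", "-x"])
    if rc != 0 or "all pools are healthy" not in out:
        problems.append("zpool reports a degraded or faulted pool:\n" + out)

    # SMART overall-health self-assessment for each NVMe namespace of interest.
    # The device list is a placeholder; enumerate the real devices in production.
    for dev in ["/dev/nvme0n1", "/dev/nvme1n1"]:
        rc, out = run(["smartctl", "-H", dev])
        if rc != 0 or "PASSED" not in out:
            problems.append(f"SMART health check failed for {dev}:\n" + out)

    if problems:
        print("PRE-MAINTENANCE CHECK FAILED:")
        print("\n".join(problems))
        sys.exit(1)
    print("Pre-maintenance storage checks passed.")

if __name__ == "__main__":
    main()
```

As the timeline above notes, a check of this kind would still have passed before the reboot: a single historical write error with no faulted device does not trip either the pool-level or SMART-level signals.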
Failure Event
Upon reboot, the system failed to import its ZFS pool. Investigation revealed that two NVMe drives were no longer available to the system. These two drives together constituted an entire mirror vdev at the top level of the pool.
In ZFS, the loss of a complete top-level vdev makes a pool unrecoverable by design: data is striped across all top-level vdevs, and the portion stored on the lost vdev cannot be reconstructed unless at least one of its mirror members survives.
This immediately escalated the incident from a routine maintenance task to a full host recovery operation.
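To make the redundancy rule concrete, here is a toy model (a conceptual sketch in Python, not ZFS code): a pool remains importable only while every top-level vdev keeps at least one healthy member.

```python
# Conceptual sketch (not ZFS source code): a pool built from mirror vdevs
# tolerates drive loss only as long as every top-level vdev retains at least
# one healthy member, because each vdev holds a unique stripe of the data.

def pool_importable(vdevs: list[list[bool]]) -> bool:
    """Each inner list holds the health of one mirror vdev's member drives."""
    return all(any(members) for members in vdevs)

# Two mirror vdevs, one drive lost: the pool survives.
print(pool_importable([[True, False], [True, True]]))   # True

# Both members of a single mirror vdev lost (this incident): the pool is gone,
# even though the other vdev is fully intact.
print(pool_importable([[False, False], [True, True]]))  # False
```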
Hardware Investigation
With assistance from on-site datacenter technicians, extensive hardware diagnostics were performed:
- Drives were reseated and moved to known-good NVMe bays
- Backplane, cabling, and PCIe connectivity were verified
- BIOS and iDRAC inventory were reviewed
- Power cycling and cold starts were attempted
The results were conclusive:
- One drive was no longer detected at all by the system
- The second drive was detected but reported a capacity of 0 GB and failed initialization
At this point, it was clear that both NVMe drives had suffered irreversible failure, likely at the controller or firmware level.
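The sketch below illustrates how this post-reboot symptom pattern can be surfaced programmatically: one device absent from the kernel's view and one present but reporting zero capacity. The expected device names are placeholders, not the inventory of the affected host.

```python
#!/usr/bin/env python3
"""Sketch of a post-reboot NVMe presence/capacity check (illustrative only)."""
from pathlib import Path

EXPECTED = ["nvme0n1", "nvme1n1", "nvme2n1", "nvme3n1"]  # placeholder inventory

for name in EXPECTED:
    size_file = Path("/sys/block") / name / "size"
    if not size_file.exists():
        print(f"{name}: NOT DETECTED by the kernel")
        continue
    # /sys/block/<dev>/size reports the capacity in 512-byte sectors.
    sectors = int(size_file.read_text().strip())
    if sectors == 0:
        print(f"{name}: detected but reports 0 GB capacity")
    else:
        print(f"{name}: {sectors * 512 / 1e9:.1f} GB")
```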
Why This Was Exceptionally Unlikely
This failure mode is statistically extreme:
- Both drives were enterprise-grade NVMe devices
- Both were members of a mirror specifically designed to tolerate single-device failure
- There were no SMART, iDRAC, or ZFS indicators suggesting a pending fault
- The failures occurred effectively simultaneously and only became fully visible after a reboot
In many thousands of host-years of operation, we have not previously encountered a scenario where both members of a ZFS mirror failed in such close succession without advance warning.
The absence of meaningful alerts meant that there was no operational signal that would normally justify preemptive action such as taking the host out of service prior to the reboot.
Impact
- Approximately 317 customer instances on the affected host experienced downtime
- The host itself required full storage reinitialization
- Customer instances were restored from backup snapshots via our Incus-based recovery infrastructure (a simplified restore sketch follows the list below)
Because the incident occurred while the current daily backup cycle was still in progress, restore points varied:
- Approximately 20% of instances were recovered from backups taken earlier the same morning
- Approximately 80% of instances were recovered from the most recent completed weekly backup, taken the previous morning (CET)
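For illustration, here is a minimal restore loop, under the assumption that the instance backups exist as Incus export tarballs on the rebuilt host; the directory, naming scheme, and command sequence are hypothetical and this is not our actual recovery tooling.

```python
#!/usr/bin/env python3
"""Illustrative bulk-restore sketch (not our production recovery pipeline).

Assumption: backups are exported Incus instance tarballs named <instance>.tar.gz
in BACKUP_DIR, and the `incus` CLI is available on the rebuilt host."""
import subprocess
from pathlib import Path

BACKUP_DIR = Path("/srv/backups/latest")  # hypothetical location

def restore_instance(tarball: Path) -> None:
    name = tarball.name.removesuffix(".tar.gz")
    # Import the exported instance backup, then start the instance.
    subprocess.run(["incus", "import", str(tarball)], check=True)
    subprocess.run(["incus", "start", name], check=True)
    print(f"restored and started {name}")

for tarball in sorted(BACKUP_DIR.glob("*.tar.gz")):
    restore_instance(tarball)
```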
Recovery and Resolution
Once it was clear that the local ZFS pool could not be recovered:
- The affected storage pool was destroyed and recreated (a configuration sketch follows this list)
- The host was re-initialized cleanly
- Customer instances were restored from the most recent available snapshots
- All affected services were brought back online
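The following is a minimal sketch of the pool re-creation step, assuming a layout of two NVMe mirror vdevs; the pool name, device paths, and properties are placeholders rather than the exact configuration of the affected host.

```python
#!/usr/bin/env python3
"""Sketch of re-creating a mirrored ZFS pool after the old one was destroyed."""
import subprocess

POOL = "tank"  # placeholder pool name
MIRRORS = [
    ("/dev/nvme0n1", "/dev/nvme1n1"),
    ("/dev/nvme2n1", "/dev/nvme3n1"),
]

# Builds: zpool create -f tank mirror nvme0n1 nvme1n1 mirror nvme2n1 nvme3n1
cmd = ["zpool", "create", "-f", POOL]
for a, b in MIRRORS:
    cmd += ["mirror", a, b]

subprocess.run(cmd, check=True)

# Basic sanity check that the new pool is online.
subprocess.run(["zpool", "status", POOL], check=True)
```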
Lessons Learned and Preventive Measures
Although this incident stemmed from an extremely improbable hardware failure, we are still taking concrete steps to reduce the blast radius of similar edge cases in the future:
- More conservative handling and escalation of any ZFS device-level errors, even when isolated (see the monitoring sketch after this list)
- Additional scrutiny of storage health prior to maintenance reboots on high-density hosts
- Adjustments to maintenance timing relative to active backup windows
- Review of power and firmware interactions specific to NVMe devices under sustained I/O load
- Continued evaluation of pool layout and recovery strategies to further limit worst-case scenarios
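As a sketch of the stricter escalation policy for device-level errors, the script below treats any nonzero read/write/checksum counter in `zpool status -p` output as an event worth escalating, even when the pool still reports ONLINE; the parsing is simplified and the alerting hook is a placeholder.

```python
#!/usr/bin/env python3
"""Sketch: escalate on ANY nonzero ZFS device error counter (illustrative only)."""
import subprocess

def devices_with_errors() -> list[str]:
    out = subprocess.run(["zpool", "status", "-p"],
                         capture_output=True, text=True, check=True).stdout
    flagged = []
    for line in out.splitlines():
        fields = line.split()
        # Rows in the config section end with three numeric counters: READ WRITE CKSUM.
        if len(fields) >= 5 and all(f.isdigit() for f in fields[-3:]):
            name, counters = fields[0], [int(f) for f in fields[-3:]]
            if any(counters):
                flagged.append(f"{name} read/write/cksum={counters}")
    return flagged

if __name__ == "__main__":
    for entry in devices_with_errors():
        # Placeholder for the real escalation path (paging / ticketing).
        print("ESCALATE:", entry)
```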
Closing Notes
This incident was not caused by a single mistake, misconfiguration, or ignored alert. It was the result of a rare and unfortunate convergence of hardware failures that only became fully apparent at reboot time.
While ZFS behaved exactly as designed — refusing to mount a pool whose integrity could not be proven — the lack of advance warning made the outcome both surprising and severe.
We regret the disruption caused and appreciate the patience shown while recovery was underway. Incidents like this feed directly into improving our operational resilience and recovery procedures going forward.
- Update
Some bad news. Unfortunately, our datacenter technicians have confirmed a rare critical two-drive failure. We will reload all instances on the affected host from the latest snapshots we have. Our sincere apologies for this.
- Monitoring
There seems to be a problem with some cabling in the physical host. The datacenter technicians are working on it. Sorry for the inconvenience.
- Identified
The physical host needs a reboot. Instances running there will see 15-20 minutes of downtime.
![](https://instatus.com/user-content/v1674633197/eerjhdoiwnsy4gcu69hd.png)