Webdock - One of our AMD Epyc hosts needs a reboot – Incident details

One of our AMD Epyc hosts needs a reboot

Resolved
Major outage
Started 10 days ago. Lasted about 6 hours.

Affected

Denmark: General Infrastructure

Partial outage from 8:49 AM to 2:51 PM

Updates
  • Resolved

    This incident has been resolved.

    Post-mortem: Dual NVMe Drive Failure on EPYC Host

    Here is our post-mortem for today's incident, which caused extended downtime for approximately 317 EPYC-based customer instances.


    Summary

    Earlier today, a single EPYC hypervisor experienced a storage failure following a planned administrative restart. The restart itself was routine and performed to address degraded disk I/O performance that had been observed over the preceding days.

    Following the reboot, the system failed to come back online due to the unexpected loss of two NVMe drives, which together formed a complete ZFS top-level mirror vdev. The simultaneous loss of both members of a mirror rendered the ZFS pool unimportable and resulted in extended downtime while recovery operations were performed.


    Timeline and Detection

    Prior to the restart, we performed standard pre-maintenance checks:

    • The ZFS storage pool reported as ONLINE

    • No critical ZFS alerts were present

    • No hardware warnings or failures were reported by Dell iDRAC / IPMI

    • There was a single historical ZFS write error recorded on one device, but this was not accompanied by device faulting, checksum storms, or pool degradation

    This type of isolated write error is something we occasionally observe across large fleets and, based on long operational experience, does not normally indicate imminent or catastrophic failure. The expectation was therefore to proceed with a controlled reboot, followed by a scrub if necessary.

    At the time of the restart, there were no predictive indicators from either the storage layer or the hardware management layer that suggested an elevated risk of failure.
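
    For context, the kind of checks listed above can be approximated with a short script along the following lines. This is a simplified sketch rather than our actual tooling, and the device paths are illustrative placeholders; it assumes a Linux host with the standard zpool and smartctl utilities installed.

        import subprocess

        def run(cmd):
            """Run a command, returning its exit code and trimmed stdout."""
            proc = subprocess.run(cmd, capture_output=True, text=True)
            return proc.returncode, proc.stdout.strip()

        def preflight_checks(nvme_devices=("/dev/nvme0", "/dev/nvme1")):
            # Pool health summary: "zpool status -x" prints
            # "all pools are healthy" when nothing is degraded or faulted.
            _, pool_health = run(["zpool", "status", "-x"])
            print("ZFS:", pool_health)

            # Per-device SMART health as reported by each NVMe controller.
            # The device paths above are illustrative placeholders.
            for dev in nvme_devices:
                code, out = run(["smartctl", "-H", dev])
                status = "PASSED" if code == 0 and "PASSED" in out else "NEEDS ATTENTION"
                print(f"{dev}: {status}")

        if __name__ == "__main__":
            preflight_checks()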


    Failure Event

    Upon reboot, the system failed to import its ZFS pool. Investigation revealed that two NVMe drives were no longer available to the system. These two drives together constituted an entire mirror vdev at the top level of the pool.

    In ZFS, the loss of a complete top-level vdev makes a pool unrecoverable by design, as data is striped across vdevs and cannot be reconstructed without at least one surviving replica.
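
    To make that rule concrete, here is a small illustrative model (plain Python, not ZFS code): a pool remains importable only while every top-level vdev still has at least one healthy member, so a mirror survives the loss of one disk but not both. The pool layout and device names below are hypothetical.

        # Illustrative model of the rule described above: a pool is importable
        # only if every top-level vdev still has at least one healthy member.
        # Device names and states are hypothetical.

        def pool_importable(vdevs):
            """vdevs: list of top-level vdevs, each a dict of {device: healthy?}."""
            return all(any(members.values()) for members in vdevs)

        # A hypothetical pool striped across two mirror vdevs:
        healthy_pool = [
            {"nvme0n1": True, "nvme1n1": True},    # mirror-0
            {"nvme2n1": True, "nvme3n1": True},    # mirror-1
        ]

        after_failure = [
            {"nvme0n1": True, "nvme1n1": True},    # mirror-0 intact
            {"nvme2n1": False, "nvme3n1": False},  # mirror-1: both members lost
        ]

        print(pool_importable(healthy_pool))   # True
        print(pool_importable(after_failure))  # False -> pool cannot be imported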

    This immediately escalated the incident from a routine maintenance task to a full host recovery operation.


    Hardware Investigation

    With assistance from on-site datacenter technicians, extensive hardware diagnostics were performed:

    • Drives were reseated and moved to known-good NVMe bays

    • Backplane, cabling, and PCIe connectivity were verified

    • BIOS and iDRAC inventory were reviewed

    • Power cycling and cold starts were attempted

    The results were conclusive:

    • One drive was no longer detected at all by the system

    • The second drive was detected but reported a capacity of 0 GB and failed initialization

    At this point, it was clear that both NVMe drives had suffered irreversible failure, likely at the controller or firmware level.
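
    As an illustration of the kind of check involved, the sketch below lists the NVMe namespaces the Linux kernel currently exposes along with their reported sizes; a drive that has dropped off the bus never appears at all, while one with a failed controller can show up with a capacity of zero. This is a simplified example, not our diagnostic tooling.

        from pathlib import Path

        def list_nvme_namespaces():
            """Report each NVMe namespace the kernel exposes and its reported size.

            A drive that is no longer detected never shows up here at all; a drive
            whose controller has failed may appear with a size of 0.
            """
            for dev in sorted(Path("/sys/block").glob("nvme*n*")):
                # /sys/block/<dev>/size is the device size in 512-byte sectors.
                sectors = int((dev / "size").read_text())
                size_gb = sectors * 512 / 1e9
                print(f"{dev.name}: {size_gb:.1f} GB")

        if __name__ == "__main__":
            list_nvme_namespaces()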


    Why This Was Exceptionally Unlikely

    This failure mode is statistically extreme:

    • Both drives were enterprise-grade NVMe devices

    • Both were members of a mirror specifically designed to tolerate single-device failure

    • There were no SMART, iDRAC, or ZFS indicators suggesting a pending fault

    • The failures occurred effectively simultaneously and only became fully visible after a reboot

    In many thousands of host-years of operation, we have not previously encountered a scenario where both members of a ZFS mirror failed in such close succession without advance warning.
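
    As a purely illustrative back-of-envelope estimate (the failure-rate figure below is an assumption made for the sake of the arithmetic, not measured fleet data), treating drive failures as independent events puts the chance of both members of one specific mirror failing within the same day in the one-in-billions range per mirror-day:

        # Back-of-envelope estimate only. The 0.5% annualized failure rate (AFR)
        # is an assumed, illustrative figure, not measured fleet data.
        AFR = 0.005                 # assumed annual failure rate per drive
        DAILY = AFR / 365           # rough chance a given drive fails on a given day

        # Chance that both members of one specific mirror fail, independently,
        # within the same one-day window:
        both_same_day = DAILY ** 2
        print(f"~1 in {1 / both_same_day:,.0f} mirror-days")   # roughly 1 in 5 billion

    In practice, correlated factors such as shared firmware, shared power events, or sustained identical workloads can make the real-world likelihood higher than an independence assumption suggests, which is part of why the preventive measures below look at firmware and power interactions despite the rarity of the event.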

    The absence of meaningful alerts meant that there was no operational signal that would normally justify preemptive action such as taking the host out of service prior to the reboot.


    Impact

    • Approximately 317 customer instances on the affected host experienced downtime

    • The host itself required full storage reinitialization

    • Customer instances were restored from backup snapshots via our Incus-based recovery infrastructure

    Because the incident occurred while the current daily backup cycle was still in progress, restore points varied:

    • Approximately 20% of instances were recovered from backups taken earlier the same morning

    • Approximately 80% of instances were recovered from the most recent completed weekly backup, taken the previous morning (CET)


    Recovery and Resolution

    Once it was clear that the local ZFS pool could not be recovered:

    • The affected storage pool was destroyed and recreated (see the sketch after this list)

    • The host was re-initialized cleanly

    • Customer instances were restored from the most recent available snapshots

    • All affected services were brought back online
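
    For readers curious what the "destroyed and recreated" step above amounts to at the storage layer, the general shape is sketched below. The pool name, device paths, and mirror layout are placeholders rather than the actual configuration of the affected host, and this is a simplified illustration rather than our provisioning automation.

        import subprocess

        # Simplified illustration of re-initializing storage as a stripe of mirror
        # vdevs after the old pool proved unimportable. Pool name, device paths,
        # and layout are placeholders, not the affected host's real configuration.
        POOL = "tank"
        MIRRORS = [("/dev/nvme0n1", "/dev/nvme1n1"),
                   ("/dev/nvme2n1", "/dev/nvme3n1")]

        def recreate_pool():
            # Equivalent to: zpool create -f tank mirror a b mirror c d
            # -f is needed because surviving devices may still carry labels
            # from the old, unimportable pool.
            cmd = ["zpool", "create", "-f", POOL]
            for a, b in MIRRORS:
                cmd += ["mirror", a, b]
            subprocess.run(cmd, check=True)

        if __name__ == "__main__":
            recreate_pool()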


    Lessons Learned and Preventive Measures

    Although this incident stemmed from an extremely improbable hardware failure, we are still taking concrete steps to reduce the blast radius of similar edge cases in the future:

    • More conservative handling and escalation of any ZFS device-level errors, even when isolated (see the monitoring sketch after this list)

    • Additional scrutiny around storage health prior to maintenance reboots on high-density hosts

    • Adjustments to maintenance timing relative to active backup windows

    • Review of power and firmware interactions specific to NVMe devices under sustained I/O load

    • Continued evaluation of pool layout and recovery strategies to further limit worst-case scenarios
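
    As an example of what the first measure above could translate into, the sketch below escalates on any non-zero ZFS device error counter, even when the pool as a whole still reports ONLINE. The parsing is deliberately simplified and the script is illustrative only; in practice this would feed an alerting pipeline rather than print to stdout.

        import subprocess

        def devices_with_errors():
            """Scan `zpool status` output and report any entry whose read/write/
            checksum error counters are non-zero, even if the pool is still ONLINE.
            Simplified parsing, illustrative only."""
            out = subprocess.run(["zpool", "status"],
                                 capture_output=True, text=True).stdout
            flagged = []
            for line in out.splitlines():
                fields = line.split()
                # Config lines look like: NAME  STATE  READ  WRITE  CKSUM
                if len(fields) == 5 and all(f.isdigit() for f in fields[2:]):
                    name, state = fields[0], fields[1]
                    read, write, cksum = (int(f) for f in fields[2:])
                    if read or write or cksum:
                        flagged.append((name, state, read, write, cksum))
            return flagged

        if __name__ == "__main__":
            for name, state, r, w, c in devices_with_errors():
                print(f"ESCALATE: {name} ({state}) read={r} write={w} cksum={c}")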


    Closing Notes

    This incident was not caused by a single mistake, misconfiguration, or ignored alert. It was the result of a rare and unfortunate convergence of hardware failures that only became fully apparent at reboot time.

    While ZFS behaved exactly as designed, refusing to import a pool whose integrity could not be proven, the lack of advance warning made the outcome both surprising and severe.

    We regret the disruption caused and appreciate the patience shown while recovery was underway. Incidents like this feed directly into improving our operational resilience and recovery procedures going forward.

  • Update

    Some bad news: unfortunately, our DC guys found a rare critical two-drive failure. We'll reload all the servers on the affected host with the latest snapshots we have. Our sincere apologies for this.

  • Monitoring

    There seems to be a problem with some cabling in the physical host. The DC guys are on it. Sorry for the inconvenience.

  • Identified

    The physical host needs a reboot. Instances running there will see 15-20 minutes of downtime.