Webdock - Issue with another EPYC host – Incident details


Issue with another EPYC host

Resolved
Major outage
Started 8 days ago · Lasted about 7 hours

Affected

Denmark: General Infrastructure

Major outage from 9:58 AM to 10:31 AM, Operational from 10:31 AM to 1:18 PM, Major outage from 1:18 PM to 2:36 PM, Operational from 2:36 PM to 5:11 PM

Updates
  • Resolved

    We have completed all migrations. This should conclude this incident. We apologize for any inconvenience caused.

  • Update

    All customer instances are now being started on the unstable system. You should see your VPS come up very soon. Migrations will begin shortly.

  • Update

    Unfortunately, it turns out this system will not and cannot support a single-CPU layout while allowing our NVMe drives to function. The only remaining option is to reinsert the faulty CPU and live-migrate all users away from this system as quickly as we can. You will receive migration start and end notifications by email. We expect to be able to complete the migrations before the faulty CPU kicks up a fuss again. We will update here once the migrations are complete and this issue is fully resolved. We do not have a firm ETA; this could potentially take a couple of hours. You should see your instance come up before long, then at some point it will go down for a minute or two while it is started in the new location, after which you should see no further disruption.

  • Monitoring

    It turns out the fault did follow the CPU, so the CPU is simply bad. We have just removed the CPU and booted the system in a single-CPU configuration. However, this resulted in our NVMe drives no longer being visible. For this reason, we are switching the healthy CPU to the other CPU socket, in the hope that the PCIe lanes for the drives are tied to that socket and we can run on that single socket. If it turns out both CPUs are required for the NVMe drives to come up correctly, we will need to reinsert the bad CPU and migrate all customers currently on this system away from it as quickly as possible. We will update once we know more.

    Unfortunately we do not have a spare CPU of this exact type available in the DC, so these are the options open to us at the moment.

  • Identified

    The issue reappeared. We are looking into it.

  • Postmortem

    Incident Post-Mortem – Unexpected Server Reboots

    Affected system: Single compute node (Dell R6525, dual AMD EPYC)

    Summary

    One compute node experienced repeated unexpected reboots caused by hardware-level Machine Check Exceptions (MCEs) reported by the system firmware and operating system. The issue was resolved after on-site hardware intervention, and the system is now operating normally.

    Impact

    Customers hosted on this node experienced service interruptions during the reboot loop. No data loss occurred.

    Root Cause (most likely)

    The most likely cause was a marginal CPU socket contact (pin pressure / seating issue) on one processor socket. This can occasionally occur even on new systems and may only surface after some time in production.

    When the CPUs were removed, inspected, reseated, and swapped between sockets, the errors stopped and have not recurred.

    Other causes considered

    While investigating, we also evaluated and ruled out:

    • ECC memory failures (no memory errors were logged by firmware or iDRAC)

    • Operating system or kernel issues

    • Sustained thermal overload

    Other less likely contributors include transient socket power instability or inter-CPU fabric retraining issues, both of which can be cleared by a full power-off and reseat.

    Background

    The server was newly installed approximately 1½ months ago and successfully passed a 10-hour full system stress test before being placed into production. The issue developed later and was not present during initial burn-in.

    Resolution & current status

    • CPUs were reseated and swapped between sockets

    • System firmware counters were cleared

    • The server is now stable and operating normally under load

    • Ongoing monitoring has been increased
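
    For context on the MCE reports mentioned above, the snippet below is a minimal, illustrative Python sketch (not part of Webdock's actual tooling) of how such reports can be counted from the Linux kernel log. The matched patterns and the use of dmesg are assumptions about typical Linux machine-check messages.

        import re
        import subprocess

        # Assumed patterns for typical Linux machine-check kernel messages.
        MCE_PATTERN = re.compile(r"mce:|machine check|hardware error", re.IGNORECASE)

        def count_mce_lines(log_text: str) -> int:
            """Count kernel log lines that look like Machine Check Exception reports."""
            return sum(1 for line in log_text.splitlines() if MCE_PATTERN.search(line))

        if __name__ == "__main__":
            # Read the kernel ring buffer; assumes a Linux host where dmesg is available.
            dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
            print(f"MCE-related kernel log lines: {count_mce_lines(dmesg)}")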

  • Resolved

    This incident has been resolved. Once again, sorry for the inconvenience.

  • Identified

    Our DC team is looking into the issue. It appears that one of the CPUs has failed (this is a dual-CPU setup). The team is working on bringing the host back up.

    Apologies for the inconvenience.