Webdock - Issue with another EPYC host – Incident details


Issue with another EPYC host

Resolved
Major outage
Started 8 days ago · Lasted about 7 hours

Affected

Denmark: General Infrastructure

Major outage from 9:58 AM to 10:31 AM, Operational from 10:31 AM to 1:18 PM, Major outage from 1:18 PM to 2:36 PM, Operational from 2:36 PM to 5:11 PM

Updates
  • Resolved

    We have completed all migrations. This should conclude this incident. We apologize for any inconvenience caused.

  • Update

    All customer instances are now being started on the unstable system. You should see your VPS come up very soon. Migrations will begin shortly.

  • Update

    Unfortunately, it turns out this system will not and cannot support a single-CPU layout while allowing our NVMe drives to function. The only remaining option is to reinsert the faulty CPU and live-migrate all users away from this system as quickly as we can. You will receive migration start and end notifications by email. We expect to be able to complete the migrations before the faulty CPU kicks up a fuss again. We will update here once the migrations are complete and this issue is fully resolved. We do not have a firm ETA; this could potentially take a couple of hours. You should see your instance come up before long, then at some point it will go down for a minute or two while it is started in the new location, after which you should see no further disruption.

  • Monitoring

    It turns out the fault did follow the CPU, so the CPU is simply bad. We have just removed the CPU and booted the system in a single-CPU configuration. However, this resulted in our NVMe drives no longer being visible. For this reason, we are switching the healthy CPU to the other CPU socket, in the hope that the PCIe lanes for the drives are tied to that socket and we can run on that single socket. If it turns out both CPUs are required for the NVMe drives to come up correctly, we will need to reinsert the bad CPU and migrate all customers currently on this system away from it as quickly as possible. We will update once we know more.

    Unfortunately we do not have a spare CPU of this exact type available in the DC, so these are the options open to us at the moment.

  • Identified

    The issue reappeared. We are looking into it.

  • Postmortem

    Incident Post-Mortem – Unexpected Server Reboots

    Affected system: Single compute node (Dell R6525, dual AMD EPYC)

    Summary

    One compute node experienced repeated unexpected reboots caused by hardware-level Machine Check Exceptions (MCEs) reported by the system firmware and operating system. The issue was resolved after on-site hardware intervention, and the system is now operating normally.

    Impact

    Customers hosted on this node experienced service interruptions during the reboot loop. No data loss occurred.

    Root Cause (most likely)

    The most likely cause was a marginal CPU socket contact (pin pressure / seating issue) on one processor socket. This can occasionally occur even on new systems and may only surface after some time in production.

    When the CPUs were removed, inspected, reseated, and swapped between sockets, the errors stopped and have not recurred.

    Other causes considered

    While investigating, we also evaluated and ruled out:

    • ECC memory failures (no memory errors were logged by firmware or iDRAC)

    • Operating system or kernel issues

    • Sustained thermal overload

    Other less likely contributors include transient socket power instability or inter-CPU fabric retraining issues, both of which can be cleared by a full power-off and reseat.

    Background

    The server was newly installed approximately 1½ months ago and successfully passed a 10-hour full system stress test before being placed into production. The issue developed later and was not present during initial burn-in.

    Resolution & current status

    • CPUs were reseated and swapped between sockets

    • System firmware counters were cleared

    • The server is now stable and operating normally under load

    • Ongoing monitoring has been increased
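
    For context on the MCE reports mentioned above, the snippet below is a minimal, illustrative Python sketch (not part of Webdock's actual tooling) of how such reports can be counted from the Linux kernel log. The matched patterns and the use of dmesg are assumptions about typical Linux machine-check messages.

        import re
        import subprocess

        # Assumed patterns for typical Linux machine-check kernel messages.
        MCE_PATTERN = re.compile(r"mce:|machine check|hardware error", re.IGNORECASE)

        def count_mce_lines(log_text: str) -> int:
            """Count kernel log lines that look like Machine Check Exception reports."""
            return sum(1 for line in log_text.splitlines() if MCE_PATTERN.search(line))

        if __name__ == "__main__":
            # Read the kernel ring buffer; assumes a Linux host where dmesg is available.
            dmesg = subprocess.run(["dmesg"], capture_output=True, text=True, check=False).stdout
            print(f"MCE-related kernel log lines: {count_mce_lines(dmesg)}")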

  • Resolved

    This incident has been resolved. Once again, sorry for the inconvenience.

  • Identified

    Our DC team is looking into the issue. It appears that one of the CPUs has failed (this is a dual-CPU setup). The team is working on bringing the host back up.

    Apologies for the inconvenience.