Webdock - Host system instability in Denmark – Incident details


Host system instability in Denmark

Resolved
Operational
Started 2 months ago
Lasted about 8 hours

Affected

Denmark: General Infrastructure

Partial outage from 11:15 AM to 11:29 AM, Operational from 11:29 AM to 12:35 PM, Partial outage from 12:35 PM to 6:48 PM

Updates
  • Resolved

    All customers have been migrated away and everything is operational, except for two AlmaLinux instances that are having problems booting because SELinux is active on them. Our engineers are working on these last two customer instances, and they should be up and running soon. This incident is now resolved. Over the next few days we will diagnose the failed host to determine whether it needs an RMA or whether parts can be exchanged before it rejoins our fleet. Thank you for your patience during the instability and repeated outages today.

  • Update

    The fix suggested by Dell support seems to have had the exact opposite effect on system stability: this time the system was up for only 10 minutes. This tells us, first, that the problem does indeed have something to do with the CPUs and their BIOS-managed performance profiles on the system, and second, that this is likely a hard fault in the hardware that we cannot work around.

    This means we will be starting migrations of all customers away from this host shortly. You will receive emails notifying you of when migration starts and ends.

    Once all customers are away from this system, it will be sent back to the manufacturer for further diagnosis and replacement.

  • Update

    The suggested BIOS setting from Dell support has been implemented and all customers are up. If this does not resolve the issue, we will perform an evacuation of this system in the form of a migration of all customers. This in effect means you may see up to two more restarts / brief outages today. Hopefully it doesn't come to that and the issue is resolved now, but we have no way of guaranteeing a positive result.

  • Update

    Unfortunately, the issue happened yet again before we had time to implement the fix suggested by Dell support. We will now reboot the affected system in order to apply the BIOS change. As outlined earlier, if that does not give us a stable system, we will proceed with migrating all customers away from the affected system.

  • Update

    We have come up with a game plan for dealing with this particular system. We have found a BIOS setting which may prevent the issue from happening. If this host system spontaneously reboots again with that error, we will first try that fix. If we get a fourth reboot, it will be clear that the system has some CPU-related fault which we are unable to diagnose at this time. In that case we would migrate all customers away from this host to our other, healthier hosts, and then send the system in for an RMA. We hope it doesn't come to migrations, but we would rather you experience one last restart as your instance comes up on a good host than unplanned reboots and downtime at arbitrary times. We will update here as the situation develops.

  • Monitoring

    Unfortunately, this system just experienced another fault and rebooted. We have identified a potential issue with one of the CPUs and are still diagnosing it. We apologize for the inconvenience.

  • Resolved

    This incident has been resolved. The host system rebooted unexpectedly. We are analyzing logs in order to determine the cause. All customer VPS instances are up.

  • Investigating

    We lost networking for a host in Denmark and this host may be down. We are currently checking the status of the system.