Webdock - Notice history

All systems operational

Denmark: Network Infrastructure - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Denmark: Storage Backend - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 99.84%

Denmark: General Infrastructure - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 99.72%

Canada: Network Infrastructure - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Canada: Storage Backend - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Canada: General Infrastructure - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 99.83%

Webdock Statistics Server - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Webdock Dashboard - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 99.97%

Webdock Website - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Webdock Image Server - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Webdock REST API - Operational

100% - uptime
Jul 2024: 100.0% · Aug 2024: 100.0% · Sep 2024: 100.0%

Notice history

Sep 2024

Host system instability in Denmark
  • Resolved

    All customers have been migrated away and everything is operational, except for two AlmaLinux instances which are having problems booting because SELinux is active on them. Our engineers are working on these last two customer instances, and they should hopefully be up and running soon. This incident is now resolved. Over the next few days we will be diagnosing the failed host to determine whether it needs an RMA or whether parts can be exchanged before it rejoins our fleet. Thank you for your patience during the instability and repeated outages today.

  • Update

    The fix suggested by Dell support seems to have had the exact opposite effect on system stability: this time the system stayed up for only 10 minutes. This tells us, first, that the problem is indeed related to the CPUs and their BIOS-managed performance profiles on this system, and second, that this is likely a hardware fault which we cannot work around.

    This means we will be starting migrations of all customers away from this host shortly. You will receive emails notifying you of when migration starts and ends.

    Once all customers have been moved off this system, it will be sent back to the manufacturer for further diagnosis and replacement.

  • Update

    The suggested BIOS setting from Dell support has been implemented and all customers are up. If this does not resolve the issue, we will perform an evacuation of this system in the form of a migration of all customers. This in effect means you may see up to two more restarts / brief outages today. Hopefully it doesn't come to that and the issue is resolved now, but we have no way of guaranteeing a positive result.

  • Update

    Unfortunately the issue happened yet again before we had time to implement the fix suggested by Dell support. We will now reboot the affected system in order to apply the BIOS change. After that, as outlined earlier, if the change does not result in a stable system, we will proceed with migrating all customers away from the affected system.

  • Update

    We have come up with a game plan for dealing with this particular system. We have found a BIOS setting which may prevent the issue from happening. If this host system spontaneously reboots again with that error, we will first try that fix. If we get a fourth reboot, it is clear that the system has a CPU-related fault which we are unable to diagnose at this time. In that case we will migrate all customers away from this host to our other, healthier hosts, and then send the system in for an RMA. We hope it doesn't come to migrations, but we would rather you experience one last restart as your instance comes up on a good host than unplanned reboots and downtime at arbitrary times. We will update here as the situation develops.

  • Monitoring

    Unfortunately this system just experienced another fault and rebooted. We have identified a potential issue with one of the CPUs; we are still diagnosing. We apologize for the inconvenience.

  • Resolved

    This incident has been resolved. The host system decided to reboot. We are analyzing logs in order to determine the cause. All customer VPS instances are up.

  • Investigating

    We lost networking for a host in Denmark and this host may be down. We are currently checking the status of the system.

Jul 2024

DK DC1 power outage
  • Resolved

    We now believe we have completely recovered from the power outage this morning. To fully resolve all issues, we ended up having to roll back a number of customer servers which had experienced some data corruption. We will be reviewing all procedures and operations at our DC in order to, first, prevent any such power outage from happening again, whether or not we are doing maintenance, and second, look at whether we can build better protections into our data pools to avoid the corruption issues we saw today. There are known methods for this, but they come at a performance penalty, which we will be evaluating over the coming week or so.

    We sincerely apologize for the inconvenience caused today. This was force majeure at work and/or inadequately prepared technical staff who were working on our UPS systems today.

  • Update

    We are close to having all issues fully resolved. However, the power outage seems to have adversely affected 3 hosts, whose storage pools are reporting as degraded. This in turn is preventing proper restarts of VPS servers on those hosts. We are looking into how to resolve this issue. The good news is that all customers have been up for a long while now, and we have no other outstanding issues except this storage pool issue on the 3 hosts in question.

    We hope to resolve these last problems within the next few hours. The resolution may involve migrating a small number of customers to other locations, in which case you will receive a migration notification by email.

  • Update

    Unfortunately we have had to recover from the last known backups on one of our hosts, which for some reason had a completely corrupted storage pool after the power outage. We will look at how we can avoid such corruption in the future. In any case, all customer servers on that system are coming up one by one as they are reprovisioned from the snapshot taken this past evening, about 9 hours ago. We will update here once all servers are up and we are happy with how all systems are looking.

  • Update

    We are now down to a single host having problems. It seems we may have to recover this system from the last known backups (from about 8 hours ago). We will try a few more things to recover the local storage pool, which was somehow corrupted during the power outage.

    In other news, the UPS guys have completed their maintenance work and believe they have identified the issue which caused the outage this morning. When they isolated one of our UPS units to do maintenance, the remaining units were unable to communicate properly, causing them to drop the load to our DC. This is not supposed to happen and points to either wrong cabling or faulty components which were not caught during initial power outage testing before we went live with the DC.

    It is ironic that the exact systems designed to protect us from power outages were the ones responsible for one, but it is what it is, and all we can do from our side is trust that our UPS guys have now gotten us back to a redundant state.

  • Update

    Most customer VPS are up now and we are demoting this to a partial outage. We have a single host system where we are seeing some serious issues with the storage there after the power outage, which may take longer to recover than the others. We are working on this system right now.

  • Monitoring

    We are slowly bringing up all customer VPS servers. It seems that in some cases a few seconds of data loss is to be expected when we've had such a hard power cut to all systems simultaneously. We are hoping this does not result in any corruption of data, but we have no overview of the impact yet. We will focus on getting customer servers up and running first of all, then we will inspect all systems one by one.

  • Identified

    We have power again and most services are booting or have already booted. However, the UPS guys say the fault should of course never have happened in the first place; that is what emergency power systems are for. They are investigating the root cause and have asked us to hold off on any work on our side, as there is a chance we may have another power cut before they are done. We hope this will not be the case...

  • Investigating

    A work crew is doing some UPS maintenance today and it seems they somehow managed to cut power to the DC. We are currently investigating this incident.

Jul 2024 to Sep 2024
