Webdock - DK DC1 power outage – Incident details


DK DC1 power outage

Resolved
Major outage
Started 6 months ago
Lasted about 6 hours

Affected

Denmark: General Infrastructure

Major outage from 6:11 AM to 7:16 AM, Partial outage from 7:16 AM to 12:16 PM

Updates
  • Resolved

    We now believe we have completely recovered from this morning's power outage. To fully resolve all issues, we ended up having to roll back a number of customer servers which had experienced some data corruption. We will be reviewing all procedures and operations at our DC, firstly to prevent any such power outage from happening again, whether we are doing maintenance or not, and secondly to look at whether we can build better protections into our data pools to avoid the corruption issues we saw today. There are known methods for this, but they come at a performance penalty, which we will be evaluating in the coming week or so.

    We sincerely apologize for the inconvenience caused today. This was force majeure at work and/or inadequately prepared technical staff who were working on our UPS systems today.

  • Update

    We are close to having all issues fully resolved. However, the power outage seems to have adversely affected 3 hosts, whose storage pools are reporting as degraded. This in turn is preventing proper restarts of VPS servers on those hosts. We are looking into how to resolve this issue. The good news is that all customers have been up for a long while now and we have no other outstanding issues except this storage pool issue on the 3 hosts in question.

    We hope to resolve these last problems within the next few hours. The resolution may involve migrating a small number of customers to other locations, in which case you will receive a migration notification by email.

  • Update

    Unfortunately we have had to recover from the last known backups on one of our hosts, which for some reason had a completely corrupted storage pool after the power outage. We will look at how we can avoid such corruption in the future. In any case, all customer servers on that system are coming up one by one as they are reprovisioned from the snapshot taken this past evening, about 9 hours ago. We will update here once all servers are up and we are happy with how all systems are looking.

  • Update

    We are now down to a single host having problems. It seems we may have to recover from the last known backups for this system (backups from about 8 hours ago). We will try a few more things to recover the local storage pool, which was somehow corrupted during the power outage.

    In other news, the UPS guys have completed their maintenance work and believe they have identified the issue which caused the outage this morning. When they isolated one of our UPS units to do maintenance, the remaining units were unable to communicate properly, causing them to drop the load to our DC. This is not supposed to happen and points to either wrong cabling or faulty components which were not caught during initial power outage testing before we went live with the DC.

    It is ironic that the exact systems designed to protect us from power outages were the ones responsible for a power outage, but it is what it is, and all we can do from our side is trust that our UPS guys have now gotten us back to a redundant state.

  • Update

    Most customer VPS servers are up now and we are demoting this to a partial outage. We have a single host system where we are seeing some serious storage issues after the power outage, which may take longer to recover than the others. We are working on this system right now.

  • Monitoring

    We are slowly bringing up all customer VPS servers. It seems that in some cases a few seconds of data loss is to be expected when we've had such a hard power cut to all systems simultaneously. We are hoping this does not result in any corruption of data, but we have no overview of the impact yet. We will focus on getting customer servers up and running first of all, then we will inspect all systems one by one.

  • Identified

    We have power again and most services are booting or are already booted. However, the UPS guys say the fault should of course never have happened in the first place; that's what we have emergency power systems for. They are investigating the root cause and have asked us to hold off doing any work on our side, as they say there is a chance we may have another power cut before they are done. We hope this will not be the case...

  • Investigating

    A work crew is doing some UPS maintenance today and it seems they somehow managed to cut power to the DC. We are currently investigating this incident.