Webdock - Intermittent host instability causing downtime in Denmark - resolution forthcoming – Incident details


Intermittent host instability causing downtime in Denmark - resolution forthcoming

Resolved
Under maintenance
Started 3 months ago
Lasted 5 days

Affected

Denmark: General Infrastructure

Under maintenance from 2:28 PM to 8:16 AM

Updates
  • Resolved

    We are now calling this infrastructure issue resolved, as we have completed all migrations from Finland and have not seen a virtualization crash since we implemented our fix on Saturday. We were forced to do host restarts on two separate hosts in the same period, but this was because those systems were already in a bad state from the previous bad config.

    We were planning to shift all of our IP ranges away from Finland today so that they are announced directly in Denmark, reducing latency by about 20ms. However, an unrelated issue arose today: our ISP has a major outage in their Kolding DC which is affecting one of our fibers, and it looks like they have incorrectly configured our IP ranges. We are therefore postponing the changeover until our ISP has fixed everything on their side and our ranges are correctly configured with them.

    The changeover should have no noticeable impact on our customers, but we will post a maintenance notification here when we do the operation, just in case.

  • Monitoring

    During the day on Saturday, after continued investigation of the root causes of our problems, we found a simple caching parameter in our system setup which had been set to an incorrectly low value. After modifying this value across all of our hosts, we saw an immediate drop in load across the board. The incorrect value was written by our base setup orchestration scripts and was a holdover from earlier testing.

    We are actually quite amazed at what a huge difference it made setting this caching parameter to a proper value.

    Ever since we modified this parameter all systems have been green across the board and operating at a fantastic efficiency. In fact, the infrastructure seems to be performing now as we had planned all along (if not better) and we don't have a single host breaking a sweat at this time.

    Our customers should be able to notice a clear difference now that this fix has been implemented. What's even better news is that since we implemented this fix, we have not seen a single crash or hang of our virtualization.

    It is too early to call the issue completely fixed, however, given how infrequent the crashes were before - but if everything runs stable for the next 48 hours or so, it's really looking like a simple configuration tweak was all that it took to resolve our (honestly quite major) issues.

    We will still proceed with our plan of moving certain high-resource VPS instances to a dedicated location Monday/Tuesday, and we are still postponing the last migrations from Finland until Tuesday. We want to be absolutely sure things are OK and that this isn't a "too good to be true" type of situation.

    But for now, all is well in our cloud. We couldn't be happier that we found this resolution and we hope you are too :)

    Thanks again for sticking with us here

    Arni Johannesson

    CEO

  • Identified

    Over the last few days, after having migrated about 90% of our workloads from Finland, we have identified an issue which is causing us a great deal of trouble. Essentially speaking, the new virtualization environment we have set up in Denmark has proven to be unstable when our host instances are under load.

    What we observe is that our virtual environment crashes and halts all processing. There is no common denominator except that it is happening on specific hosts, all of which are ones where we have placed high-CPU and high-I/O users. The crashes do not leave any trace in any logs; all we see is a halt state where a reboot of the host is required to bring up the customer workloads. Fortunately, restarts are quick and the crashes do not seem to happen more than once or twice in any 48 hour period on the affected hosts. As infrequent as they are, these incidents are of course completely unacceptable.

    The only resolution to this issue is to deploy new hardware where we get rid of the virtualization components responsible for this behavior, and to migrate high-CPU and high-I/O users over to the new hardware. We put in an emergency order with our hardware vendor this morning and we expect to receive and deploy the new hardware Monday, at which time a small subset of our customers (it looks to be about 30-40 VPS instances at most causing these issues at this time) will be migrated to the new hardware in order to improve the overall stability of our cloud. If you see an unplanned migration notification Monday or Tuesday after already being migrated to Denmark, this is the reason.

    Moving forward, we will not be deploying workloads in the same virtual environment as we have set up now in Denmark, as there is no way for us to fix this issue with the virtualization. All we can do is get rid of it and return to a more direct-to-bare-metal approach, as we have traditionally done (and which has never caused us such problems in the past).

    We did test our new virtualization extensively and under load before deployment - and for at least 48 hours continuously for each system we deployed - but some subset of our customers are doing something "special" which we can't quite identify, and which puts our virtualization under some unique stress that causes it to hang from time to time. As it turns out, simulated load (stressing CPU, I/O and network) does not reflect real-world load closely enough for us to have caught this issue before we went to full-scale deployment in Denmark.

    Since we don't know exactly what triggers this issue, we are of course worried that it will happen on otherwise (thus far) unaffected hosts if a customer starts performing some workload our virtualization doesn't like. If this turns out to be the case, we will likely be forced to migrate most if not all of our customers who are already in Denmark to hardware which is configured without the troublesome virtualization components. We will do what we can to avoid this, however, as besides being a huge amount of work and investment, it would mean another migration downtime period for our customers. If this turns out to be required, we will of course send out a news bulletin to all affected customers with details.

    We will of course monitor the situation closely over the weekend and respond as quickly as we can around the clock in order to bring back up customer instances if these hangs/crashes happen again. We are almost sure it will happen again, at least a couple of times in the next 2-3 days until we have this resolved, given the behavior we have seen these past two days.

    We sincerely apologize for these disruptions. Denmark DC was supposed to be a fast (and happy) place for your VPS, but this issue caught us off guard. Rest assured we are doing everything we can to eliminate this and will do so as soon as possible in the coming days.

    Thank you for sticking with us

    Arni Johannesson

    CEO Webdock