Webdock - Notice history

Denmark: General Infrastructure under maintenance

Denmark: Network Infrastructure - Operational

Uptime: May 2024 · 99.97%, Jun 2024 · 99.84%, Jul 2024 · 100.0%

Denmark: Storage Backend - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Denmark: General Infrastructure - Under maintenance

Uptime: May 2024 · 100.0%, Jun 2024 · 99.90%, Jul 2024 · 99.65%

Canada: Network Infrastructure - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Canada: Storage Backend - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Canada: General Infrastructure - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Webdock Statistics Server - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Webdock Dashboard - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Webdock Website - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Webdock Image Server - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%

Webdock REST API - Operational

Uptime: May 2024 · 100.0%, Jun 2024 · 100.0%, Jul 2024 · 100.0%
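
For context, these percentages translate into concrete downtime as follows. The short sketch below is only an illustrative helper, not part of any Webdock tooling; the example figures are taken from the uptime lines above.

```python
# Illustrative helper: convert a monthly uptime percentage into downtime minutes.
def downtime_minutes(uptime_percent: float, days_in_month: int) -> float:
    """Minutes of downtime implied by an uptime percentage over a whole month."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

# Figures from the uptime lines above:
#   99.65% over July (31 days) -> about 156 minutes (~2.6 hours)
#   99.84% over June (30 days) -> about 69 minutes
#   99.97% over May (31 days)  -> about 13 minutes
for label, pct, days in [("Jul", 99.65, 31), ("Jun", 99.84, 30), ("May", 99.97, 31)]:
    print(f"{label}: {downtime_minutes(pct, days):.0f} minutes of downtime")
```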

Notice history

Jul 2024

DK DC1 power outage
  • Resolved

    We now believe we have completely recovered from the power outage this morning. To fully resolve all issues, we ended up having to roll back a number of customer servers which had experienced data corruption. We will be reviewing all procedures and operations at our DC, firstly to prevent any such power outage from happening again, whether we are doing maintenance or not, and secondly to see whether we can build in better protections for our data pools to avoid the corruption issues we saw today. There are known methods for this, but they come at a performance penalty, which we will be evaluating in the coming week or so.

    We sincerely apologize for the inconvenience caused today. This was either force majeure at work, inadequately prepared technical staff working on our UPS systems, or some combination of the two.

  • Monitoring
    Update

    We are close to having all issues fully resolved. However, the power outage seems to have adversely affected 3 hosts, whose storage pools are reporting as degraded. This in turn is preventing proper restarts of VPS servers on those hosts. We are looking into how to resolve this. The good news is that all customers have been up for a long while now, and we have no other outstanding issues except this storage pool issue on the 3 hosts in question.

    We hope to resolve these last problems within the next few hours. The resolution may involve migrating a small number of customers to other locations, in which case you will receive a migration notification by email.

  • Monitoring
    Update

    Unfortunately we have had to recover from the last known backups on one of our hosts, which for some reason had a completely corrupted storage pool after the power outage. We will look at how we can avoid such corruption in the future. In any case, all customer servers on that system are coming up one by one as they are reprovisioned from the snapshot taken yesterday evening, about 9 hours ago. We will update here once all servers are up and we are happy with how all systems are looking.

  • Monitoring
    Update

    We are now down to a single host having problems. It seems like we may have to recover from last known backups for this system (backups from about 8 hours ago). We will try a few more things to recover the local storage pool, which was corrupted during the power outage somehow.

    In other news, the UPS guys have completed their maintenance work and believe they have identified the issue which caused the outage this morning. When they isolated one of our UPS units to do maintenance, the remaining units were unable to communicate properly, causing them to drop the load to our DC. This is not supposed to happen and points to either wrong cabling or faulty components which were not caught during the initial power outage testing before we went live with the DC.

    It is ironic that the exact systems designed to protect us from power outages were the ones responsible for a power outage, but it is what it is, and all we can do from our side is trust that our UPS guys have now gotten us back to a redundant state.

  • Monitoring
    Update

    Most customer VPS servers are up now and we are demoting this to a partial outage. We have a single host system where we are seeing some serious issues with its storage after the power outage, which may take longer to recover than the others. We are working on this system right now.

  • Monitoring

    We are slowly bringing up all customer VPS servers. It seems that in some cases a few seconds of data loss is to be expected when we've had such a hard power cut to all systems simultaneously. We are hoping this does not result in any corruption of data, but we have no overview of the impact yet. We will focus on getting customer servers up and running first of all, then we will inspect all systems one by one.

  • Identified

    We have power again and most services are booting or are already booted. However, the UPS guys say the fault should of course never have happened in the first place; preventing exactly this is why we have emergency power systems. They are investigating the root cause and have asked us to hold off on doing any work on our side, as they say there is a chance we may have another power cut before they are done. We hope this will not be the case...

  • Investigating

    A work crew is doing some UPS maintenance today and it seems they somehow managed to cut power to the DC. We are currently investigating this incident.

Jun 2024

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required
Scheduled for June 14, 2024 at 8:51 AM – July 13, 2024 at 5:21 AM (29 days)
  • Update
    In progress, July 23, 2024 at 7:09 AM

    Progress update from the Webdock team on this issue: Our remediation efforts have progressed as planned but have unfortunately not produced the desired result. Despite having upgraded the affected systems to a much cleaner upstream kernel build and newer ZFS filesystem components, there is still some memory corruption happening which crashes lxcfs. We now suspect the issue may be isolated to libfuse, but we cannot be certain at this time.

    We have not seen any wholesale kernel crashes since the upgrades, but that issue cannot be considered closed either until we have had a bit more time without incident.

    In order to effectively solve the lxcfs issue, we have, through extensive dialogue with the creator of our hypervisor, decided to try to adopt a completely new and cutting-edge feature which doesn't even exist yet: per-instance lxcfs virtualization. This essentially means that each VPS server gets its own lxcfs instance instead of one global instance for the host system. If an lxcfs instance crashes, it will then only affect a single customer VPS, where it is easily remediated by rebooting that one VPS, instead of having to reboot ALL customer VPS servers on a host.

    As mentioned, this feature doesn't fully exist yet. The underlying support exists in lxcfs as a custom feature built for Netflix, who did not want their containerized workloads to lose lxcfs across entire hosts, but it has not actually been implemented in the hypervisor, LXD, yet. For this reason we have decided to sponsor work on Incus, the fork of LXD maintained by the creator and former maintainer of LXD. He is currently working on implementing the feature, which should be ready and tested within a few days.

    After that, Webdock will proceed to migrate from LXD to Incus in a rolling fashion across our entire infrastructure in Denmark. This will obviously mean yet another downtime event for almost all our customers (only KVM customers are not affected by this), but after the upgrade we should be in a state where 99.99% of our customers will no longer be affected by this issue and we should see greater stability throughout.

    We will update here once we start the procedure, at first only on a couple of hosts where we will then watch behavior and performance for some days before moving on to the rest of our infrastructure.

    We thank you all for your continued patience

    Arni Johannesson, CEO Webdock

  • Update
    In progress, July 16, 2024 at 9:25 AM

    We have now rolled out the supposed fix for our issues to all systems which were already affected by the crash. If these systems perform well over the next week or so, then we will roll out the fix to remaining systems whenever the problem occurs, but no sooner, in order to minimize disruption for our customers. We will update this issue once we know more and have had some time to observe and work with the new kernel and zfs packages.

  • Update
    In progress, July 12, 2024 at 9:25 AM

    A quick status update on this issue from the Webdock team:

    The main symptom of the current issues is the continued crashing of lxcfs on our host systems. This causes a range of issues, primarily that top/htop will not work in your vps and that you see the resource utilization of the entire host server you are on.

    The main headache we are facing is that lxcfs is a global component on each host: if we restart it, we have to restart all customer VPS servers. This is obviously not great, as then everybody on that host has some downtime of a few minutes.

    So, instead of restarting everything willy-nilly, which only fixes things until the next crash of lxcfs, we are actively working with people at Canonical and the creator of our virtualization platform (LXD), who are absolute experts in the field, on how we can resolve the situation, so that hopefully we will "only" need a single restart of our hosts and will be in a good state after that.

    We are still seeing the occasional Kernel crash, but they are fortunately few and far between after we balanced our infrastructure and tweaked some knobs and dials in order to reduce the chance of a kernel crash as much as we could.

    The basic issue in the Ubuntu kernel has not yet been identified or fixed, and the people we speak to say that it could take months for an official fix from Canonical.

    For this reason, we now have a roadmap for how to mitigate this issue, but it's rather technical and requires a lot of testing. We hope to complete this testing this week, and then we will do a rolling update of all host systems (with a restart for everybody...) in order to bring us up to a cleaner, newer kernel version which should (hopefully) be free of the bugs we are seeing. This is the recommended course of action by the experts we have been in dialogue with.

    This whole undertaking is a massive task, and we are hard at work on this.

    We thank you for your continued patience with us as we work to resolve these issues

    Arni Johannesson, CEO

  • Update
    In progress, July 05, 2024 at 7:30 AM

    Unfortunately this issue is persisting on our infrastructure. In cooperation with the lxcfs developer who has been assisting us, we have now determined the most likely core cause of this issue. Unfortunately we have also started seeing some wholesale kernel crashes again, which was otherwise an issue we believed had gone away.

    Without going into too much detail, the issue seems to be in the latest Ubuntu kernel and the ZFS kernel modules. What happens is that on systems with large disk read workloads, if the ZFS ARC RAM cache fills up and ZFS starts hitting the disk drives a lot, some bug is triggered which can cause random memory corruption in the Linux kernel. This in turn can cause interesting (bad) side effects seemingly at random, such as a kernel crash requiring a reboot of the host system. (An illustrative way of watching ARC pressure on a host is sketched after this incident's update timeline.)

    We are actively working on this issue - in fact we are not working on anything else these days - finding ways to mitigate and solve it. We hope to find a permanent resolution soon, but a lot of the actual work is out of our hands, as this essentially falls under the purview of the Ubuntu kernel maintainers.

    We will of course update here once we know more

    Arni Johannesson
    CEO

  • In progress
    June 19, 2024 at 1:22 PM

    Quick update on this issue: Since we implemented crash collection code in order to assist the lxcfs developers, we have yet to see a single crash of this subsystem and everything has been running well. This is good news for our customers, who would otherwise be potentially impacted, but bad news for actually identifying the issue we were having. We are keeping this issue open for now, and our crash collection will stay in place until we see a crash of lxcfs so we can hopefully move forward.

  • Planned
    June 14, 2024 at 8:51 AM

    If you experience unexpected reboots of your VPS or are wondering about low uptime of your system, this is likely because we are currently being forced to reboot some VPS servers from time to time. This issue is related to the one we posted during our migration period, but it is not as critical as the issue we saw then, and it seems unrelated to the performance tweaks we made which resolved outright crashes of our hosts.

    This is a new issue in a subsystem we use for virtualization of container VPS instances, called lxcfs. The maintainers of lxcfs believe this is likely an issue in either the Linux Kernel or in another component they use in lxcfs called libfuse.

    Now, the issue is fortunately relatively rare - we are seeing this happen maybe once every 48 hours at this time - but it requires a reboot of affected VPS servers as once lxcfs crashes, there is no way (currently) to reattach lxcfs to running container VPS servers. Reboots mean 3-5 minutes of downtime for the affected VPS servers.

    The symptom of a crash is that your VPS suddenly has visibility of all resources on the host system and all CPU and memory activity being performed. Also, on some VPS servers, if you try to run htop you will get the message "Cannot open /proc/stat: Transport endpoint is not connected". This means lxcfs has crashed on the system. (A small check for this from inside a VPS is sketched after this incident's update timeline.)

    Functionally speaking, for most workloads like websites and the like, the crashes have little impact: all resource limitations for your VPS are still in place - you are just unable to run htop, and your server now "sees" incorrect CPU utilization and resource amounts.

    However, as a lot of you are using monitoring tools that watch server resource consumption, we cannot just ignore an lxcfs crash: a lot of you write support, rightly worried that something is wrong with your server (which there is).

    The systems most affected by this are container VPS servers running in Denmark, as they are (ironically enough) running on all the latest and greatest software (latest kernel version etc.), which seems to be the root cause, as we have not seen this issue on any of our systems in the past. This certainly looks like a bug introduced in the latest kernel, lxcfs or libfuse.

    We have adopted the strategy of performing reboots of affected VPS servers as soon as time permits once the issue becomes known to us. We are kind of stuck between a rock and a hard place here, as we get complaints if we leave systems without lxcfs running, and we also (naturally) get complaints if we've had to reboot VPS servers.

    We are working with the maintainers of lxcfs to identify and resolve the issue, but as things go in Open Source, we first have to identify the correct subsystem which is at fault, and then that subsystem will need to be patched or rolled back. If there is any way for us to directly patch or roll back the affected systems and not have to wait for e.g. a new kernel release, we will do so. If you want all the gory technical details and want to follow along on progress, you can take a look here: https://github.com/lxc/lxcfs/issues/644

    We hope this will be quickly resolved, but it's looking like this may unfortunately be the status quo for some time yet. We will keep this maintenance notification open for the duration and update here once new information becomes available to us. Thank you for your patience and we apologize for the inconvenience caused by this issue.

    Arni Johannesson
    CEO
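
To illustrate the symptom quoted above ("Cannot open /proc/stat: Transport endpoint is not connected"), here is a minimal sketch of how such a check could be done from inside a container VPS. This is not Webdock tooling; the list of files probed is an assumption based on what lxcfs normally serves.

```python
# Minimal sketch: detect the lxcfs crash symptom from inside a container VPS.
# The set of files checked here is an assumption; lxcfs normally serves a
# handful of /proc files, /proc/stat and /proc/uptime among them.
import errno

LXCFS_BACKED_FILES = ["/proc/stat", "/proc/uptime", "/proc/meminfo"]

def lxcfs_mount_healthy(path: str) -> bool:
    """Return False if reading `path` fails with ENOTCONN, the crash signature."""
    try:
        with open(path) as handle:
            handle.read()
        return True
    except OSError as exc:
        # errno.ENOTCONN is the error htop surfaces as
        # "Cannot open /proc/stat: Transport endpoint is not connected".
        if exc.errno == errno.ENOTCONN:
            return False
        raise

if __name__ == "__main__":
    broken = [path for path in LXCFS_BACKED_FILES if not lxcfs_mount_healthy(path)]
    if broken:
        print("lxcfs appears to have crashed; affected files:", ", ".join(broken))
    else:
        print("lxcfs-backed /proc files look healthy.")
```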
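
Similarly, for the ZFS ARC pressure described in the July 05 update, the following is an illustrative host-side check, not Webdock's actual monitoring: it reads the arcstats table that ZFS on Linux exposes under /proc and reports how full the ARC is relative to its configured maximum (the 95% threshold is an arbitrary example value).

```python
# Illustrative sketch: report the ZFS ARC fill level on a Linux host running ZFS.
ARCSTATS = "/proc/spl/kstat/zfs/arcstats"

def read_arcstats(path: str = ARCSTATS) -> dict:
    """Parse the kstat table into {name: value}, skipping the two header lines."""
    stats = {}
    with open(path) as handle:
        for line in list(handle)[2:]:
            name, _kind, value = line.split()
            stats[name] = int(value)
    return stats

if __name__ == "__main__":
    arc = read_arcstats()
    fill = arc["size"] / arc["c_max"]
    print(f"ARC is {fill:.0%} full ({arc['size']} of {arc['c_max']} bytes)")
    if fill > 0.95:
        print("ARC is near its ceiling; heavy reads are likely spilling to the disks.")
```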

One fiber down in DK DC, some impact was seen on 217.78.237.0/24
  • Resolved

    Through pure luck we have been operational almost all day today as we had not shifted our entire network from Finland yet. We are still unclear as to how much of our network is incorrectly configured with GlobalConnect but we will make sure everything is in order with them tomorrow.

    We have now learned a bit more about the nature of the incident. It seems they had a major fiber break on a backbone bundle in connection with some freeway work near Kolding. As the break was beneath a busy freeway, and given the size of the bundle, their repair work has taken a very long time.

    They now report that they should have our fiber up within the next 3 hours. As we are not operationally impacted and we know the timeframe for their fix, we are calling this issue resolved on our side.

  • Monitoring
    Update

    Our ISP GlobalConnect still has an outage in about half of their locations in Denmark. We believe this is shaping up to be one of the biggest outages this provider has ever had in the Nordics, at least given the scale of the affected area and the duration. We are still waiting to receive light on the affected fiber pair, but fortunately, due to our redundant setup, we are unaffected as of now and have no issues with connectivity. All customers and IP ranges are still operational. We will update here once we are in a redundant and normalized state again in the DK DC.

    In the meantime we have reached out to other ISPs in the region today and are looking into establishing more fiber to our facility for further redundancy. The incident today was a warning to us that despite having redundant, fully-diverse-route connections, we are still relying on the infrastructure (and BGP configuration) of a single upstream provider. We are working towards fixing that as soon as possible.

    In addition to this we have implemented further monitoring so we can catch partial outages sooner. This morning we did not realize at first that a single IP range was non-functional, as all the others were working, so it took us 20-30 minutes to see that we were indeed affected by the ISP outage. This time to react should be significantly reduced now that we have automated monitoring on all IP addresses on our network (a minimal sketch of that kind of per-prefix check follows after this incident's timeline).
  • Monitoring
    Update

    It seems our ISP incorrectly configured one of our ranges so that it was not being advertised properly to the internet in a redundant fashion. We still have a tunnel from our old Finland location, and we requested that Hetzner start advertising the prefix 217.78.237.0/24 again. They did so promptly and we now see connectivity again. The underlying issue is not yet fixed, however, so we will keep this incident open until the situation is fully resolved.

  • Monitoring
    Update

    It seems we are not completely unaffected after all: one of our IP ranges, 217.78.237.0/24, is being affected by the outage for unknown reasons. It should be routed the same as all other nets, but for some reason the partial ISP outage is affecting this net. This is impacting about 8% of our customers, so if you are one of the unlucky ones, rest assured we are working to resolve the issue.

  • Monitoring

    We saw a fiber connection to our DK DC lose light at about 8:40 CET. After speaking with our ISP, they report that some core equipment went down in their Kolding location. We of course have redundant fiber connections, so our other connections took over all traffic, and all we saw was a short period of some packet loss while traffic that was flowing through Kolding was redirected through our other fiber from that ISP. There appears to be no lasting impact on us at this time. The ISP is working on their issue and they say we should receive light again sometime today, so that we are fully redundant again. We will monitor the situation, but there is nothing we can do on our side at the moment except wait.
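
As referenced in the final update above, a per-prefix reachability check can be as simple as probing one representative address in every announced range. The sketch below is illustrative only and is not Webdock's monitoring; the choice of probe target and the use of the system ping command are assumptions, and 217.78.237.0/24 is the one prefix named in this incident.

```python
# Illustrative sketch: probe one address per announced prefix and report any
# prefix that stops answering, to catch partial outages quickly.
import ipaddress
import subprocess

# Only 217.78.237.0/24 is taken from the incident text; a real list would
# contain every announced prefix.
PREFIXES = ["217.78.237.0/24"]

def probe(address: str, timeout_s: int = 2) -> bool:
    """Send a single ICMP echo via the system ping (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def unreachable_prefixes(prefixes=PREFIXES) -> list:
    """Probe the first usable host of each prefix (assumed to answer ping)."""
    return [
        prefix
        for prefix in prefixes
        if not probe(str(next(ipaddress.ip_network(prefix).hosts())))
    ]

if __name__ == "__main__":
    failed = unreachable_prefixes()
    if failed:
        print("Unreachable prefixes:", ", ".join(failed))
    else:
        print("All monitored prefixes respond to ping.")
```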

Intermittent host instability causing downtime in Denmark - resolution forthcoming
  • Resolved

    We are now calling this infrastructure issue resolved, as we have now completed all migrations from Finland and we have not seen a virtualization crash since we implemented our fix on Saturday. We have been forced to do host restarts on two separate hosts in the same period, however, but this was due to those systems already being in a bad state from the previously bad config.

    We were planning to shift all of our IP ranges away from Finland and announce them directly in Denmark today, thus reducing latency by about 20 ms. However, due to the unrelated issue we saw today, where our ISP has a major outage in their Kolding DC which is affecting one of our fibers, and where it looks like they have incorrectly configured our IP ranges, we are postponing the changeover until our ISP has fixed everything on their side and our ranges are correctly configured with them.

    The changeover should have no noticeable impact on our customers, but we will post a maintenance notification here when we do the operation, just in case.

  • Monitoring

    During the day on Saturday, after continued investigation of the root causes of our problems, we found a simple caching parameter in our system setup which had been set to an incorrectly low value. After modifying this value across all of our hosts we saw an immediate drop in load across the board. The incorrect value was written by our base setup orchestration scripts and was a holdover from earlier testing.

    We are actually quite amazed at what a huge difference it made setting this caching parameter to a proper value.

    Ever since we modified this parameter, all systems have been green across the board and operating at fantastic efficiency. In fact, the infrastructure now seems to be performing as we had planned all along (if not better), and we don't have a single host breaking a sweat at this time.

    Our customers should be able to notice a clear difference now that this fix has been implemented. What's even better news is that since we implemented this fix, we have not seen a single crash or hang of our virtualization.

    It is too early to call the issue completely fixed, however, due to the relative infrequency of the crashes we saw before - but if everything runs stably for the next 48 hours or so, it really looks like a simple configuration tweak was all it took to resolve our (honestly quite major) issues.

    We will still proceed with our plan of moving certain high-resource VPS servers to a dedicated location Monday/Tuesday, and we are still postponing the last migrations from Finland until Tuesday. We want to be absolutely sure things are OK and that this isn't a "too good to be true" type of situation.

    But for now, all is well in our cloud. We couldn't be happier that we found this resolution and we hope you are too :)

    Thanks again for sticking with us here

    Arni Johannesson

    CEO

  • Identified

    Over the last few days, after having migrated about 90% of our workloads from Finland, we have identified an issue which is causing us a great deal of trouble. Essentially speaking, the new virtualization environment we have set up in Denmark has proven to be unstable when our host instances are under load.

    What we observe is that our virtual environment crashes and halts all processing. There is no common denominator except that it happens on specific hosts, all of which are ones where we have placed high-CPU and high-I/O users. The crashes do not leave any trace in any logs, and all we see is a halt state where a reboot of the host is required to bring up the customer workloads. Fortunately restarts are quick, and the crashes seem to happen no more than once or twice in any 48-hour period on the affected hosts. As infrequent as they are, these incidents are of course completely unacceptable.

    The only resolution to this issue is to deploy new hardware where we get rid of the virtualization components responsible for this behavior and migrate high-CPU and high-I/O users over to the new hardware. We have put in an emergency order with our hardware vendor this morning and we expect to receive and deploy the new hardware Monday. At that time, a small subset of our customers (it looks to be about 30-40 VPS instances at most causing these issues at this time) will be migrated to the new hardware in order to improve the overall stability of our cloud. If you see an unplanned migration notification Monday or Tuesday after already having been migrated to Denmark, this is the reason.

    Moving forward we will not be deploying workloads in the same virtual environment as we have set up now in Denmark, as there is no way for us to fix this issue with the virtualization. All we can do is get rid of it and return to a more direct-to-bare-metal approach, as we have traditionally done (and which has never caused us such problems in the past).

    We did test our new virtualization extensively and under load before deployment - and for at least 48 hours continuously for each system we deployed - but some subset of our customers are doing something "special" which we can't quite identify, and which puts our virtualization under some unique stress that causes it to hang from time to time. As it turns out, simulated load (stressing CPU, I/O and network) does not reflect real-world load closely enough for us to have caught this issue before we went to full-scale deployment in Denmark.

    As we don't know exactly what triggers this issue, we are of course worried that this will happen on otherwise (thus far) unaffected hosts if a customer starts performing some workload our virtualization doesn't like. If this turns out to be the case, we will likely be forced to migrate most if not all of our customers who are already in Denmark to hardware which is configured without the troublesome virtualization components. We will do what we can to avoid this, however, as besides being a huge amount of work and investment, it would mean another migration downtime period for our customers. If this turns out to be required, we will of course send out a news bulletin to all affected customers with details.

    We will of course monitor the situation closely over the weekend and respond as quickly as we can, around the clock, in order to bring customer instances back up if these hangs/crashes happen again. We are almost sure it will happen again, at least a couple of times in the next 2-3 days until we have this resolved, given the behavior we have seen these past two days.

    We sincerely apologize for these disruptions. The Denmark DC was supposed to be a fast (and happy) place for your VPS, but this issue caught us off guard. Rest assured we are doing everything we can to eliminate this, and will do so as soon as possible in the coming days.

    Thank you for sticking with us

    Arni Johannesson

    CEO Webdock
