Webdock - Notice history

Canada: General Infrastructure under maintenance

Denmark: Network Infrastructure - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 99.97% · Jun 2024: 99.84%

Denmark: Storage Backend - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Denmark: General Infrastructure - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 99.90%

Canada: Network Infrastructure - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Canada: Storage Backend - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Canada: General Infrastructure - Under maintenance

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Webdock Statistics Server - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Webdock Dashboard - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Webdock Website - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Webdock Image Server - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%

Webdock REST API - Operational

100% - uptime
Apr 2024: 100.0% · May 2024: 100.0% · Jun 2024: 100.0%
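
To put the monthly percentages above into perspective, they can be converted into minutes of downtime. This is only a rough sketch: it assumes each percentage is measured over the full calendar month, which the status page does not state, and the function name is made up for illustration.

```python
import calendar


def downtime_minutes(uptime_percent: float, year: int, month: int) -> float:
    """Convert a monthly uptime percentage into minutes of downtime.

    Assumption: the percentage covers the whole calendar month.
    """
    days = calendar.monthrange(year, month)[1]  # number of days in that month
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


if __name__ == "__main__":
    # Figures taken from the "Denmark: Network Infrastructure" row.
    for label, pct, month in [("May 2024", 99.97, 5), ("Jun 2024", 99.84, 6)]:
        minutes = downtime_minutes(pct, 2024, month)
        print(f"{label}: {pct}% uptime is roughly {minutes:.1f} minutes of downtime")
```

Under that assumption, 99.84% in a 30-day month works out to a little over an hour of downtime.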

Notice history

Jun 2024

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required
  • Completed
    August 14, 2024 at 12:33 PM

    We have now at long last completed all maintenance relating to our kernel issues, lxcfs issues and the migration to the new hypervisor. All customer instances and systems in Denmark have been upgraded and are running. We are now confident enough in the new setup that we are closing this far too long-running issue with our platform in the DK DC. Thank you all for your patience with us over the last month or so while we have been working on resolving this issue.

    Have a great day, and if you encounter any problems don't hesitate to reach out to our support.

    Arni Johannesson, CEO Webdock

  • Update
    August 12, 2024 at 7:44 AM
    In progress

    Due to some last-minute changes we need to make, we are postponing the final round of maintenance connected to this issue to Wednesday, August 14th, between 09.00 and 14.00 CEST. All affected customers will receive an email to this effect today. A small subset of customers will be migrated to other hosts instead of going through the upgrade/reboot procedure. In both cases the downtime for your VPS will be brief, roughly 5-15 minutes.

    Based on our experience with the earlier maintenance, which has now given us a full week of runtime with the new setup, the indications are extremely positive. We have not had a single crash or issue related to our earlier problems. It really seems we have finally found the right cocktail of kernel version, ZFS version and hypervisor settings to put us in the stable state we were supposed to be in all along.

    We thank you for your continued patience, and we look forward to finally closing this issue soon.

    Arni Johannesson
    CEO Webdock

  • Update
    August 05, 2024 at 12:01 PM
    In progress

    The maintenance today has gone very well. We have a single system where we have opted to perform migrations of affected customers instead, which will happen in a maintenance window tomorrow.

    We will now watch the upgraded systems carefully for about a week and hopefully perform the upgrades on the remainder of our infrastructure at the beginning of next week. We will update here once we have scheduled that maintenance window. We thank you yet again for your patience.

  • Update
    August 02, 2024 at 9:08 AM
    In progress

    We are now ready to start the first round of upgrades and conversion to the new hypervisor variant on Monday the 5th of August. We will be performing maintenance on a number of systems, all of which are in some bad state or another due to the crashing lxcfs. The maintenance will take place between 09.00 and 13.00 CEST and you can expect your VPS to go down for up to 20 minutes. In some cases downtime will be as low as 5 minutes, but it depends on the type of system your VPS is located on.

    Once this round of maintenance has been completed, we will watch all systems very closely for about 1 week, before performing the maintenance on the rest of our fleet.

    We are very hopeful that these changes we will be performing will greatly mitigate, if not completely solve, the issues we have been having with crashes and virtualization problems.

    We thank you for your continued patience, and we hope this will be the last disruption you will see connected with the new infrastructure.

  • Update
    July 31, 2024 at 6:30 AM
    In progress

    Update: We are now getting closer to working out a procedure which should bring us to a resolution of this issue. We are working out the details currently and doing quite a bit of testing as we do not want to launch something half-baked which will cause further issues. We will soon perform maintenance on a few hosts, after which we will observe behavior for some days. If all goes well as we expect, we will roll out all the fixes/changes to the remainder of our infrastructure, presumably in one or two large maintenance windows, where we fix almost all our hosts in one go. The remediation we will be performing:

    1. Downgrading our kernels to a "last known good kernel" as reported by others in the community. This will not be a terribly old kernel, and it is an LTS kernel which will receive all patches and security fixes for some years still, so this gives us plenty of time to wait for a fix in the latest kernels.

    2. We are switching to the latest zfs-dkms built by our friends at Zabbly, which lets us get the latest version, with fixes for our particular zfs crashes cherry-picked, as soon as they become available.

    3. We are migrating our entire infrastructure over to Incus, which is the truly open source fork of our hypervisor. This will bring many benefits, but the primary one is that it will enable us to run lxcfs on a per-instance basis and not globally for a host. This means that IF the crashing lxcfs issue happens again, it will only affect the one VPS where it crashed and not all customers on a system. Furthermore, remediation will be easy as it will only require a reboot of that one VPS, which customers can even do themselves.

    4. Lastly, we will be redoing some CPU pinning in our virtual machines in order to optimize resources, as we discovered we had incorrect pinning and thus were wasting some potential performance on a number of our hosts.

    This is a complicated maneuver but we have reduced the complexity by having built automated scripts for most of these steps, and this will only require a single reboot. However, since so many steps need to be completed, downtime will be a bit longer than usual, presumably up to 20 minutes or so.
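
    As an illustration of the CPU pinning point in step 4, the sketch below audits a pinning layout for cores that no instance is pinned to and cores that several instances share. The update does not say what was actually wrong with the pinning, so the function name, the instance names and the layout are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def audit_pinning(
    host_cpus: Set[int], pinning: Dict[str, Set[int]]
) -> Tuple[Set[int], Dict[int, List[str]]]:
    """Return (cores nobody is pinned to, cores pinned by more than one instance)."""
    owners: Dict[int, List[str]] = {}
    for name, cpus in pinning.items():
        for cpu in cpus:
            owners.setdefault(cpu, []).append(name)
    unused = host_cpus - set(owners)  # capacity left idle by the layout
    shared = {cpu: sorted(names) for cpu, names in owners.items() if len(names) > 1}
    return unused, shared


if __name__ == "__main__":
    # Hypothetical 8-core host with two overlapping virtual machines.
    unused, shared = audit_pinning(
        set(range(8)),
        {"vm-a": {0, 1, 2, 3}, "vm-b": {2, 3, 4, 5}},
    )
    print("unused cores:", sorted(unused))
    print("shared cores:", shared)
```

    In this made-up layout two cores sit idle while two others are contended, which is one way pinning can waste performance.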

  • Update
    July 23, 2024 at 7:09 AM
    In progress

    Progress update from the Webdock team on this issue: Our remediation efforts have progressed as planned but have unfortunately not produced the desired result. Despite having upgraded affected systems to a much cleaner upstream kernel build and newer zfs filesystem components, there is still some memory corruption happening which crashes lxcfs. We now suspect the issue may be isolated in libfuse, but we cannot be certain at this time.

    We have not seen any wholesale kernel crashes since the upgrades, but that issue cannot be considered closed either until we have had a bit more time without incident.

    In order to effectively solve the lxcfs issue, we have through extensive dialogue with the creator of our hypervisor decided to try and adopt a completely new and cutting edge feature which doesn't even exist yet: per-instance lxcfs virtualization. This essentially means that each VPS server gets its own lxcfs virtualization instead of it being global for the host system. This means that if an lxcfs instance crashes, it will only affect a single customer vps, and where it is easily remediated by just rebooting that one vps - instead of having to reboot ALL customer vps servers on a host.

    As mentioned this feature doesn't exist yet. It exists in lxcfs as a custom feature built for Netflix as they did not want their containerized workloads to have lxcfs crash across entire hosts but it actually hasn't been implemented in the hypervisor, LXD, yet. For this reason we have decided to sponsor the work of the fork of LXD called Incus for which the creator and former maintainer of LXD is responsible. He is currently working on implementing the feature which should be ready and tested within a few days.

    After that, Webdock will proceed to migrate from LXD to Incus in a rolling fashion across our entire infrastructure in Denmark. This will obviously mean yet another downtime event for almost all our customers (only KVM customers are not affected by this) - but after the upgrade we should be in a state where 99.99% of our customers will no longer be affected by this issue and we should see greater stability throughout.

    We will update here once we start the procedure, at first only on a couple of hosts where we will then watch behavior and performance for some days before moving on to the rest of our infrastructure.

    We thank you all for your continued patience

    Arni Johannesson, CEO Webdock

  • Update
    July 16, 2024 at 9:25 AM
    In progress

    We have now rolled out the supposed fix for our issues to all systems which were already affected by the crash. If these systems perform well over the next week or so, then we will roll out the fix to remaining systems whenever the problem occurs, but no sooner, in order to minimize disruption for our customers. We will update this issue once we know more and have had some time to observe and work with the new kernel and zfs packages.

  • Update
    July 12, 2024 at 9:25 AM
    In progress

    A quick status update on this issue from the Webdock team:

    The main symptom of the current issues is the continued crashing of lxcfs on our host systems. This causes a range of issues, primarily that top/htop will not work in your vps and that you see the resource utilization of the entire host server you are on.

    The main headache we are facing is that lxcfs is a global component on each host where if we restart it we have to restart all customer vps servers. This is obviously not great, as then everybody on that host has some downtime of a few minutes.

    So, instead of restarting everything willy-nilly which fixes things until the next crash of lxcfs, we are actively working with people at Canonical and the creator of our virtualization (LXD), who are absolute experts in the field as to how we can resolve the situation, so that hopefully we will "only" need a single restart of our hosts, and we will be in a good state after that.

    We are still seeing the occasional Kernel crash, but they are fortunately few and far between after we balanced our infrastructure and tweaked some knobs and dials in order to reduce the chance of a kernel crash as much as we could.

    The basic issue in the Ubuntu kernel has been neither identified nor fixed yet, and the people we speak to say that it could take months for an official fix from Canonical.

    For this reason, we now have a roadmap for how to mitigate this issue, but it's rather technical and requires a lot of testing. We hope to complete this testing this week, and then we will do a rolling update of all host systems (with a restart for everybody...) in order to bring us up to a cleaner, newer kernel version which should (hopefully) be free of the bugs we are seeing. This is the recommended course of action by the experts we have been in dialogue with.

    This whole undertaking is a massive task, and we are hard at work on this.

    We thank you for your continued patience with us as we work to resolve these issues

    Arni Johannesson, CEO

  • Update
    July 05, 2024 at 7:30 AM
    In progress

    Unfortunately this issue is persisting on our infrastructure. We have now gotten as far as determining the most likely core cause of this issue in cooperation with the lxcfs developer who has been assisting us. Unfortunately we have also started seeing some wholesale kernel crashes again, which was otherwise an issue we believed had gone away.

    Without going into too much detail, the issue seems to be in the latest Ubuntu kernel and the ZFS kernel modules. What happens is that on systems with large disk read workloads, if the ZFS ARC RAM cache fills up so that ZFS starts hitting the disk drives a lot, this triggers some bug which can cause random memory corruption in the Linux kernel. This in turn can cause interesting (bad) side-effects seemingly at random, such as a kernel crash requiring a reboot of the host system.
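
    For readers who want to see how full the ARC is on their own ZFS-on-Linux system, here is a minimal sketch. It assumes the OpenZFS statistics file at /proc/spl/kstat/zfs/arcstats with its usual "name type data" rows; the function names are illustrative, and this is not the tooling Webdock used.

```python
from pathlib import Path
from typing import Dict

# Standard location of ARC statistics on Linux with OpenZFS loaded.
ARCSTATS_PATH = Path("/proc/spl/kstat/zfs/arcstats")


def parse_arcstats(text: str) -> Dict[str, int]:
    """Parse 'name type data' rows; header rows are skipped automatically."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_pressure(stats: Dict[str, int]) -> Dict[str, float]:
    """Return how full the ARC is and how often reads are served from it."""
    lookups = stats["hits"] + stats["misses"]
    return {
        "fill_ratio": stats["size"] / stats["c_max"],
        "hit_ratio": stats["hits"] / lookups if lookups else 1.0,
    }


if __name__ == "__main__":
    if ARCSTATS_PATH.exists():
        print(arc_pressure(parse_arcstats(ARCSTATS_PATH.read_text())))
    else:
        print("No ZFS ARC statistics found on this system")
```

    A fill ratio near 1.0 together with a falling hit ratio would match the situation described above, where ZFS starts going to disk a lot.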

    We are actively working on this issue - actually we are not working on anything else these days - finding ways to mitigate and solve this issue. We hope to find a permanent resolution soon, but a lot of the actual work is out of our hands as this essentially speaking falls under the purview of the Ubuntu kernel maintainers.

    We will of course update here once we know more

    Arni Johannesson
    CEO

  • In progress
    June 19, 2024 at 1:22 PM

    Quick update on this issue: Since we implemented crash collection code in order to assist the lxcfs developers, we have yet to see a single crash of this subsystem and everything has been running well. This is good news for our customers who'd otherwise be potentially impacted, but bad news for actually identifying the issue we were having. We are keeping this issue open for now and are keeping our crash collection in place until we see a crash of lxcfs so we can hopefully move forward on this issue.

  • Planned
    June 14, 2024 at 8:51 AM

    If you experience out-of-order reboots of your VPS or are wondering about low uptime of your system, this is likely due to us currently being forced to reboot some VPS servers from time to time. This issue is related to the one we posted during our migration period, but not as critical as the issue we saw then and seems unrelated to the performance tweaks we did which resolved outright crashes of our hosts.

    This is a new issue in a subsystem we use for virtualization of container VPS instances, called lxcfs. The maintainers of lxcfs believe this is likely an issue in either the Linux Kernel or in another component they use in lxcfs called libfuse.

    Now, the issue is fortunately relatively rare - we are seeing this happen maybe once every 48 hours at this time - but it requires a reboot of affected VPS servers as once lxcfs crashes, there is no way (currently) to reattach lxcfs to running container VPS servers. Reboots mean 3-5 minutes of downtime for the affected VPS servers.

    The symptom of a crash is that your VPS will suddenly have visibility of all resources on the host system and of all CPU and memory activity being performed there. Also, in some VPS servers, if you try to run htop you will get the message "Cannot open /proc/stat: Transport endpoint is not connected". This means lxcfs has crashed on the system.
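
    The error message quoted above is the standard text for the ENOTCONN error code on Linux, so a monitoring script can check for it directly. The following is a small sketch of such a check, not an official Webdock tool; the function names are made up for illustration.

```python
import errno


def classify_error(exc: OSError) -> str:
    """Map an OSError from reading /proc to a short status string."""
    if exc.errno == errno.ENOTCONN:
        # "Transport endpoint is not connected": the FUSE mount behind the file is gone.
        return "disconnected"
    return f"error: {exc.strerror}"


def proc_stat_status(path: str = "/proc/stat") -> str:
    """Try to read one byte from the given file and report what happened."""
    try:
        with open(path, "rb") as handle:
            handle.read(1)
    except OSError as exc:
        return classify_error(exc)
    return "ok"


if __name__ == "__main__":
    print(proc_stat_status())
```

    A result of "disconnected" corresponds to the htop message; note that not every VPS shows this error after a crash, so "ok" does not prove lxcfs is healthy.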

    Functionally speaking, for most workloads like websites and the like, the crashes have little impact: all resource limitations for your VPS are still in place - you are just unable to run htop and your server now "sees" the incorrect utilization of CPU and amount of resources.

    However, as a lot of you are using monitoring tools that look at server resource consumption, we cannot just ignore an lxcfs crash: a lot of you write to support, rightly worried that something is wrong with your server (which there is).

    The systems most affected by this are container VPS servers running in Denmark, as they are (ironically enough) running on all the latest and greatest software (latest kernel version etc.) - which seems to be the root cause as we have not seen this issue before on any of our systems in the past. This certainly seems like a bug introduced in the latest kernel, lxcfs or libfuse.

    We have adopted the strategy of performing reboots of affected VPS servers as soon as time permits, as the issue becomes known to us. We are kind of stuck between a rock and a hard place here, as we get complaints if we leave systems without lxcfs running and we also (naturally) get complaints if we've had to reboot VPS servers.

    We are working with the maintainers of lxcfs to identify and resolve the issue, but as things go in Open Source, we first have to identify the correct subsystem which is at fault and then that system will need to be patched or rolled back. If there is any way for us to directly patch/rollback the affected systems and not have to wait for e.g. a new kernel release, we will do so. If you want all the gory technical details and follow along on progress, you can take a look here: https://github.com/lxc/lxcfs/issues/644

    We hope this will be quickly resolved, but it's looking like this may unfortunately be the status quo for some time yet. We will keep this maintenance notification open for the duration and update here once new information becomes available to us. Thank you for your patience and we apologize for the inconvenience caused by this issue.

    Arni Johannesson
    CEO

One fiber down in DK DC, some impact was seen on 217.78.237.0/24
  • Resolved

    Through pure luck we have been operational almost all day today as we had not shifted our entire network from Finland yet. We are still unclear as to how much of our network is incorrectly configured with GlobalConnect but we will make sure everything is in order with them tomorrow.

    We have now learned a bit more about the nature of the incident. It seems they had a major fiber break on a backbone bundle in connection with some freeway work near Kolding. Because the break was beneath a busy freeway and the bundle is large, their repair work has taken a very long time.

    They now report they should have our fiber up within the next 3 hours. As we are not operationally impacted and we know the timeframe for their fix, we are calling this issue resolved on our side.

  • Monitoring
    Update

    Our ISP GlobalConnect still has an outage in about half of their locations in Denmark. At this point this looks like one of the biggest outages this provider has ever had in the Nordics, at least given the scale of the affected area and the duration. We are still waiting to receive light on the affected fiber pair, but fortunately, due to our redundant setup, we are unaffected as of now and have no issues with connectivity. All customers and IP ranges are still operational. We will update here once we are in a redundant and normalized state again in the DK DC.

    In the meantime we have reached out to other ISPs in the region today and are looking into establishing more fiber to our facility for further redundancy. The incident today was a warning to us that despite having redundant, fully-diverse-route connections, we are still relying on the infrastructure (and BGP configuration) of a single upstream provider. We are working towards fixing that as soon as possible.

    In addition to this we have implemented further monitoring so we can catch partial outages sooner. This morning we did not realize at first that a single IP range was non-functional, as all the others were working, so it took us 20-30 minutes to realize that we were indeed affected by the ISP outage. This time to react should be significantly reduced now that we have automated monitoring on all IP addresses on our network.
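
    The per-range monitoring described above could be sketched roughly as follows. This is not Webdock's monitoring code: the probe (a TCP connect to port 22), the sample host addresses and the second prefix are assumptions chosen for illustration; only 217.78.237.0/24 comes from this notice.

```python
import ipaddress
import socket
from typing import Callable, Dict, List


def tcp_probe(address: str, port: int = 22, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to address:port succeeds."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        return False


def down_prefixes(
    targets: Dict[str, List[str]], probe: Callable[[str], bool] = tcp_probe
) -> List[str]:
    """Return every prefix in which all sampled hosts fail the probe."""
    down: List[str] = []
    for prefix, hosts in targets.items():
        network = ipaddress.ip_network(prefix)
        # Ignore sample hosts that were listed under the wrong prefix.
        members = [h for h in hosts if ipaddress.ip_address(h) in network]
        if members and not any(probe(h) for h in members):
            down.append(prefix)
    return down


if __name__ == "__main__":
    # Hypothetical sample hosts; 192.0.2.0/24 is a documentation-only range.
    targets = {
        "217.78.237.0/24": ["217.78.237.10", "217.78.237.20"],
        "192.0.2.0/24": ["192.0.2.5"],
    }
    print("Prefixes with no reachable hosts:", down_prefixes(targets))
```

    Grouping results by prefix rather than by host is what would surface a case like this one, where a single range goes dark while every other range stays reachable.
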
  • Monitoring
    Update

    It seems our ISP incorrectly configured one of our ranges so that it was not being advertised properly to the internet in a redundant fashion. We still have a tunnel from our old Finland location and we requested Hetzner to start advertising this prefix 217.78.237.0/24 again. They did so promptly and we now see connectivity again. The underlying issue is not fixed yet however so we will keep this incident open until the situation is fully resolved.

  • Monitoring
    Update

    It seems we are not completely unaffected after all: one of our IP ranges, 217.78.237.0/24, is being affected by the outage for unknown reasons. It should be routed the same as all our other nets, but for some reason the partial ISP outage is affecting this net. This is impacting about 8% of our customers, so if you are one of the unlucky ones, rest assured we are working to resolve this issue.

  • Monitoring

    We saw a fiber connection to our DK DC lose light at about 8.40 CET. After speaking with our ISP, they report that some core equipment went down in their Kolding location. We of course have redundant fiber connections, so our other connections took over all traffic, and all we saw was a short period of some packet loss while any traffic that was flowing through Kolding was redirected through our other fiber from that ISP. There is no lasting impact on us at this time, it seems. The ISP is working on their issue and they say we should receive light again sometime today so that we are fully redundant again. We will monitor the situation, but there is nothing we can do on our side at the moment except wait.

Intermittent host instability causing downtime in Denmark - resolution forthcoming
  • Resolved

    We are now calling this infrastructure issue resolved, as we have completed all migrations from Finland and we have not seen a virtualization crash since we implemented our fix on Saturday. We have been forced to do host restarts on two separate hosts in the same period, however, but this was due to the systems already being in a bad state from the previously bad config.

    We had planned to shift all of our IP ranges away from Finland today so they would be announced directly in Denmark, reducing latency by about 20ms. However, due to the unrelated issue we saw today, where our ISP has a major outage in their Kolding DC that affects one of our fibers and where it looks like they have incorrectly configured our IP ranges, we are postponing the changeover until our ISP has fixed everything on their side and our ranges are correctly configured with them.

    The changeover should have no noticeable impact on our customers, but we will post a maintenance notification here when we do the operation, just in case.

  • Monitoring

    During the day Saturday after continued investigations of the root causes of our problems, we found a simple caching parameter in our system setup which had been set to an incorrectly low value. After modifying this value across all of our hosts we saw an immediate drop in load across the board. The incorrect value was written by our base setup orchestration scripts and was a holdover from earlier testing.

    We are actually quite amazed at what a huge difference it made setting this caching parameter to a proper value.

    Ever since we modified this parameter all systems have been green across the board and operating at a fantastic efficiency. In fact, the infrastructure seems to be performing now as we had planned all along (if not better) and we don't have a single host breaking a sweat at this time.

    Our customers should be able to notice a clear difference now that this fix has been implemented. What's even better news is that since we implemented this fix, we have not seen a single crash or hang of our virtualization.

    It is too early to call the issue completely fixed however due to the relative infrequency of crashes we saw before - but if everything runs stable for the next 48 hours or so, it's really looking like a simple configuration tweak was all that it took to resolve our (honestly quite major) issues.

    We will still proceed with our plan of moving certain high-resource VPS instances to a dedicated location on Monday/Tuesday, and we are still postponing the last migrations from Finland until Tuesday. We want to be absolutely sure things are OK and that this isn't a "too good to be true" type of situation.

    But for now, all is well in our cloud. We couldn't be happier that we found this resolution and we hope you are too :)

    Thanks again for sticking with us here

    Arni Johannesson

    CEO

  • Identified

    Over the last few days, after having migrated about 90% of our workloads from Finland, we have identified an issue which is causing us a great deal of trouble. Essentially speaking, the new virtualization environment we have set up in Denmark has proven to be unstable when our host instances are under load.

    What we observe is that our virtual environment crashes and halts all processing. There is no common denominator except it's happening on specific hosts and all of these hosts are ones where we have placed high cpu and high i/o users. The crashes do not leave any trace in any logs and all we see is just a halt state where a reboot of the host is required to bring up the customer workloads. Fortunately restarts are quick and the crashes seem to not happen more than once or twice in any 48 hour period on the affected hosts. As infrequent as they are, these incidents are of course completely unacceptable.

    The only resolution to this issue is to deploy new hardware where we get rid of the virtualization components responsible for this behavior, and to migrate high CPU and I/O users over to the new hardware. We have put in an emergency order with our hardware vendor this morning and we expect to receive and deploy the new hardware on Monday. At that time a small subset of our customers (it looks to be about 30-40 VPS instances at most causing these issues at this time) will be migrated to the new hardware in order to improve the overall stability of our cloud. If you see an unplanned migration notification on Monday or Tuesday after already being migrated to Denmark, this is the reason.

    Moving forward we will not be deploying workloads in the same virtual environment as we have set up now in Denmark, as there is no way for us to fix this issue with the virtualization. All we can do is get rid of it and return to a more direct-to-bare-metal approach as we have traditionally done (and which has never caused us such problems in the past).

    We did test our new virtualization extensively and under load before deployment - and for at least 48 hours continuously for each system we deployed - but some subset of our customers are doing something "special" which we can't quite identify, and which puts our virtualization under some unique stress that causes it to hang from time to time. As it turns out, simulated load (stressing CPU, I/O and network) does not reflect real-world load closely enough for us to have caught this issue before we went to full-scale deployment in Denmark.

    As we don't know exactly what triggers this issue, we are of course worried this will happen on otherwise (thus far) unaffected hosts if a customer starts performing some workload our virtualization doesn't like. If this turns out to be the case, we will likely be forced to migrate most if not all of our customers who are already in Denmark to hardware which is configured without the troublesome virtualization components. We will do what we can to avoid this, however, as besides being a huge amount of work and investment, it would mean another migration downtime period for our customers. If this turns out to be required, we will of course send out a news bulletin to all affected customers with details.

    We will of course monitor the situation closely over the weekend and respond as quickly as we can around the clock in order to bring back up customer instances if these hangs/crashes happen again. We are almost sure it will happen again, at least a couple of times in the next 2-3 days until we have this resolved, given the behavior we have seen these past two days.

    We sincerely apologize for these disruptions. Denmark DC was supposed to be a fast (and happy) place for your vps, but this issue caught us off guard. Rest assured we are doing everything we can to eliminate this and will do so as soon as is possible in the coming days.

    Thank you for sticking with us

    Arni Johannesson

    CEO Webdock

May 2024

Apr 2024

Host node outage in Finland
  • Resolved

    All customer VPS and block storage data has now been migrated and activated. We once again apologize for the long duration of this incident for some of you today.

  • Monitoring
    Update

    All customers have now been migrated. We have yet to sync some data (such as snapshots) and add-on block storage for a subset of customers. Some customers may see a bit more downtime later in the day as we activate block storage, but that is yet to be determined.

  • Monitoring

    Migrations are completing as planned. We apologize for the long wait some of you may be experiencing. If you want your server prioritized for faster recovery, please write support.

  • Identified
    Update

    We are restoring all customer servers on different nodes in Finland. Your server will come up soon from the latest snapshot available. If you want your server to be prioritized, write support and we will bump it in front of the queue. We expect the operation to complete fully within an hour or two. Thank you for your patience during this incident today.

  • Identified

    The hardware failure looks to be critical. We are evaluating next steps and it looks like we may need to restore all customer VPS on that host elsewhere on our infrastructure from the latest snapshot.
