Webdock - Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required – Maintenance details

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required

Completed
Scheduled for June 14, 2024 at 8:51 AM – August 14, 2024 at 12:33 PM

Affects

Denmark: General Infrastructure

Under maintenance from June 14, 2024 at 8:51 AM to August 14, 2024 at 12:33 PM

Updates
  • Completed
    August 14, 2024 at 12:33 PM

    We have now, at long last, completed all maintenance relating to our kernel issues, lxcfs issues and the migration to the new hypervisor. All customer instances and systems in Denmark have been upgraded and are running. We are now confident enough in the new setup that we are closing this long-running issue with our platform in the DK DC. Thank you all for your patience with us over the last month or so while we have been working on resolving this issue.

    Have a great day, and if you encounter any problems, don't hesitate to reach out to our support.

    Arni Johannesson, CEO Webdock

  • Update
    In progress
    August 12, 2024 at 7:44 AM

    Due to last-minute changes we need to make, we are postponing the final round of maintenance connected to this issue to Wednesday, August 14th between 09.00 and 14.00 CEST. All affected customers will receive an email to this effect today. A small subset of customers will be migrated to other hosts instead of going through the upgrade/reboot procedure. In both cases the downtime for your VPS will be brief, roughly 5-15 minutes.

    Based on our experience with the earlier maintenance, which has now given us a full week of runtime with the new setup, the indications are extremely positive. We have not had a single crash or issue related to our earlier problems. It really seems we have finally found the correct cocktail of kernel version, ZFS version and hypervisor settings that allows us to be in the stable state we were supposed to be in all along.

    We thank you for your continued patience, and we look forward to finally closing this issue soon.

    Arni Johannesson
    CEO Webdock

  • Update
    In progress
    August 05, 2024 at 12:01 PM

    The maintenance today has gone very well. There is a single system where we have opted to migrate affected customers instead; this will happen in a maintenance window tomorrow.

    Now we will watch the upgraded systems carefully for about a week and hopefully perform the upgrades on the remainder of our infrastructure at the beginning of next week. We will update here once we have set this upcoming maintenance window. We thank you yet again for your patience.

  • Update
    In progress
    August 02, 2024 at 9:08 AM

    We are now ready to start the first round of upgrades and conversion to the new hypervisor variant on Monday the 5th of August. We will be performing maintenance on a number of systems, all of which are in one bad state or another due to the crashing lxcfs. The maintenance will take place between 09.00 and 13.00 CEST, and you can expect your VPS to go down for up to 20 minutes. In some cases downtime will be as low as 5 minutes, but it depends on the type of system your VPS is located on.

    Once this round of maintenance has been completed, we will watch all systems very closely for about 1 week, before performing the maintenance on the rest of our fleet.

    We are very hopeful that these changes we will be performing will greatly mitigate, if not completely solve, the issues we have been having with crashes and virtualization problems.

    We thank you for your continued patience, and we hope this will be the last disruption you will see connected with the new infrastructure.

  • Update
    In progress
    July 31, 2024 at 6:30 AM

    We are now getting closer to working out a procedure which should bring us to a resolution of this issue. We are working out the details and doing quite a bit of testing, as we do not want to launch something half-baked which would cause further issues. We will soon perform maintenance on a few hosts, after which we will observe behavior for some days. If all goes as well as we expect, we will roll out all the fixes/changes to the remainder of our infrastructure, presumably in one or two large maintenance windows where we fix almost all our hosts in one go. The remediation we will be performing:

    1. Downgrading our kernels to a "last known good kernel" as reported by others in the community. This is not a terribly old kernel, and it is an LTS kernel which will still receive patches and security fixes for some years, so this gives us plenty of time to wait for a fix in the latest kernels.

    2. We are switching to the latest zfs-dkms, built by our friends at Zabbly, so that we can get the latest version with fixes for our particular ZFS crashes cherry-picked as soon as they become available.

    3. We are migrating our entire infrastructure over to Incus, the truly open source fork of our hypervisor. This will bring many benefits, but the primary one is that it will enable us to run lxcfs on a per-instance basis rather than globally for a host. This means that IF the crashing lxcfs issue happens again, it will only affect the one VPS where it crashed and not all customers on a system. Furthermore, remediation will be easy, as it will only require a reboot of that one VPS, which the customer can even do themselves.

    4. Lastly, we will be redoing some CPU pinning in our virtual machines in order to optimize resources, as we discovered we had incorrect pinning and thus were wasting some potential performance on a number of our hosts.

    This is a complicated maneuver, but we have reduced the complexity by building automated scripts for most of these steps, and it will only require a single reboot. However, since so many steps need to be completed, downtime will be a bit longer than usual, presumably up to 20 minutes or so.
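
    To make steps 1 and 2 above a bit more concrete, here is a small, purely illustrative Python sketch of the kind of post-maintenance check one could run on a host. The version prefixes are hypothetical placeholders, not the actual versions we will be deploying, and this is not our internal tooling.

    ```python
    # Illustrative sketch only: verify that a host came back on the intended kernel
    # and ZFS module after maintenance. The EXPECTED_* prefixes are hypothetical.
    import platform
    from pathlib import Path

    EXPECTED_KERNEL_PREFIX = "6.1."   # hypothetical "last known good" LTS series
    EXPECTED_ZFS_PREFIX = "2.2."      # hypothetical zfs-dkms series

    kernel = platform.release()
    zfs_path = Path("/sys/module/zfs/version")   # exposed by ZFS-on-Linux when the module is loaded
    zfs = zfs_path.read_text().strip() if zfs_path.exists() else "zfs module not loaded"

    print(f"running kernel : {kernel}")
    print(f"zfs module     : {zfs}")

    if not kernel.startswith(EXPECTED_KERNEL_PREFIX):
        print("WARNING: kernel does not match the pinned LTS series")
    if not zfs.startswith(EXPECTED_ZFS_PREFIX):
        print("WARNING: zfs module does not match the expected dkms build")
    ```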

  • Update
    In progress
    July 23, 2024 at 7:09 AM

    Progress update from the Webdock team on this issue: our remediation efforts have progressed as planned but have unfortunately not produced the desired result. Despite having upgraded affected systems to a much cleaner upstream kernel build and newer ZFS filesystem components, there is still some memory corruption happening which crashes lxcfs. We now suspect the issue may be isolated to libfuse, but we cannot be certain at this time.

    We have not seen any wholesale kernel crashes since the upgrades, but that issue cannot be considered closed either until we have had a bit more time without incident.

    In order to effectively solve the lxcfs issue, we have, through extensive dialogue with the creator of our hypervisor, decided to try and adopt a completely new and cutting-edge feature which doesn't even exist yet: per-instance lxcfs virtualization. This essentially means that each VPS server gets its own lxcfs instance instead of one that is global for the host system. If an lxcfs instance crashes, it will only affect a single customer VPS, where it is easily remediated by just rebooting that one VPS - instead of having to reboot ALL customer VPS servers on a host.

    As mentioned, this feature doesn't exist yet. It exists in lxcfs itself as a custom feature built for Netflix, who did not want their containerized workloads to have lxcfs crash across entire hosts, but it hasn't actually been implemented in the hypervisor, LXD, yet. For this reason we have decided to sponsor work on Incus, the fork of LXD maintained by the creator and former maintainer of LXD. He is currently working on implementing the feature, which should be ready and tested within a few days.

    Once that is done, Webdock will proceed to migrate from LXD to Incus in a rolling fashion across our entire infrastructure in Denmark. This will obviously mean yet another downtime event for almost all our customers (only KVM customers are not affected by this), but after the upgrade we should be in a state where 99.99% of our customers will no longer be affected by this issue, and we should see greater stability throughout.

    We will update here once we start the procedure, at first only on a couple of hosts where we will then watch behavior and performance for some days before moving on to the rest of our infrastructure.

    We thank you all for your continued patience

    Arni Johannesson, CEO Webdock

  • Update
    In progress
    July 16, 2024 at 9:25 AM

    We have now rolled out the supposed fix for our issues to all systems which were already affected by the crash. If these systems perform well over the next week or so, then we will roll out the fix to remaining systems whenever the problem occurs, but no sooner, in order to minimize disruption for our customers. We will update this issue once we know more and have had some time to observe and work with the new kernel and zfs packages.

  • Update
    In progress
    July 12, 2024 at 9:25 AM

    A quick status update on this issue from the Webdock team:

    The main symptom of the current issues is the continued crashing of lxcfs on our host systems. This causes a range of issues, primarily that top/htop will not work in your VPS and that your VPS sees the resource utilization of the entire host server it is on.

    The main headache we are facing is that lxcfs is a global component on each host: if we restart it, we have to restart all customer VPS servers. This is obviously not great, as everybody on that host then has some downtime of a few minutes.

    So, instead of restarting everything willy-nilly, which fixes things only until the next crash of lxcfs, we are actively working with people at Canonical and the creator of our virtualization layer (LXD), who are absolute experts in the field, on how we can resolve the situation, so that hopefully we will "only" need a single restart of our hosts and will be in a good state after that.

    We are still seeing the occasional Kernel crash, but they are fortunately few and far between after we balanced our infrastructure and tweaked some knobs and dials in order to reduce the chance of a kernel crash as much as we could.

    The basic issue in the Ubuntu kernel has not been identified yet, nor fixed, and the people we speak to say that it could take months for an official fix from Canonical.

    For this reason, we now have a roadmap for how to mitigate this issue, but it's rather technical and requires a lot of testing. We hope to complete this testing this week, and then we will do a rolling update of all host systems (with a restart for everybody...) in order to bring us up to a cleaner, newer kernel version which should (hopefully) be free of the bugs we are seeing. This is the recommended course of action by the experts we have been in dialogue with.

    This whole undertaking is a massive task, and we are hard at work on this.

    We thank you for your continued patience with us as we work to resolve these issues

    Arni Johannesson, CEO

  • Update
    In progress
    July 05, 2024 at 7:30 AM

    Unfortunately this issue is persisting on our infrastructure. We have now gotten as far as determining the most likely core cause of this issue in cooperation with the lxcfs developer who has been assisting us. Unfortunately, we have also started seeing some wholesale kernel crashes again, which was otherwise an issue we believed had gone away.

    Without going into too much detail, the issue seems to be in the latest Ubuntu kernel and the ZFS kernel modules. What happens is that on systems with large disk read workloads, if the ZFS ARC RAM cache fills up and ZFS starts hitting the disk drives a lot, this triggers some bug which can cause random memory corruption in the Linux kernel. This in turn can cause interesting (bad) side effects seemingly at random, such as a kernel crash requiring a reboot of the host system.
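
    As a purely illustrative aside, the ARC pressure described above can be observed on any ZFS-on-Linux host via the standard kstat file /proc/spl/kstat/zfs/arcstats; the short Python sketch below (not our monitoring code) simply reports how full the ARC is.

    ```python
    # Illustrative sketch: report how full the ZFS ARC is on a ZFS-on-Linux host.
    # Reads the standard kstat file /proc/spl/kstat/zfs/arcstats (not Webdock tooling).
    from pathlib import Path

    stats = {}
    lines = Path("/proc/spl/kstat/zfs/arcstats").read_text().splitlines()
    for line in lines[2:]:              # skip the two kstat header lines
        name, _kind, value = line.split()
        stats[name] = int(value)

    size_gib = stats["size"] / 2**30    # current ARC size in GiB
    max_gib = stats["c_max"] / 2**30    # configured ARC ceiling in GiB
    print(f"ZFS ARC: {size_gib:.1f} GiB of {max_gib:.1f} GiB "
          f"({100 * stats['size'] / stats['c_max']:.0f}% full)")
    ```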

    We are actively working on this issue - in fact we are not working on anything else these days - finding ways to mitigate and solve it. We hope to find a permanent resolution soon, but a lot of the actual work is out of our hands, as this essentially falls under the purview of the Ubuntu kernel maintainers.

    We will of course update here once we know more

    Arni Johannesson
    CEO

  • In progress
    June 19, 2024 at 1:22 PM

    Quick update on this issue: Since we implemented crash collection code in order to assist the lxcfs developers, we have yet to see a single crash of this subsystem and everything has been running well. This is good news for our customers who'd otherwise be potentially impacted, but bad news for actually identifying the issue we were having. We are keeping this issue open for now and are keeping our crash collection in place until we see a crash of lxcfs so we can hopefully move forward on this issue.
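
    For the technically curious: we won't publish our actual collection code here, but the Python sketch below is a purely illustrative example of the kind of sanity check one can run to confirm a host is even able to capture core dumps from a crashing daemon such as lxcfs.

    ```python
    # Illustrative example only, not Webdock's crash collection code: check whether
    # the host is configured to capture core dumps from a crashing daemon.
    import resource
    from pathlib import Path

    core_pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    print(f"kernel.core_pattern        : {core_pattern}")
    print(f"RLIMIT_CORE (this process) : soft={soft} hard={hard}")

    if core_pattern.startswith("|"):
        print("core dumps are piped to a handler (for example systemd-coredump or apport)")
    elif soft == 0:
        print("core dumps are disabled for this shell; raise the limit (ulimit -c) before collecting")
    ```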

  • Planned
    June 14, 2024 at 8:51 AM

    If you experience unscheduled reboots of your VPS or are wondering about the low uptime of your system, this is likely because we are currently being forced to reboot some VPS servers from time to time. This issue is related to the one we posted about during our migration period, but it is not as critical as the issue we saw then and seems unrelated to the performance tweaks we did which resolved the outright crashes of our hosts.

    This is a new issue in a subsystem we use for virtualization of container VPS instances, called lxcfs. The maintainers of lxcfs believe this is likely an issue in either the Linux Kernel or in another component they use in lxcfs called libfuse.

    Now, the issue is fortunately relatively rare - we are seeing this happen maybe once every 48 hours at this time - but it requires a reboot of affected VPS servers, as once lxcfs crashes there is currently no way to reattach lxcfs to running container VPS servers. Reboots mean 3-5 minutes of downtime for the affected VPS servers.

    The symptom of a crash is that your VPS suddenly has visibility of all resources on the host system and all CPU and memory activity being performed. Also, in some VPS servers, if you try to run htop you will get the message "Cannot open /proc/stat: Transport endpoint is not connected". This means lxcfs has crashed on the system.

    Functionally speaking, for most workloads like websites and the like, the crashes have little impact: all resource limitations for your VPS are still in place - you are just unable to run htop, and your server now "sees" incorrect CPU utilization and resource totals.
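
    If you want to check your own VPS, the short Python sketch below is a purely illustrative example covering both symptoms; EXPECTED_MEM_GIB is a hypothetical placeholder you would set to the RAM size of your own plan.

    ```python
    # Illustrative example: detect the two lxcfs crash symptoms from inside a container VPS.
    # EXPECTED_MEM_GIB is hypothetical; set it to the RAM your VPS plan actually has.
    from pathlib import Path

    EXPECTED_MEM_GIB = 8

    try:
        meminfo = Path("/proc/meminfo").read_text()
    except OSError as err:
        # A dead lxcfs FUSE mount typically fails with "Transport endpoint is not connected",
        # the same error htop reports for /proc/stat.
        print(f"lxcfs appears to have crashed on the host: {err}")
    else:
        mem_total_kib = int(meminfo.split("MemTotal:")[1].split()[0])
        mem_total_gib = mem_total_kib / 2**20
        if mem_total_gib > EXPECTED_MEM_GIB * 1.5:
            print(f"MemTotal is {mem_total_gib:.0f} GiB - looks like the host's memory, lxcfs is likely detached")
        else:
            print(f"MemTotal is {mem_total_gib:.0f} GiB - lxcfs appears to be working normally")
    ```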

    However, as a lot of you are using monitoring tools that look at server resource consumption, we cannot just ignore an lxcfs crash, and many of you write to support, rightly worried that something is wrong with your server (which there is).

    The systems most affected by this are container VPS servers running in Denmark, as they are (ironically enough) running all the latest and greatest software (latest kernel version etc.), which seems to be the root cause, as we have not seen this issue on any of our systems in the past. This certainly seems like a bug introduced in the latest kernel, lxcfs or libfuse.

    We have adopted the strategy of performing reboots of affected VPS servers as soon as time permits once the issue becomes known to us. We are somewhat stuck between a rock and a hard place here, as we get complaints if we leave systems without lxcfs running, and we also (naturally) get complaints when we've had to reboot VPS servers.

    We are working with the maintainers of lxcfs to identify and resolve the issue, but as things go in open source, we first have to identify the correct subsystem that is at fault, and then that system will need to be patched or rolled back. If there is any way for us to directly patch or roll back the affected systems and not have to wait for e.g. a new kernel release, we will do so. If you want all the gory technical details and to follow along on progress, you can take a look here: https://github.com/lxc/lxcfs/issues/644

    We hope this will be quickly resolved, but it's looking like this may unfortunately be the status quo for some time yet. We will keep this maintenance notification open for the duration and update here once new information becomes available to us. Thank you for your patience and we apologize for the inconvenience caused by this issue.

    Arni Johannesson
    CEO