Webdock - Status

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required

Scheduled for June 14, 2024 at 8:51 AM – July 13, 2024 at 5:21 AM 29 days

Update
July 23, 2024 at 7:09 AM
In progress
July 23, 2024 at 7:09 AM
Progress update from the Webdock team on this issue: Our remediation efforts have progressed as planned but have unfortunately not produced the desired result. Despite having upgraded affected systems to a much cleaner upstream kernel build and newer zfs filesystem components, there is still some memory corruption happening which crashes lxcfs. We now suspect the issue may be isolated in libfuse, but we cannot be certain at this time.
We have not seen any wholesale kernel crashes since the upgrades, but that issue cannot be considered closed either until we have had a bit more time without incident.
In order to effectively solve the lxcfs issue, we have through extensive dialogue with the creator of our hypervisor decided to try and adopt a completely new and cutting edge feature which doesn't even exist yet: per-instance lxcfs virtualization. This essentially means that each VPS server gets its own lxcfs virtualization instead of it being global for the host system. This means that if an lxcfs instance crashes, it will only affect a single customer vps, and where it is easily remediated by just rebooting that one vps - instead of having to reboot ALL customer vps servers on a host.

As mentioned this feature doesn't exist yet. It exists in lxcfs as a custom feature built for Netflix as they did not want their containerized workloads to have lxcfs crash across entire hosts but it actually hasn't been implemented in the hypervisor, LXD, yet. For this reason we have decided to sponsor the work of the fork of LXD called Incus for which the creator and former maintainer of LXD is responsible. He is currently working on implementing the feature which should be ready and tested within a few days.

After which, Webdock will proceed to migrate from LXD to Incus in a rolling fashion across our entire infrastructure in Denmark. This will obviously mean yet another downtime event for almost all our customers (only kvm customers are not affected by this) - but after the upgrade we should be in a state where 99.99% of our customers will no longer be affected by this issue and we should see greater stability throughout.

We will update here once we start the procedure, at first only on a couple of hosts where we will then watch behavior and performance for some days before moving on to the rest of our infrastructure.

We thank you all for your continued patience

Arni Johannesson, CEO Webdock
Update
July 16, 2024 at 9:25 AM
In progress
July 16, 2024 at 9:25 AM
We have now rolled out the supposed fix for our issues to all systems which were already affected by the crash. If these systems perform well over the next week or so, then we will roll out the fix to remaining systems whenever the problem occurs, but no sooner, in order to minimize disruption for our customers. We will update this issue once we know more and have had some time to observe and work with the new kernel and zfs packages.
Update
July 12, 2024 at 9:25 AM
In progress
July 12, 2024 at 9:25 AM
A quick status update on this issue from the Webdock team:
The main symptom of the current issues is the continued crashing of lxcfs on our host systems. This causes a range of issues, primarily that top/htop will not work in your vps and that you see the resource utilization of the entire host server you are on.
The main headache we are facing is that lxcfs is a global component on each host where if we restart it we have to restart all customer vps servers. This is obviously not great, as then everybody on that host has some downtime of a few minutes.

So, instead of restarting everything willy-nilly which fixes things until the next crash of lxcfs, we are actively working with people at Canonical and the creator of our virtualization (LXD), who are absolute experts in the field as to how we can resolve the situation, so that hopefully we will "only" need a single restart of our hosts, and we will be in a good state after that.

We are still seeing the occasional Kernel crash, but they are fortunately few and far between after we balanced our infrastructure and tweaked some knobs and dials in order to reduce the chance of a kernel crash as much as we could.

The basic issue in the Ubuntu Kernel has not been identified yet nor fixed, and the people we speak to say that it could take months for an offical fix from Canonical.

For this reason, we now have a roadmap for how to mitigate this issue, but it's rather technical and requires a lot of testing. We hope to complete this testing this week, and then we will do a rolling update of all host systems (with a restart for everybody...) in order to bring us up to a cleaner, newer kernel version which should (hopefully) be free of the bugs we are seeing. This is the recommended course of action by the experts we have been in dialogue with.

This whole undertaking is a massive task, and we are hard at work on this.

We thank you for your continued patience with us as we work to resolve these issues

Arni Johannesson, CEO
Update
July 05, 2024 at 7:30 AM
In progress
July 05, 2024 at 7:30 AM
Unfortunately this issue is persisting on our infrastructure. We have now gotten as far as having determined the most likley core cause of this issue in cooperation with the lxcfs developer who has been assisting us. Unfortunately we have also started seeing some wholesale kernel crashes again, which was otherwise an issue we believed had gone away.

Without going into too much detail, the issue seems to be in the latest Ubuntu kernel and the ZFS kernel modules. What happens is that on systems where there is a large disk read workloads, and if the ZFS ARC RAM cache fills up so ZFS starts hitting the disk drives a lot, this triggers some bug which can cause random memory corruption in the Linux Kernel. This in turn can cause interesting (bad) side-effects seemingly at random, such as a kernel crash requiring a reboot of the host system.

We are actively working on this issue - actually we are not working on anything else these days - finding ways to mitigate and solve this issue. We hope to find a permanent resolution soon, but a lot of the actual work is out of our hands as this essentially speaking falls under the purview of the Ubuntu kernel maintainers.
We will of course update here once we know more
Arni Johannesson
CEO
In progress
June 19, 2024 at 1:22 PM
In progress
June 19, 2024 at 1:22 PM
Quick update on this issue: Since we implemented crash collection code in order to assist the lxcfs developers, we have yet to see a single crash of this subsystem and everything has been running well. This is good news for our customers who'd otherwise be potentially impacted, but bad news for actually identifying the issue we were having. We are keeping this issue open for now and are keeping our crash collection in place until we see a crash of lxcfs so we can hopefully move forward on this issue.
Planned
June 14, 2024 at 8:51 AM
Planned
June 14, 2024 at 8:51 AM
If you experience out-of-order reboots of your VPS or are wondering about low uptime of your system, this is likely due to us currently being forced to reboot some VPS servers from time to time. This issue is related to the one we posted during our migration period, but not as critical as the issue we saw then and seems unrelated to the performance tweaks we did which resolved outright crashes of our hosts.
This is a new issue in a subsystem we use for virtualization of container VPS instances, called lxcfs. The maintainers of lxcfs believe this is likely an issue in either the Linux Kernel or in another component they use in lxcfs called libfuse.

Now, the issue is fortunately relatively rare - we are seeing this happen maybe once every 48 hours at this time - but it requires a reboot of affected VPS servers as once lxcfs crashes, there is no way (currently) to reattach lxcfs to running container VPS servers. Reboots mean 3-5 minutes of downtime for the affected VPS servers.

The symptoms of a crash is that suddenly your VPS will have visibility of all resources on the host system and all CPU and memory activity being performed. Also, in some VPS servers if you try to run htop, you will get the message "Cannot open /proc/stat: Transport endpoint is not connected". This means lxcfs has crashed on the system.

Functionally speaking, for most workloads like websites and the like, the crashes have little impact: All resource limitations for your are still in place - you are just unable to run htop and your server now "sees" the incorrect utilization of CPU and amount of resources.

However, as a lot of you are using monitoring tools that look for server resource consumption we cannot just ignore an lxcfs crash. Because a lot of you write support, rightly worried that something is wrong with your server (which there is)

The systems most affected by this are container VPS servers running in Denmark, as they are (ironically enough) running on all the latest and greatest software (latest kernel version etc.) - which seems to be the root cause as we have not seen this issue before on any of our systems in the past. This certainly seems like a bug introduced in the latest kernel, lxcfs or libfuse.

We have adopted the strategy of performing reboots of affected VPS servers as soon as time permits, as the issue becomes known to us. We are kind of stuck between a rock and a hard place here, as we get complaints if we leave systems without lxcfs running and we also (naturally) get complaints if we've had to reboot VPS servers.

We are working with the maintainers of lxcfs to identify and resolve the issue, but as things go in Open Source, we first have to identify the correct subsystem which is at fault and then that system will need to be patched or rolled back. If there is any way for us to directly patch/rollback the affected systems and not have to wait for e.g. a new kernel release, we will do so. If you want all the gory technical details and follow along on progress, you can take a look here: https://github.com/lxc/lxcfs/issues/644

We hope this will be quickly resolved, but it's looking like this may unfortunately be the status quo for some time yet. We will keep this maintenance notification open for the duration and update here once new information becomes available to us. Thank you for your patience and we apologize for the inconvenience caused by this issue.

Arni Johannesson
CEO

Recent notices

July 27, 2024

July 26, 2024

July 25, 2024

July 24, 2024

July 23, 2024

July 22, 2024

July 21, 2024

Host needs kernel upgrade and restart in Denmark

Completed
July 21, 2024 at 5:19 PM
Completed
July 21, 2024 at 5:19 PM
The maintenance is complete and all customer VPS servers are up
In progress
July 21, 2024 at 5:00 PM
In progress
July 21, 2024 at 5:00 PM
Maintenance is now in progress
Planned
July 21, 2024 at 5:00 PM
Planned
July 21, 2024 at 5:00 PM
We have a host machine in Denmark which needs a kernel upgrade and reboot. Your VPS will go down for at most 5-10 minutes, usually a lot less. You can check if your server(s) are affected by logging in to Webdock as this will then be shown at the top of the page in a red alert window.

Show notice history

Webdock - Status Page

Denmark: General Infrastructure under maintenance

Recent notices

July 27, 2024

July 26, 2024

July 25, 2024

July 24, 2024

July 23, 2024

July 22, 2024

July 21, 2024