Webdock - Status Page

All systems operational

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required
Scheduled for June 14, 2024 at 8:51 AM – 9:51 AM about 1 hour
  • Update
    July 05, 2024 at 7:30 AM
    In progress

    Unfortunately, this issue persists on our infrastructure. Working with the lxcfs developer who has been assisting us, we have now determined the most likely core cause. We have also started seeing wholesale kernel crashes again, an issue we had believed was gone.

    Without going into too much detail, the issue seems to lie in the latest Ubuntu kernel and the ZFS kernel modules. On systems with a large disk read workload, when the ZFS ARC RAM cache fills up and ZFS starts hitting the disk drives a lot, a bug is triggered which can cause random memory corruption in the Linux kernel. This in turn can cause bad side effects seemingly at random, such as a kernel crash requiring a reboot of the host system.

    We are actively working on ways to mitigate and solve this issue; in fact, we are not working on anything else these days. We hope to find a permanent resolution soon, but much of the actual work is out of our hands, as it falls under the purview of the Ubuntu kernel maintainers.

    We will of course update here once we know more.

    Arni Johannesson
    CEO

  • In progress
    June 19, 2024 at 1:22 PM

    Quick update on this issue: Since we implemented crash collection code in order to assist the lxcfs developers, we have yet to see a single crash of this subsystem and everything has been running well. This is good news for our customers who'd otherwise be potentially impacted, but bad news for actually identifying the issue we were having. We are keeping this issue open for now and are keeping our crash collection in place until we see a crash of lxcfs so we can hopefully move forward on this issue.

  • Planned
    June 14, 2024 at 8:51 AM

    If you experience unexpected reboots of your VPS or are wondering about the low uptime of your system, this is likely because we are currently forced to reboot some VPS servers from time to time. This issue is related to the one we posted about during our migration period, but it is not as critical as the issue we saw then, and it seems unrelated to the performance tweaks which resolved the outright crashes of our hosts.

    This is a new issue in a subsystem we use for virtualization of container VPS instances, called lxcfs. The maintainers of lxcfs believe this is likely an issue in either the Linux Kernel or in another component they use in lxcfs called libfuse.

    Now, the issue is fortunately relatively rare - we are seeing this happen maybe once every 48 hours at this time - but it requires a reboot of affected VPS servers as once lxcfs crashes, there is no way (currently) to reattach lxcfs to running container VPS servers. Reboots mean 3-5 minutes of downtime for the affected VPS servers.

    The symptom of a crash is that your VPS suddenly has visibility of all resources on the host system and of all CPU and memory activity being performed there. Also, in some VPS servers, if you try to run htop, you will get the message "Cannot open /proc/stat: Transport endpoint is not connected". This means lxcfs has crashed on the system.
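
    For those who would rather check for this from a script than by running htop, the following is a minimal sketch, not an official Webdock tool. It assumes the "Transport endpoint is not connected" message corresponds to the ENOTCONN error code (as it does on Linux) and that the listed /proc files are the ones lxcfs provides inside the container. The function name and the file list are illustrative assumptions.

```python
import errno

# Files that lxcfs commonly provides inside a container.
# This list is an assumption; adjust it to your own system.
LXCFS_FILES = ["/proc/stat", "/proc/meminfo", "/proc/cpuinfo", "/proc/uptime"]


def lxcfs_looks_crashed(paths=LXCFS_FILES, opener=open):
    """Return True if opening or reading any path fails with ENOTCONN.

    ENOTCONN is the error behind "Transport endpoint is not connected".
    Other errors (for example a missing file) are ignored, since they
    do not indicate this particular symptom.
    """
    for path in paths:
        try:
            with opener(path, "r") as handle:
                handle.read(1)
        except OSError as exc:
            if exc.errno == errno.ENOTCONN:
                return True
    return False


if __name__ == "__main__":
    if lxcfs_looks_crashed():
        print("lxcfs appears to have crashed (ENOTCONN on a /proc file)")
    else:
        print("no ENOTCONN error seen on the checked files")
```

    Note that a negative result does not prove lxcfs is healthy: as described above, only some VPS servers show this error, while others simply start seeing the host's resources.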

    Functionally speaking, for most workloads like websites and the like, the crashes have little impact: all resource limitations for your VPS are still in place. You are just unable to run htop, and your server now "sees" incorrect CPU utilization and an incorrect amount of resources.
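
    The other symptom, the VPS "seeing" the host's resources, can be sketched as a comparison between the total memory reported in /proc/meminfo and the memory limit that is still enforced. This is an illustration only. It assumes a cgroup v2 system where the limit is readable at /sys/fs/cgroup/memory.max, which may not be the case on your VPS. The function names and the 5% tolerance are hypothetical.

```python
def mem_total_bytes(meminfo_text):
    """Extract MemTotal from /proc/meminfo content and return it in bytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # The value is reported in kB, which here means units of 1024 bytes.
            return int(line.split()[1]) * 1024
    return None


def sees_host_memory(meminfo_text, cgroup_limit_text, tolerance=0.05):
    """Return True if the reported total memory clearly exceeds the limit.

    cgroup_limit_text is the content of the cgroup memory limit file
    (assumed here to be memory.max on cgroup v2); "max" means no limit.
    """
    limit = cgroup_limit_text.strip()
    if limit == "max":
        return False  # no limit configured, nothing to compare against
    total = mem_total_bytes(meminfo_text)
    if total is None:
        return False
    return total > int(limit) * (1 + tolerance)


if __name__ == "__main__":
    # Paths are assumptions; the cgroup file location differs between setups.
    with open("/proc/meminfo") as f:
        meminfo = f.read()
    with open("/sys/fs/cgroup/memory.max") as f:
        limit = f.read()
    print("sees host memory:", sees_host_memory(meminfo, limit))
```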

    However, as a lot of you use monitoring tools that look at server resource consumption, we cannot just ignore an lxcfs crash: a lot of you write to support, rightly worried that something is wrong with your server (which there is).

    The systems most affected by this are container VPS servers running in Denmark, as they are (ironically enough) running on all the latest and greatest software (latest kernel version etc.) - which seems to be the root cause as we have not seen this issue before on any of our systems in the past. This certainly seems like a bug introduced in the latest kernel, lxcfs or libfuse.

    We have adopted the strategy of performing reboots of affected VPS servers as soon as time permits, as the issue becomes known to us. We are kind of stuck between a rock and a hard place here, as we get complaints if we leave systems without lxcfs running and we also (naturally) get complaints if we've had to reboot VPS servers.

    We are working with the maintainers of lxcfs to identify and resolve the issue, but as things go in Open Source, we first have to identify the subsystem which is at fault, and then that subsystem will need to be patched or rolled back. If there is any way for us to directly patch or roll back the affected systems without having to wait for e.g. a new kernel release, we will do so. If you want all the gory technical details and would like to follow along on progress, you can take a look here: https://github.com/lxc/lxcfs/issues/644

    We hope this will be quickly resolved, but it's looking like this may unfortunately be the status quo for some time yet. We will keep this maintenance notification open for the duration and update here once new information becomes available to us. Thank you for your patience and we apologize for the inconvenience caused by this issue.

    Arni Johannesson
    CEO

Denmark: Network Infrastructure - Operational

100% - uptime

Denmark: Storage Backend - Operational

100% - uptime

Denmark: General Infrastructure - Operational

100% - uptime

Canada: Network Infrastructure - Operational

100% - uptime

Canada: Storage Backend - Operational

100% - uptime

Canada: General Infrastructure - Operational

100% - uptime

Webdock Statistics Server - Operational

100% - uptime

Webdock Dashboard - Operational

100% - uptime

Webdock Website - Operational

100% - uptime

Webdock Image Server - Operational

100% - uptime

Webdock REST API - Operational

100% - uptime

Recent notices

No notices reported for the past 7 days
