All systems operational

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required
Scheduled for June 14, 2024 at 8:51 AM – 9:51 AM (about 1 hour)
  • Planned
    June 14, 2024 at 8:51 AM

    If you are experiencing unexpected reboots of your VPS, or are wondering about low uptime of your system, this is likely because we are currently forced to reboot some VPS servers from time to time. This issue is related to the one we posted about during our migration period, but it is not as critical as the issue we saw then, and it appears unrelated to the performance tweaks that resolved the outright crashes of our hosts.

    This is a new issue in a subsystem we use for virtualization of container VPS instances, called lxcfs. The maintainers of lxcfs believe it is likely a problem in either the Linux kernel or in another component lxcfs depends on, libfuse.

    Fortunately, the issue is relatively rare - we are seeing it happen maybe once every 48 hours at this time - but it requires a reboot of affected VPS servers, as once lxcfs crashes there is currently no way to reattach it to running container VPS instances. A reboot means 3-5 minutes of downtime for the affected VPS servers.

    The symptom of a crash is that your VPS suddenly gains visibility of all resources on the host system, including all CPU and memory activity taking place there. In addition, on some VPS servers, running htop will fail with the message "Cannot open /proc/stat: Transport endpoint is not connected". This means lxcfs has crashed on the system.
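
    If you want to check for this condition from inside your VPS yourself, a minimal sketch (our illustration, not an official tool) can simply try to read /proc/stat and look for the ENOTCONN error that a dead FUSE mount produces:

        # Hypothetical check from inside a container VPS: when lxcfs has
        # crashed, reading the /proc files it normally virtualizes fails
        # with "Transport endpoint is not connected" (errno ENOTCONN).
        import errno

        def lxcfs_looks_crashed(path: str = "/proc/stat") -> bool:
            try:
                with open(path) as f:
                    f.read()
            except OSError as e:
                return e.errno == errno.ENOTCONN
            return False

        if __name__ == "__main__":
            print("lxcfs crashed:", lxcfs_looks_crashed())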

    Functionally speaking, for most workloads such as websites, the crashes have little impact: all resource limitations for your server are still in place - you are just unable to run htop, and your server now "sees" incorrect CPU utilization and resource totals.
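
    To illustrate why the limits still hold: the kernel enforces them through cgroups, which are independent of the lxcfs-virtualized /proc view that monitoring tools read. A minimal sketch of the difference (assuming a cgroup v2 host; the exact paths on your server may differ):

        # Hypothetical illustration, assuming cgroup v2. The kernel enforces
        # limits via cgroup files, which keep working even when the lxcfs
        # view of /proc is broken after a crash.
        from pathlib import Path

        def first_line(path: str) -> str:
            try:
                return Path(path).read_text().splitlines()[0]
            except OSError as e:
                return f"<unreadable: {e}>"

        # What htop and monitoring tools read (virtualized by lxcfs):
        print("/proc/meminfo:", first_line("/proc/meminfo"))
        # What the kernel actually enforces (unaffected by an lxcfs crash):
        print("memory.max:", first_line("/sys/fs/cgroup/memory.max"))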

    However, as many of you use monitoring tools that watch server resource consumption, we cannot simply ignore an lxcfs crash: many of you write to support, rightly worried that something is wrong with your server (which there is).

    The systems most affected by this are container VPS servers running in Denmark, as they are (ironically enough) running all the latest and greatest software (latest kernel version, etc.) - which seems to be the root cause, as we have never seen this issue on any of our systems in the past. It strongly suggests a bug introduced in the latest kernel, lxcfs, or libfuse.

    Our strategy is to reboot affected VPS servers as soon as time permits once an incident becomes known to us. We are stuck between a rock and a hard place here: we get complaints if we leave systems running without lxcfs, and we (naturally) also get complaints when we have to reboot VPS servers.

    We are working with the maintainers of lxcfs to identify and resolve the issue, but as things go in open source, we first have to identify which subsystem is at fault, and then that subsystem will need to be patched or rolled back. If there is any way for us to directly patch or roll back the affected systems without having to wait for e.g. a new kernel release, we will do so. If you want all the gory technical details and wish to follow progress, take a look here: https://github.com/lxc/lxcfs/issues/644

    We hope this will be resolved quickly, but unfortunately it looks like this may be the status quo for some time yet. We will keep this maintenance notification open for the duration and update here as new information becomes available to us. Thank you for your patience, and we apologize for the inconvenience caused by this issue.

    Arni Johannesson
    CEO

Denmark: Network Infrastructure
99.94% uptime
Denmark: Storage Backend
100.0% uptime
Denmark: General Infrastructure
99.97% uptime
Canada: Network Infrastructure
99.89% uptime
Canada: Storage Backend
100.0% uptime
Canada: General Infrastructure
100.0% uptime
Webdock Statistics Server
100.0% uptime
Webdock Dashboard
100.0% uptime
Webdock Website
100.0% uptime
Webdock Image Server
100.0% uptime
Webdock REST API
100.0% uptime

Recent notices

June 14, 2024

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required
Scheduled for June 14, 2024 at 8:51 AM – 9:51 AM (about 1 hour) - see the full, ongoing notice above.

June 12, 2024

One fiber down in DK DC, some impact was seen on 217.78.237.0/24
  • Resolved
    Resolved

    Through pure luck, we have been operational almost all day today, as we had not yet shifted our entire network away from Finland. We are still unclear on how much of our network is incorrectly configured with GlobalConnect, but we will make sure everything is in order with them tomorrow.

    We have now learned a bit more about the nature of the incident. It seems they had a major fiber break on a backbone bundle in connection with some freeway work near Kolding. Because the break was beneath a busy freeway and the bundle is large, their repair work has taken a very long time.

    They now report that they should have our fiber up within the next 3 hours. As we are not operationally impacted and we now know the timeframe for their fix, we are calling this issue resolved on our side.

  • Monitoring
    Update

    Our ISP GlobalConnect still has an outage in about half of their locations in Denmark. At this point, this is looking like one of the biggest outages this provider has ever had in the Nordics, at least given the scale of the affected area and the duration. We are still waiting to receive light on the affected fiber pair, but fortunately, due to our redundant setup, we are currently unaffected and have no connectivity issues. All customers and IP ranges remain operational. We will update here once we are in a redundant and normalized state again in the DK DC.

    In the meantime, we have reached out to other ISPs in the region today and are looking into establishing more fiber to our facility for further redundancy. Today's incident was a warning to us that despite having redundant, fully-diverse-route connections, we still rely on the infrastructure (and BGP configuration) of a single upstream provider. We are working towards fixing that as soon as possible.

    In addition, we have implemented further monitoring so we can catch partial outages sooner. This morning we did not at first realize that a single IP range was non-functional, as all the others were working, so it took us 20-30 minutes to discover that we were indeed affected by the ISP outage. This time to react should now be significantly reduced, as we have automated monitoring on all IP addresses on our network (a sketch of this kind of check follows after the updates below).
  • Monitoring
    Update

    It seems our ISP incorrectly configured one of our ranges, so that it was not being advertised to the internet in a redundant fashion. We still have a tunnel from our old Finland location, and we asked Hetzner to start advertising the prefix 217.78.237.0/24 again. They did so promptly, and we now see connectivity again. The underlying issue is not yet fixed, however, so we will keep this incident open until the situation is fully resolved.

  • Monitoring
    Update

    It seems we are not completely unaffected after all: one of our IP ranges, 217.78.237.0/24, is being affected by the outage for unknown reasons. It should be routed the same as all our other nets, but for some reason the partial ISP outage is hitting this one. This impacts about 8% of our customers, so if you are one of the unlucky ones, rest assured we are working to resolve the issue.

  • Monitoring
    Monitoring

    We saw a fiber connection to our DK DC lose light at about 8:40 CET. After speaking with our ISP, they report that some core equipment went down at their Kolding location. We of course have redundant fiber connections, so our other links took over all traffic, and all we saw was a short period of packet loss while traffic that had been flowing through Kolding was redirected through our other fiber from that ISP. There appears to be no lasting impact on us at this time. The ISP is working on their issue and says we should receive light again sometime today, at which point we will be fully redundant again. We will monitor the situation, but there is nothing we can do on our side at the moment except wait.
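
    As mentioned in the update above, a per-address reachability check is what now lets us catch partial outages sooner. A minimal sketch of that kind of check (purely illustrative, not our production tooling; assumes the system ping binary on Linux):

        # Hypothetical per-address reachability sweep for a prefix such as
        # 217.78.237.0/24. Flags the range if too many hosts stop answering.
        import ipaddress
        import subprocess

        def host_is_up(ip: str, timeout_s: int = 1) -> bool:
            # One ICMP echo request via the system ping binary (Linux iputils).
            result = subprocess.run(
                ["ping", "-c", "1", "-W", str(timeout_s), ip],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
            return result.returncode == 0

        def check_prefix(prefix: str, alert_fraction: float = 0.5) -> None:
            hosts = [str(ip) for ip in ipaddress.ip_network(prefix).hosts()]
            down = [ip for ip in hosts if not host_is_up(ip)]
            if len(down) >= alert_fraction * len(hosts):
                print(f"ALERT: {prefix} degraded ({len(down)}/{len(hosts)} unreachable)")

        check_prefix("217.78.237.0/24")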
