All systems operational

Denmark: Network Infrastructure

99.94% uptime
Apr 2024 · 100.0% uptime
May 2024 · 99.97% uptime
Jun 2024 · 99.84% uptime
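
To put these percentages in perspective, the sketch below converts a monthly uptime figure into minutes of downtime and averages monthly figures. It assumes a calendar-month measurement window and a simple mean across months; the page does not state how its figures are actually computed, and the function names are illustrative only.

```python
import calendar


def downtime_minutes(uptime_percent: float, year: int, month: int) -> float:
    """Minutes of downtime implied by an uptime percentage for one calendar month."""
    days_in_month = calendar.monthrange(year, month)[1]
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - uptime_percent) / 100.0


def mean_uptime(monthly_percentages: list[float]) -> float:
    """Simple (unweighted) mean of monthly uptime percentages. Assumed method."""
    return sum(monthly_percentages) / len(monthly_percentages)
```

Under these assumptions, 99.84% for June 2024 corresponds to roughly 69 minutes of downtime, and the simple mean of the three monthly figures rounds to the 99.94% shown for the quarter.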

Denmark: Storage Backend

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Denmark: General Infrastructure

99.97% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 99.90% uptime

Canada: Network Infrastructure

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Canada: Storage Backend

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Canada: General Infrastructure

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Webdock Statistics Server

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Webdock Dashboard

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Webdock Website

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Webdock Image Server

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Webdock REST API

100.0% uptime
Apr 2024 · 100.0% uptime
May 2024 · 100.0% uptime
Jun 2024 · 100.0% uptime

Notice history

Jun 2024

Ongoing problems with virtualization subsystem, periodic restarts of some VPS servers required
Scheduled for June 14, 2024 at 8:51 AM – 9:51 AM (about 1 hour)
  • Planned
    June 14, 2024 at 8:51 AM

    If you experience unscheduled reboots of your VPS or are wondering about low uptime of your system, this is likely because we are currently forced to reboot some VPS servers from time to time. This issue is related to the one we posted about during our migration period, but it is not as critical as that issue, and it seems unrelated to the performance tweaks which resolved the outright crashes of our hosts.

    This is a new issue in a subsystem we use for virtualization of container VPS instances, called lxcfs. The maintainers of lxcfs believe this is likely an issue in either the Linux Kernel or in another component they use in lxcfs called libfuse.

    Now, the issue is fortunately relatively rare - we are seeing this happen maybe once every 48 hours at this time - but it requires a reboot of affected VPS servers because once lxcfs crashes, there is currently no way to reattach lxcfs to running container VPS servers. Reboots mean 3-5 minutes of downtime for the affected VPS servers.

    The symptom of a crash is that your VPS suddenly has visibility of all resources on the host system and of all CPU and memory activity being performed there. Also, on some VPS servers, if you try to run htop you will get the message "Cannot open /proc/stat: Transport endpoint is not connected". This means lxcfs has crashed on the system.
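
If you want to check for this symptom from your own monitoring, a minimal sketch follows. It assumes the crash surfaces as the ENOTCONN error ("Transport endpoint is not connected") when /proc/stat is opened or read, as in the htop message above. As noted, only some VPS servers show this error, so a False result does not prove lxcfs is healthy. The function name is illustrative.

```python
import errno


def lxcfs_looks_crashed(path: str = "/proc/stat") -> bool:
    """Return True if reading `path` fails with ENOTCONN.

    ENOTCONN is the errno behind "Transport endpoint is not connected",
    which is what a dead FUSE mount reports. Any other error is re-raised
    so it is not mistaken for an lxcfs crash.
    """
    try:
        with open(path, "rb") as handle:
            handle.read(1)
    except OSError as exc:
        if exc.errno == errno.ENOTCONN:
            return True
        raise
    return False
```

A monitoring job could call lxcfs_looks_crashed() periodically and raise an alert, rather than relying on CPU and memory graphs that now reflect the whole host.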

    Functionally speaking, for most workloads like websites and the like, the crashes have little impact: all resource limitations for your VPS are still in place - you are just unable to run htop, and your server now "sees" incorrect CPU utilization and an incorrect amount of resources.

    However, as a lot of you are using monitoring tools that look at server resource consumption, we cannot just ignore an lxcfs crash: a lot of you write to support, rightly worried that something is wrong with your server (which there is).

    The systems most affected by this are container VPS servers running in Denmark, as they are (ironically enough) running on all the latest and greatest software (latest kernel version etc.) - which seems to be the root cause as we have not seen this issue before on any of our systems in the past. This certainly seems like a bug introduced in the latest kernel, lxcfs or libfuse.

    We have adopted the strategy of performing reboots of affected VPS servers as soon as time permits, as the issue becomes known to us. We are kind of stuck between a rock and a hard place here, as we get complaints if we leave systems without lxcfs running and we also (naturally) get complaints if we've had to reboot VPS servers.

    We are working with the maintainers of lxcfs to identify and resolve the issue, but as things go in Open Source, we first have to identify the correct subsystem which is at fault, and then that system will need to be patched or rolled back. If there is any way for us to directly patch or roll back the affected systems and not have to wait for e.g. a new kernel release, we will do so. If you want all the gory technical details and want to follow along on progress, you can take a look here: https://github.com/lxc/lxcfs/issues/644

    We hope this will be quickly resolved, but it's looking like this may unfortunately be the status quo for some time yet. We will keep this maintenance notification open for the duration and update here once new information becomes available to us. Thank you for your patience and we apologize for the inconvenience caused by this issue.

    Arni Johannesson
    CEO

One fiber down in DK DC, some impact was seen on 217.78.237.0/24
  • Resolved
    Resolved

    Through pure luck we have been operational almost all day today as we had not shifted our entire network from Finland yet. We are still unclear as to how much of our network is incorrectly configured with GlobalConnect but we will make sure everything is in order with them tomorrow.

    We have now learned a bit more about the nature of the incident. It seems they had a major fiber break on a backbone bundle in connection with some freeway work near Kolding. Because the break was beneath a busy freeway, and because of the size of the bundle, their repair work has taken a very long time.

    They now report they should have our fiber up within the next 3 hours. As we are not operationally impacted and we know the timeframe for their fix, we are calling this issue resolved on our side.

  • Monitoring
    Update
    Our ISP GlobalConnect still has an outage in about half of their locations in Denmark. We believe this is looking like one of the biggest outages this provider has ever had in the Nordics, at least given the scale of the affected area and the duration.

    We are still waiting to receive light on the affected fiber pair, but fortunately, due to our redundant setup, we are currently unaffected and have no issues with connectivity. All customers and IP ranges are still operational. We will update here once we are in a redundant and normalized state again in the DK DC.

    In the meantime we have reached out to other ISPs in the region today and are looking into establishing more fiber to our facility for further redundancy. The incident today was a warning to us that despite having redundant, fully diverse-route connections, we are still relying on the infrastructure (and BGP configuration) of a single upstream provider. We are working towards fixing that as soon as possible.

    In addition, we have implemented further monitoring so we can catch partial outages sooner. This morning we did not realize at first that a single IP range was non-functional, as all the others were working, so it took us 20-30 minutes to realize that we were indeed affected by the ISP outage. This time to react should be significantly reduced now, as we have automated monitoring on all IP addresses on our network.
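
As an illustration of the kind of per-range check described above, here is a minimal sketch that groups probe results by prefix and flags any prefix where no probed address answered. It is not our actual monitoring code: the function name is hypothetical, the probe results are assumed to be collected elsewhere (for example by a pinger running outside our network), and 192.0.2.0/24 in the usage note is a documentation placeholder range.

```python
import ipaddress
from collections import defaultdict


def unreachable_prefixes(prefixes: list[str], probe_results: dict[str, bool]) -> list[str]:
    """Return the prefixes in which every probed address failed to answer.

    prefixes:      CIDR strings, e.g. "217.78.237.0/24"
    probe_results: maps an IP address string to True (answered) or False
    Prefixes with no probed addresses at all are not reported.
    """
    networks = [ipaddress.ip_network(prefix) for prefix in prefixes]
    results_by_network = defaultdict(list)
    for address, answered in probe_results.items():
        ip = ipaddress.ip_address(address)
        for network in networks:
            if ip in network:
                results_by_network[network].append(answered)
    return [
        str(network)
        for network in networks
        if results_by_network[network] and not any(results_by_network[network])
    ]
```

Called with the prefixes "217.78.237.0/24" and "192.0.2.0/24" and probe results where only the 192.0.2.x addresses answer, this reports 217.78.237.0/24 alone, which is the "one range down, all others fine" pattern that was easy to miss by hand.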
  • Monitoring
    Update

    It seems our ISP incorrectly configured one of our ranges so that it was not being advertised to the internet in a properly redundant fashion. We still have a tunnel from our old Finland location, and we requested Hetzner to start advertising this prefix, 217.78.237.0/24, again. They did so promptly and we now see connectivity again. The underlying issue is not fixed yet, however, so we will keep this incident open until the situation is fully resolved.

  • Monitoring
    Update

    It seems we are not completely unaffected: it turns out one of our IP ranges, 217.78.237.0/24, is being affected by the outage for unknown reasons. It should be routed the same as all other nets, but for some reason the partial ISP outage is affecting this net. This is impacting about 8% of our customers, so if you are one of the unlucky ones, rest assured we are working to resolve this issue.

  • Monitoring
    Monitoring

    We saw a fiber connection to our DK DC lose light at about 8:40 CET. We spoke with our ISP, and they report that some core equipment went down in their Kolding location. We of course have redundant fiber connections, so our other connections took over all traffic, and all we saw was a short period of some packet loss while traffic that was flowing through Kolding was redirected through our other fiber from that ISP. It seems there is no lasting impact on us at this time. The ISP is working on their issue and they say we should receive light again sometime today, so that we are fully redundant again. We will monitor the situation, but there is nothing we can do on our side at the moment except wait.

Intermittent host instability causing downtime in Denmark - resolution forthcoming
  • Resolved
    Resolved

    We are now calling this infrastructure issue resolved, as we have completed all migrations from Finland and have not seen a virtualization crash since we implemented our fix on Saturday. We have been forced to do host restarts on two separate hosts in the same period, however, but this was due to the systems already being in a bad state from the previous bad config.

    We were planning to shift all of our IP ranges away from Finland today so that they are announced directly in Denmark, which would reduce latency by about 20ms. However, due to the other, unrelated issue we saw today, where our ISP has a major outage in their Kolding DC which is affecting one of our fibers and where it looks like they have incorrectly configured our IP ranges, we are postponing the changeover until our ISP has fixed everything on their side and our ranges are correctly configured with them.

    The changeover should have no noticeable impact on our customers, but we will post a maintenance notification here when we do the operation, just in case.

  • Monitoring
    Monitoring

    During the day Saturday after continued investigations of the root causes of our problems, we found a simple caching parameter in our system setup which had been set to an incorrectly low value. After modifying this value across all of our hosts we saw an immediate drop in load across the board. The incorrect value was written by our base setup orchestration scripts and was a holdover from earlier testing.

    We are actually quite amazed at what a huge difference it made setting this caching parameter to a proper value.

    Ever since we modified this parameter all systems have been green across the board and operating at a fantastic efficiency. In fact, the infrastructure seems to be performing now as we had planned all along (if not better) and we don't have a single host breaking a sweat at this time.

    Our customers should be able to notice a clear difference now that this fix has been implemented. What's even better news is that since we implemented this fix, we have not seen a single crash or hang of our virtualization.

    It is too early to call the issue completely fixed however due to the relative infrequency of crashes we saw before - but if everything runs stable for the next 48 hours or so, it's really looking like a simple configuration tweak was all that it took to resolve our (honestly quite major) issues.

    We will still proceed with our plan of moving certain high-resource VPS servers to a dedicated location Monday/Tuesday, and we are still postponing the last migrations from Finland until Tuesday. We want to be absolutely sure things are OK and that this isn't a "too good to be true" type of situation.

    But for now, all is well in our cloud. We couldn't be happier that we found this resolution and we hope you are too :)

    Thanks again for sticking with us here

    Arni Johannesson

    CEO

  • Identified
    Identified

    Over the last few days, after having migrated about 90% of our workloads from Finland, we have identified an issue which is causing us a great deal of trouble. Essentially speaking, the new virtualization environment we have set up in Denmark has proven to be unstable when our host instances are under load.

    What we observe is that our virtual environment crashes and halts all processing. There is no common denominator except it's happening on specific hosts and all of these hosts are ones where we have placed high cpu and high i/o users. The crashes do not leave any trace in any logs and all we see is just a halt state where a reboot of the host is required to bring up the customer workloads. Fortunately restarts are quick and the crashes seem to not happen more than once or twice in any 48 hour period on the affected hosts. As infrequent as they are, these incidents are of course completely unacceptable.

    The only resolution to this issue is to deploy new hardware where we get rid of the virtualization components responsible for this behavior, and to migrate high CPU and I/O users over to the new hardware. We have put in an emergency order with our hardware vendor this morning and we expect to receive and deploy the new hardware Monday, at which time a small subset of our customers (it looks to be about 30-40 VPS instances at most causing these issues at this time) will be migrated to the new hardware in order to improve overall stability of our cloud. If you see an unplanned migration notification Monday or Tuesday after already being migrated to Denmark, this is the reason.

    Moving forward we will not be deploying workloads in the same virtual environment as we have set up now in Denmark, as there is no way for us to fix this issue with the virtualization. All we can do is to get rid of it and return to a more direct-to-bare-metal approach as we have traditionally done (and which has never caused us such problems in the past).

    We did test our new virtualization extensively and under load before deployment - and for at least 48 hours continuously for each system we deployed - but some subset of our customers are doing something "special" which we can't quite identify, and which puts our virtualization under some unique stress that causes it to hang from time to time. As it turns out, simulated load (stressing CPU, I/O and network) does not reflect real-world load closely enough for us to have caught this issue before we went to full-scale deployment in Denmark.

    As we don't know exactly what triggers this issue, we are of course worried it will happen on otherwise (thus far) unaffected hosts if a customer starts performing some workload our virtualization doesn't like. If this turns out to be the case, we will likely be forced to migrate most if not all of our customers who are already in Denmark to hardware which is configured without the troublesome virtualization components. We will do what we can to avoid this, however, as besides being a huge amount of work and investment, it would mean another migration downtime period for our customers. If this turns out to be required, we will of course send out a news bulletin to all affected customers with details.

    We will of course monitor the situation closely over the weekend and respond as quickly as we can around the clock in order to bring back up customer instances if these hangs/crashes happen again. We are almost sure it will happen again, at least a couple of times in the next 2-3 days until we have this resolved, given the behavior we have seen these past two days.

    We sincerely apologize for these disruptions. Denmark DC was supposed to be a fast (and happy) place for your VPS, but this issue caught us off guard. Rest assured we are doing everything we can to eliminate this and will do so as soon as possible in the coming days.

    Thank you for sticking with us

    Arni Johannesson

    CEO Webdock

May 2024

Apr 2024

Host node outage in Finland
  • Resolved
    Resolved

    All customer VPS and block storage data has now been migrated and activated. We once again apologize for the long duration of this incident for some of you today.

  • Monitoring
    Update

    All customers have now been migrated. We have yet to sync some data (such as snapshots) and add-on block storage for a subset of customers. Some customers may see a bit more downtime later in the day as we activate block storage, but that is yet to be determined.

  • Monitoring
    Monitoring

    Migrations are completing as planned. We apologize for the long wait some of you may be experiencing. If you want your server prioritized for faster recovery, please write to support.

  • Identified
    Update

    We are restoring all customer servers on different nodes in Finland. Your server will come up soon from the latest snapshot available. If you want your server to be prioritized, write to support and we will bump it to the front of the queue. We expect the operation to complete fully within an hour or two. Thank you for your patience during this incident today.

  • Identified
    Identified

    The hardware failure looks to be critical. We are evaluating next steps and it looks like we may need to restore all customer VPS on that host elsewhere on our infrastructure from the latest snapshot.
