All systems are operational

Past Incidents

26th May 2020

No incidents reported

25th May 2020

No incidents reported

24th May 2020

No incidents reported

23rd May 2020

No incidents reported

22nd May 2020

No incidents reported

21st May 2020

All issues relating to our Focal Fossa infrastructure upgrade appear to be resolved

We are now ready to call it and announce that all issues we have seen since Monday's upgrade have been resolved. For the last 26 hours everything has been ticking along normally. We will shortly be sending an email notification to all customers with a high-level overview of what happened and what remains to be tweaked.

20th May 2020

Runcloud communication issue resolved: requires a server reboot - SSH connectivity is still 100% OK

We are now confident that the Runcloud communication issue has been resolved. As mentioned in our earlier post, the issue related to the Go programming language, which powers the Runcloud daemon and which was unhappy on newer Ubuntu kernels (needlessly so, it seems). Our workaround of setting higher limits has worked and has been rolled out across the entire infrastructure. However:

YOU PROBABLY NEED TO REBOOT YOUR WEBDOCK SERVER FOR THIS TO TAKE EFFECT

Only do this if you experience communication failure between Runcloud and your server.

A way to check whether the fix is active for you is to SSH in to your server and execute "ulimit -l"; it should print a number of 16384 or larger. If it prints something like 64, the fix is not active for you and you need to reboot your server. As always, if problems continue, please be in touch with us.
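
For reference, the check looks roughly like this (replace the hostname and user with your own server's details):

    ssh admin@your-server.example.com    # log in as you normally would
    ulimit -l
    # 16384 or higher -> the fix is active and no action is needed
    # 64 or similar   -> reboot your server so the new, higher limit takes effect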

We will send this information in an email blast tomorrow morning as we want to send out a complete overview of all issues in a single email to all customers.

In addition, the final fix we made this morning with regard to SSH connectivity seems to have stuck, and now, more than 12 hours later, everything is as it should be. We will give it till tomorrow morning before making the final announcement (to be doubly sure), but there is no doubt in our minds that this issue has also been resolved.

We look forward to putting the problems of the past couple of days behind us and getting back to focusing on optimizing your hosting experience!

Runcloud communication issue identified

A mishap seldom comes alone, as the old adage goes. We have identified the problem with the Runcloud service no longer being able to start up on Webdock servers. It has to do with the Go programming language - which Runcloud uses for the agent that runs on your VPS and communicates with the Runcloud control panel - combined with the fact that we are now on the latest Ubuntu kernel. Some security-related mitigation code in Go chokes on these newer kernels. We are trying to parse this mammoth of a thread describing the issue:

https://github.com/golang/go/issues/37436

There is a lot of information here and a great deal of confusion and controversy going on in the Go community it seems - but if there is anything we can do with our security limits or settings, we will be sure to do so.

To repeat: this is not the fault of Webdock, Ubuntu, or Runcloud for that matter - it seems some boffins at Go shipped a security fix whose logic doesn't hold water on newer Ubuntu kernels.

We hope to be able to work around the issue, and will update here as soon as we can.

SQL Error from LXD fixed - Tentatively calling the latest SSH fix resolved...

Thanks to some diligent work by the guys from the LXD team at Canonical, we relatively quickly identified the SQL error that was being kicked up when restarting containers on one of our hosts. This turned out to be due to some kernel parameters being left at their defaults, which were far too low for the system to function properly. In any case, this issue is resolved.
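
For the curious: we have not listed the exact parameters above, but as an illustration only, this kind of tuning typically means raising kernel limits via sysctl on the host, for example the inotify limits that LXD's production guidance recommends increasing (the values below are illustrative and not necessarily what changed on our hosts):

    sudo sysctl -w fs.inotify.max_user_instances=1024
    sudo sysctl -w fs.inotify.max_queued_events=1048576
    # To persist across reboots, the same keys go into /etc/sysctl.conf or a file under /etc/sysctl.d/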

We have been hesitant to go out and say "The SSH issue is finally fixed!", as we have already done so twice in the past 48 hours only to be proven wrong as the systems slowly crept back into a dysfunctional state. So today we have taken the approach of watching the systems very closely to see whether the same behavior happens again.

We can now happily report that after 6 hours of operation SSH connection times are completely normal. We will let things run till tomorrow before giving this the official all-clear and throwing a party.

However, that leaves us with a fix which involves turning off a component of our virtualization layer that ensures the load average reported inside your server is correct - that is, it reports only your load average and not everybody else's. In most cases this is a cosmetic issue, but Runcloud users or users with external monitoring tools may see reports or get emails informing them of unusually high load when this is not really the case.

We will work with the LXD team and Canonical to solve this issue over the coming days as well, and hopefully be able to re-enable this feature of LXD.

New potential fix implemented - SQL error when restarting servers

We implemented a new fix this morning, and for about 2 hours now everything seems to be running normally. The side effect of this fix is that you no longer get load averages for just your server but for the host system as a whole. This may impact Runcloud users, who will see higher than normal load/activity in their Runcloud control panel.

In other news, we have hit a new error condition today where, on some hosts, containers have problems restarting and kick up a weird SQL error. This is coming from our hypervisor LXD and is presumably a bug in the latest LXD for Focal Fossa. We are in dialogue with the LXD team at Canonical investigating this issue. We have also implemented a mitigation which will hopefully prevent you from hitting it. We are watching things closely, so if you see this error, rest assured that within minutes someone from Webdock will step in, restart your server for you, and notify you directly via email.

SSH Connection Fix may not have solved the issue completely

It seems, as we continue to watch the infrastructure, that the fix we reported earlier today for our SSH connection issues may not have completely cleared up the problem. Although the vast majority of users can connect, in some cases delays of 20 seconds or more before the connection succeeds have been observed. The issue still seems to be the PAM subsystem, which has trouble completing authentication. The delay seems directly tied to load on the host system, and we now suspect that recent kernel tweaks implemented to speed up disk I/O may be at fault.

As we continue to venture down the kernel rabbit hole, we ask our customers to be patient with us, and we promise to resolve this as quickly as possible. In the meantime, if you are experiencing delays but can connect, you can try disabling PAM authentication (should be OK to do on most stacks) by editing /etc/ssh/sshd_config, setting "UsePAM no", and then restarting your SSH daemon with systemctl restart sshd.
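
Concretely, the temporary workaround looks like this (only do this if you are comfortable editing your SSH configuration):

    sudo nano /etc/ssh/sshd_config       # change the UsePAM line to: UsePAM no
    sudo systemctl restart sshd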

If you cannot connect at all, be in touch with support and we will try to restore your connectivity while we work through the issue. We are hitting the night cycle here in Europe, so we will not be working on this at full strength until morning.

We hope to update you all with the good news of a permanent resolution tomorrow at the latest.

19th May 2020

Only a single system affected

It really seems only a single system was affected by the fix not sticking and all others are working as expected. We consider the SSH connection issue finally solved / closed.

Fix does not stick on some systems

Unfortunately it seems the resolution we found does not stick on some systems and a full system reboot is required after all. These hosts will see an additional 5-10 minutes of downtime which will happen over the next couple of hours. We hope to finally have this issue fully resolved once we have gone through the systems that are still exhibiting anomalous behavior and finished rebooting them.

Issue fully resolved and all systems working nominally!

We managed to solve the underlying kernel module issue for everybody with at most 2 minutes of downtime (usually less than a minute) and see all systems working nominally again. We apologize for any inconvenience caused today. If you see any further issues, please be in touch with support.

Underlying issue identified and a fix found!

We have finally identified the underlying issue: a faulty kernel module (br_netfilter). Unfortunately, to fully disable the module and restore functionality we need to reboot all hosts. All customers will see a brief 5-10 minute downtime for their server over the next couple of hours as we reboot.
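
For those interested in the technical detail: disabling a kernel module like this typically means blacklisting it so it is not loaded at boot, roughly along these lines (shown purely for illustration; the exact steps on our hosts may differ):

    echo "blacklist br_netfilter" | sudo tee /etc/modprobe.d/blacklist-br-netfilter.conf
    sudo update-initramfs -u
    sudo reboot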

Continuing to investigate the SSH login issues - Runcloud users may be affected

We are continuing to investigate the SSH login issues. It seems Runcloud users, and probably others who use 3rd party control panels that rely on SSH to perform commands may be unable to communicate with their Webdock server. We expect this issue to go away as soon as we find a solution to the underlying problem with SSH authentication.

Issues with Provisioning and failing Certbot solved

The issue with slow provisioning and failing Certbot has been resolved.

We are still investigating the SSH login / PAM subsystem failure, however.

Provisioning slower than usual, Certbot SSL Certificate Generation may fail

Provisioning may take a bit longer than usual and Certbot certificate generation may fail. If you get a Certbot failure after provisioning, restart the server and then generate the certificate manually.
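
If you need to generate the certificate by hand, a typical Certbot invocation looks something like this (adjust the plugin and domain to your own stack; the domain below is just a placeholder):

    sudo certbot --nginx -d yourdomain.com -d www.yourdomain.com
    # or, on Apache stacks:
    sudo certbot --apache -d yourdomain.com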

We suspect this issue is related to the same underlying cause as the SSH login / PAM submodule issue and are investigating the cause.

Problem with SSH logins on all servers

After our big kernel upgrade yesterday, we have identified an issue which prevents you from logging in with SSH - and FTP if your server uses PAM authentication - relating to the PAM submodule. Disabling the PAM submodule in /etc/ssh/sshd_config (set UsePAM to no) and restarting the sshd daemon solves the issue.
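
If you are comfortable applying the workaround yourself, a one-line version is shown below (as always, take care when editing sshd_config):

    sudo sed -i 's/^#\?UsePAM .*/UsePAM no/' /etc/ssh/sshd_config && sudo systemctl restart sshd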

We are still investigating the underlying mechanism at work here and hope to roll out a fix infrastructure-wide as soon as possible. If you cannot log in with SSH, please be in touch with Webdock Support and we will apply the config file tweak mentioned here, which should restore access.

18th May 2020

Infrastructure upgrade complete

Today's infrastructure upgrade has been completed on schedule and all systems are working nominally. We apologize for any inconvenience caused by any disruption you may have seen today.

Infrastructure upgrade progress update

We are now at about 90% completion with the infrastructure upgrade today

Infrastructure upgrade progress update

We are now at about 70% completion with the infrastructure upgrade today

Infrastructure upgrade progress update

We are now at about 50% completion with the infrastructure upgrade today

Infrastructure upgrade progress update

We are now at about 30% completion with the infrastructure upgrade today

Major Infrastructure Upgrade Today

Today we will be performing important upgrades throughout our infrastructure beginning at 14.00. We expect the work to continue throughout the day until about 22.00 tonight. You will experience your server(s) going down up to 2 times, for a total downtime today of between 7 and 15 minutes. We will post continuous updates here today with an indication of progress. If you need to know which of your servers are currently affected by the upgrade, simply log in to your Webdock control panel, where you will see a big red alert telling you which servers are affected at any given time.

17th May 2020

No incidents reported