All systems operational

Finland: Network Infrastructure
99.90% uptime
Nov 2023: 99.69% · Dec 2023: 100.0% · Jan 2024: 100.0%
Finland: Storage Backend
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%

Finland: General Infrastructure
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%
Canada: Network Infrastructure
99.95% uptime
Nov 2023: 100.0% · Dec 2023: 99.85% · Jan 2024: 100.0%
Canada: Storage Backend
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%

Canada: General Infrastructure
99.52% uptime
Nov 2023: 98.97% · Dec 2023: 99.60% · Jan 2024: 99.98%
Webdock Statistics Server
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%
Webdock Dashboard
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%
Webdock Website
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%
Webdock Image Server
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%
Webdock REST API
100.0% uptime
Nov 2023: 100.0% · Dec 2023: 100.0% · Jan 2024: 100.0%
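
To relate the percentages above to actual downtime, here is a small illustrative Python sketch. It is not part of any Webdock tooling; it simply converts a monthly uptime percentage into the approximate downtime it implies, using figures taken from the tables above.

    # Illustrative sketch only: approximate downtime implied by a monthly uptime percentage.
    def downtime_minutes(uptime_percent: float, days_in_month: int) -> float:
        """Minutes of downtime implied by a given uptime percentage over one month."""
        total_minutes = days_in_month * 24 * 60
        return total_minutes * (1 - uptime_percent / 100)

    # Figures from the tables above:
    print(round(downtime_minutes(99.69, 30)))  # Finland network, Nov 2023 -> ~134 minutes
    print(round(downtime_minutes(98.97, 30)))  # Canada general, Nov 2023  -> ~445 minutes
    print(round(downtime_minutes(99.85, 31)))  # Canada network, Dec 2023  -> ~67 minutes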

Notice history

Dec 2023

Nov 2023

Host down in Canada
  • Resolved

    All customers have been migrated and are operational. We are now running a full barrage of tests on the affected system and will likely decommission it. We apologize for the repeated disruption these past 48 hours.

  • Identified

    We have decided to go ahead and move all customers away from this failing machine, as we are clearly unable to stop the crashes from happening or determine exactly what is causing the issue. It seems to be load-dependent, and the kernel does not give us any information when the halts happen. The system simply freezes up, and our hardware monitoring is not showing any apparent CPU or RAM failures.

    In any case, we will move customers as quickly as we can today on a best-effort basis. As we are already in a bit of a capacity crunch in Canada, some users may find that they have been upgraded to an equivalent Ryzen profile. But we will of course notify those users if that is the case.

    You are probably in no doubt as to whether you are affected, but if you are unsure, you can check by logging in to the Webdock dashboard, where a big red alert will tell you which of your KVM machines are affected.

  • Resolved

    This incident has been resolved, for now. The team will once again take a look at this system tomorrow during office hours.

  • Investigating

    The same host which has given us problems recently went down again. The system is currently booting and all VPS servers should be up shortly.

Network outage in Finland
  • Postmortem

    At around 03:30 CET our network equipment in Finland started reporting connectivity failures. At 05:30 CET, when the morning shift at Webdock took over from the night shift, they immediately became aware that there was an issue, which the night shift had not noticed due to a failure of our alerting mechanisms. About an hour later the issue had been diagnosed and was being resolved, and we were fully operational shortly before 07:00 CET.

    The cause of the issue was, ironically, determined to be the additional flow analysis we implemented after the last network incident we had in Finland. This additional flow analysis was implemented because, until now, we only had flow exports from the routers sitting behind our edge routers. In the last incident, once the edge routers were overwhelmed by the attack we saw at that time, they stopped passing flows to our switches and thus left us "blind" with regards to what traffic was directed at our network. For this reason we implemented flow exporting from our edge routers, but from what we can tell - likely due to a software bug in Juniper's JunOS - this flow exporting caused physical ports which were part of port channels to "flap" by entering and leaving the port channels as they stopped sending/receiving LACP packets.

    Put in other terms: the simple logging mechanisms of the Juniper routers started malfunctioning in such a way that they affected connectivity to the equipment sitting behind our routers, causing packet loss.

    The reason for the slow reaction time on our side was partly that this happened during the night shift, but primarily the nature of the problem: we were not without connectivity, it was just severely degraded. This caused our secondary alerting mechanism not to escalate the alerts to a critical state (which would have resulted in numerous alarms and people being woken up). The primary monitoring directly on our equipment caught the issue immediately, but those monitors are not connected to the physical alerting mechanisms that we have.

    The task for us today is twofold: 1. Make sure that when the network is degraded like we saw today, there WILL be alarms triggered which will wake up our network team (a rough sketch of such a degradation check is included below this incident). 2. Look into how we can export flows at our edge so we have the visibility we need in case of future attacks on our infrastructure, without relying on the apparently buggy implementation in our Juniper routers.

    We apologize profusely for the disruption; we should have done better in this case. This was Murphy's law hitting us hard, and we had not expected our routers to crap their pants on something as simple as flow exports to our logging facilities.

  • Resolved

    For a bit more than an hour now we have been up and operational. We will publish a post-mortem of this incident here shortly.

  • Identified

    The network team reports that the issue seems to be a problem with interconnectivity between router and switches in the DC. Without getting too technical, ports are apparently "flapping" up and down, causing the disruption we are seeing. We will update once we know more.

  • Investigating

    We are currently investigating this incident. It potentially looks similar to the incident we saw the other day.
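
A minimal sketch of the kind of degradation check described in the postmortem above (escalating on sustained packet loss rather than only on a total outage) might look as follows. This is a hypothetical illustration, not Webdock's actual monitoring: the probe hostnames, the loss threshold and the paging step are invented for the example.

    #!/usr/bin/env python3
    # Hypothetical degradation check: escalate when sustained packet loss crosses a
    # threshold, not only when connectivity is lost entirely. Probe targets, the
    # threshold and the "page someone" step are placeholders for this sketch.
    import re
    import subprocess

    PROBE_TARGETS = ["probe-fi.example.net", "probe-ca.example.net"]  # hypothetical probes
    LOSS_CRIT_PERCENT = 20.0  # sustained loss above this should wake up the network team
    PINGS_PER_PROBE = 20

    def packet_loss(host: str) -> float:
        """Return the packet-loss percentage reported by the system ping utility."""
        out = subprocess.run(
            ["ping", "-c", str(PINGS_PER_PROBE), "-q", host],
            capture_output=True, text=True,
        ).stdout
        match = re.search(r"(\d+(?:\.\d+)?)% packet loss", out)
        return float(match.group(1)) if match else 100.0  # no answer counts as total loss

    def main() -> None:
        worst = max(packet_loss(host) for host in PROBE_TARGETS)
        if worst >= LOSS_CRIT_PERCENT:
            # A real setup would page the on-call engineer here instead of printing.
            print(f"CRITICAL: {worst:.0f}% packet loss observed - escalating")
        else:
            print(f"OK: worst observed packet loss {worst:.0f}%")

    if __name__ == "__main__":
        main()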

Host down in Canada
  • Resolved

    After almost 24 hours of uptime we are calling this issue tentatively resolved, but will continue to monitor this in the coming days.

  • Monitoring

    This system continues to vex us. We are unable to determine the root cause, as all we see is a kernel crash. Temperatures and voltages are fine, and we see no ECC RAM errors or CPU exceptions, so this certainly looks software-dependent. Earlier today we fully upgraded the system and hypervisor on the machine, which did not help. To us, it really looks like some customer workload on the machine is causing the kernel to panic, and at an increasing rate today. We will now disable a couple more high-activity customer instances in an effort to locate the culprit.

  • Resolved

    We now hope we have correctly identified the root cause of the instability issues / kernel crashes we have seen today. We will continue to watch the system closely, and we sincerely hope we will see no further issues here.

  • Monitoring

    Unfortunately the system is again showing stability issues. We think we have now narrowed this issue down to a misbehaving virtual machine which is somehow crashing the kernel. We will boot the system again now and disable that VPS. Hopefully that will fix the issue.

  • Resolved

    This incident has been resolved for now. We are trying some things and are watching this system closely. It looks like some user is running a workload which from time to time hits a kernel fault, causing the system to freeze up. If this keeps happening, we will suspend/stop this user's server in order to bring stability back to the system.

  • Monitoring

    It seems this system is having problems again. We will bring it back up ASAP and investigate further.

  • Resolved

    This incident has been resolved; all VPS servers are up, and we hope this instability issue is behind us now.

  • Identified
    Update

    After having inspected all hardware, we now believe this is a kernel/software issue and will be running a full upgrade of the system, after which we will perform another reboot. You can expect your server to be up in 5-10 minutes from now.

  • Identified

    We have a host down in Canada again, the same one as the other day. This may indicate there is some lurking hardware failure, which we will investigate further. We will start by getting the system back up, which should take about 5-10 minutes.

Network outage in Finland
  • Resolved

    We managed to work around the issue and it turned out we did not need another reboot after all. For this reason, we will call this issue resolved for now. We will keep monitoring the network of course and will open another incident if anything further happens. For now at least, we are good.

  • Monitoring
    Update

    We will be performing another reboot of key systems shortly. This will bring down the network for about 10 minutes. If network connectivity is OK after the reboot, then no further action will be taken and we will mark this incident as resolved.

    We apologize for any and all inconvenience this has caused tonight - trust us when we say that this has been no fun for us either :)

  • Monitoring
    Update

    We implemented a fix and are currently monitoring the result. All connectivity should now be OK. Unfortunately, we may need to reboot some systems causing another brief period of downtime of about 10-15 minutes before we can call this completely resolved. We will update here if that turns out to be required.

  • Monitoring
    Update

    Unfortunately we are still experiencing a severely degraded network. We identified an attack towards one of our customers and discarded that traffic, but it was not a high-volume attack: only about 2 Gbit/s and 1.2 million packets per second. As you can see, mitigating this attack did not improve the situation, so there are more attacks and/or issues at work at this time, which we continue to investigate.

  • Monitoring

    Still seeing degraded performance unfortunately, and still investigating the situation. Our network team is hard at work trying to fix this issue and has already tried a few things to narrow down the source, like disabling IPv6 to see if the traffic was originating on that part of our network, and disabling certain other features. So far this looks to be an IPv4-based attack, which is also interesting as IPv4 seems a lot less impacted than IPv6 - but this may be a quirk of how our hardware works rather than anything indicative of the source. We see periods of good functionality with periods of packet loss in between, with about 50-60% packet loss on IPv4 continuing at this time.

  • Identified
    Update

    After a key equipment reboot and various diagnostics we can now conclusively say this is an attack on our infrastructure and not a hardware fault. The reason we couldn't see this immediately is that our monitoring is not flagging anything and our hardware is not being overloaded in the typical way where CPU or memory usage is super high (like when we have to deal with a lot of packets) - so this is some new type of attack we are unfamiliar with.

    We are also checking with our upstream provider (Hetzner) why their DoS filtering hasn't caught this and whether they can spot the malicious traffic.

    We will keep updating here as we learn more.

  • Identified
    Update

    Key equipment is being rebooted. This process can take up to 10-15 minutes.

  • Identified

    We are now working on a fix which entails taking some equipment down for a reboot; this means we have a complete outage at the moment. We hope to be back up very soon.

  • Investigating
    Update

    We apologize for the wait for a fix here and the continued degraded performance. We are still pinpointing the cause; this is not something we have seen before, and we are trying to figure out whether it's indeed a hardware fault or an attack of some sort. Thank you for your patience this evening.

  • Investigating
    Update

    We are continuing to investigate the incident. IPv6 is much more impacted than IPv4: we are seeing much higher packet loss and latency on IPv6, as high as 90% packet loss and up to 6 seconds of latency. Without having found the exact root cause, this is now looking less like a DoS attack and more like a hardware fault or an issue with our Arista switches in the Finland DC, as opposed to our firewalls being overwhelmed by malicious traffic. This is not conclusive yet, however, and we are still locating the root cause.

  • Investigating
    Update

    We are seeing high latency and packet loss and not a complete outage as first observed. This is indicative of a DOS attack or similar event. We are continuing to work on identifying the root cause and a fix for this incident.

  • Investigating

    Looks like we are experiencing a network outage in Finland. We are currently investigating this incident.
