Resolved: Network outage in Finland
At around 03.30 CET our network equipment in Finland started reporting connectivity failures. At 05.30 CET, when the morning shift at Webdock took over from the night shift, they immediately became aware of the issue, which the night shift had not noticed due to a failure of our alerting mechanisms. About an hour later the issue had been diagnosed and was being resolved, and we were fully operational shortly before 07.00 CET.
The cause of the issue was, ironically, determined to be the additional flow analysis we implemented after the last network incident we had in Finland. Until now we only had flow exports from the routers sitting behind our edge routers. In the last incident, once the edge routers were overwhelmed by the attack we saw at that time, they stopped passing flows to our switches and thus left us "blind" as to what traffic was directed at our network. For this reason we implemented flow exporting on our edge routers, but from what we can tell - likely due to a software bug in Juniper's Junos OS - this flow exporting caused physical ports which were part of port channels to "flap", entering and leaving the port channels as they stopped sending and receiving LACP packets.
Put in other terms: the simple logging mechanisms of the Juniper routers started malfunctioning in a way that affected connectivity to the equipment sitting behind our routers, causing packet loss.
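To make a flap burst like this easier to spot in the future, the events can be surfaced directly from router syslog. The sketch below is illustrative only: it assumes a simplified, hypothetical log line format, not the actual Junos LACP message format or our monitoring setup.

```python
from collections import Counter

def count_lacp_flaps(syslog_lines):
    """Count LACP state changes per interface.

    Assumes a simplified, hypothetical log format of the form
    "<interface>: LACP state changed to <STATE>"; real Junos
    messages differ.
    """
    flaps = Counter()
    for line in syslog_lines:
        if "LACP state changed" in line:
            iface = line.split(":", 1)[0].strip()
            flaps[iface] += 1
    return flaps

logs = [
    "xe-0/0/1: LACP state changed to DETACHED",
    "xe-0/0/1: LACP state changed to COLLECTING_DISTRIBUTING",
    "xe-0/0/2: LACP state changed to DETACHED",
]
flaps = count_lacp_flaps(logs)
print(flaps)  # xe-0/0/1 flapped twice, xe-0/0/2 once
```

An interface showing many state changes in a short window is a strong signal of the kind of port-channel instability we saw here.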
The reason for the slow reaction time on our side was firstly that this happened during the night shift, but primarily the nature of the problem: we were not without connectivity, it was just severely degraded. This caused our secondary alerting mechanism not to escalate the alerts to a crit state (which would have resulted in numerous alarms and people being woken up). The primary monitoring directly on our equipment caught the issue immediately, but those monitors are not connected to the physical alerting mechanisms we have.
The task for us today is twofold:
1. Make sure that when the network is degraded like we saw today, there WILL be alarms triggered which wake up our network team.
2. Look into how we can export flows at our edge so we have the visibility we need in case of future attacks on our infrastructure, without relying on the apparently buggy implementation in our Juniper routers.
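On the first point, the escalation logic we are after can be sketched roughly as follows. The thresholds and names are purely illustrative, not our actual monitoring configuration: the key change is that sustained partial packet loss escalates to crit instead of sitting at warn indefinitely.

```python
def alert_severity(loss_pct, degraded_minutes,
                   loss_threshold=5.0, crit_after=10):
    """Map packet loss to an alert severity.

    loss_pct:          observed packet loss in percent
    degraded_minutes:  how long the degradation has persisted
    Thresholds are hypothetical examples.
    """
    if loss_pct >= 100.0:
        return "crit"   # total outage: page immediately
    if loss_pct >= loss_threshold:
        # Degraded (but not dead) links previously stayed at warn;
        # escalate once the degradation persists long enough.
        return "crit" if degraded_minutes >= crit_after else "warn"
    return "ok"

print(alert_severity(30.0, 5))   # still within grace period
print(alert_severity(30.0, 15))  # sustained degradation escalates
```

This is the behaviour today's incident lacked: degraded-but-alive connectivity eventually pages someone, rather than only a total loss of connectivity doing so.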
We apologize profusely for the disruption; we should have done better in this case. This was Murphy's law hitting us hard, and we had not expected our routers to crap their pants on something as simple as flow exports to our logging facilities.
For a bit more than an hour now we have been up and operational. We will publish a post-mortem of this incident here shortly.
The network team reports that the issue seems to be a problem with interconnectivity between our router and switches in the DC. Without getting too technical, ports are apparently "flapping" up and down, causing the disruption we are seeing. We will update once we know more.
We are currently investigating this incident. It potentially looks similar to the incident we saw the other day.