All systems operational

Resolved
Host down in Canada

Started
November 29, 2023 at 6:47 AM
Status
Resolved after 1 day

Impact

Partial outage
Affected
Canada: General Infrastructure
  • Resolved
    Resolved

    After almost 24 hours of uptime, we are calling this issue tentatively resolved, but we will continue to monitor the system in the coming days.

  • Monitoring
    Monitoring

    This system continues to vex us. We are unable to determine the root cause, as all we see is a kernel crash. Temperatures and voltages are fine, and we see no ECC RAM errors or CPU exceptions, so this certainly looks software-related. Earlier in the day we fully upgraded the system and hypervisor on the machine, which did not help. To us, it really looks like some customer workload on the machine is causing the kernel to panic, and at an increasing rate today. We will now disable a couple more high-activity customer instances in an effort to locate the culprit.

  • Resolved
    Resolved

    We now hope we have correctly identified the root cause of the instability issues and kernel crashes we have seen today. We will continue to watch the system closely, and we sincerely hope we will see no further issues here.

  • Monitoring
    Monitoring

    Unfortunately, the system is again showing stability issues. We think we have now narrowed this issue down to a misbehaving virtual machine which is somehow crashing the kernel. We will boot the system again now and disable that VPS. Hopefully that will fix this issue.

  • Resolved
    Resolved

    This incident has been resolved for now. We are trying some things and are watching this system closely. It looks like some user is running a workload which from time to time hits a kernel fault, causing the system to freeze up. If this keeps happening, we will suspend or stop this user's server in order to bring stability back to the system.

  • Monitoring
    Monitoring

    It seems this system is having problems again. We will bring it up ASAP and investigate further.

  • Resolved
    Resolved

    This incident has been resolved. All VPS servers are up, and we hope this instability issue is behind us now.

  • Identified
    Update

    After having inspected all hardware, we now believe this is a kernel/software issue and will be running a full upgrade of the system, after which we will perform another reboot. You can expect your server to be up in 5-10 minutes from now.

  • Identified
    Identified

    We have a host down in Canada again, the same one as the other day. This may indicate a lurking hardware failure, which we will investigate further. We will start by getting the system up, which should take about 5-10 minutes.