Webdock - Notice history

All systems operational

Denmark: Network Infrastructure - Operational

Uptime — Nov 2025: 100.0% · Dec 2025: 99.96% · Jan 2026: 99.99%

Denmark: Storage Backend - Operational

Uptime — Nov 2025: 100.0% · Dec 2025: 100.0% · Jan 2026: 100.0%

Denmark: General Infrastructure - Operational

Uptime — Nov 2025: 99.98% · Dec 2025: 99.97% · Jan 2026: 99.48%

Webdock Statistics Server - Operational

Uptime — Nov 2025: 100.0% · Dec 2025: 100.0% · Jan 2026: 100.0%

Webdock Dashboard - Operational

Uptime — Nov 2025: 100.0% · Dec 2025: 100.0% · Jan 2026: 100.0%

Webdock Website - Operational

Uptime — Nov 2025: 100.0% · Dec 2025: 100.0% · Jan 2026: 100.0%

Webdock Image Server - Operational

Uptime — Nov 2025: 100.0% · Dec 2025: 100.0% · Jan 2026: 100.0%

Webdock REST API - Operational

Uptime — Nov 2025: 100.0% · Dec 2025: 100.0% · Jan 2026: 100.0%
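
For reference, an illustrative Python snippet (not part of the status page itself) showing how a monthly uptime percentage translates into downtime minutes, using two of the figures from the widgets above:

# Illustrative only: convert a monthly uptime percentage into downtime minutes.
def downtime_minutes(uptime_percent: float, days_in_month: int) -> float:
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)

print(round(downtime_minutes(99.48, 31)))      # ~232 minutes (General Infrastructure, Jan 2026)
print(round(downtime_minutes(99.96, 31), 1))   # ~17.9 minutes (Network Infrastructure, Dec 2025)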

Notice history

Jan 2026

Issue with another EPYC host
  • Resolved

    We have completed all migrations. This should conclude this incident. We apologize for any inconvenience caused.

  • Update

    All customer instances are now being started on the unstable system. You should see your VPS come up very soon. Migrations will begin shortly.

  • Update

    Unfortunately, it turns out this system cannot run in a single-CPU layout while keeping our NVMe drives functional. The only remaining option is to reinsert the faulty CPU and live-migrate all users away from this system as quickly as we can. You will receive migration start and end notifications by email. We expect to complete the migrations before the faulty CPU acts up again, and we will update here once they are complete and this issue is fully resolved. We do not have a firm ETA; this could take a couple of hours. You should see your instance come up shortly; at some point it will go down for a minute or two while being started in the new location, after which you should see no further disruption.

  • Monitoring

    It turns out the fault did follow the CPU, so the CPU is simply bad. We have now removed the CPU and booted the system in a single-CPU configuration. However, this resulted in our NVMe drives no longer being visible. For this reason we are moving the healthy CPU to the other socket, in the hope that the PCIe lanes for the drives are tied to that socket and we can run on that single socket. If it turns out both CPUs are required for the NVMe drives to come up correctly, we will need to reinsert the bad CPU and migrate all customers currently on this system away from it as quickly as possible. We will update once we know more.

    Unfortunately we do not have a spare CPU of this exact type available in the DC, so these are the options open to us at the moment.
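
    For context on why the drives may be tied to one socket: on Linux, each NVMe controller's PCIe attachment is exposed via sysfs, so the NUMA node (and hence the CPU socket its lanes terminate on) can be read directly. A minimal Python sketch, not our actual tooling, assuming a standard Linux sysfs layout:

    # Minimal sketch: map each NVMe controller to the NUMA node (CPU socket)
    # its PCIe lanes report. On a two-socket EPYC host this is typically 0 or 1;
    # -1 means the platform reports no affinity.
    import glob
    import os

    def nvme_numa_map():
        mapping = {}
        for ctrl in sorted(glob.glob("/sys/class/nvme/nvme*")):
            name = os.path.basename(ctrl)
            try:
                with open(os.path.join(ctrl, "device", "numa_node")) as fh:
                    mapping[name] = int(fh.read().strip())
            except OSError:
                mapping[name] = None  # controller currently not visible to the OS
        return mapping

    if __name__ == "__main__":
        for ctrl, node in nvme_numa_map().items():
            print(f"{ctrl}: NUMA node {node}")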

  • Identified

    The issue reappeared. We are looking into it.

  • Postmortem

    Incident Post-Mortem – Unexpected Server Reboots

    Affected system: Single compute node (Dell R6525, dual AMD EPYC)

    Summary

    One compute node experienced repeated unexpected reboots caused by hardware-level Machine Check Exceptions (MCEs) reported by the system firmware and operating system. The issue was resolved after on-site hardware intervention, and the system is now operating normally.
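
    For reference, a minimal Python sketch (assuming a systemd-based Linux host; the exact kernel log wording varies by kernel version, so the match patterns are deliberately loose) of how MCE reports like these can be pulled out of the kernel log from the boot that preceded a crash:

    # Minimal sketch: scan kernel messages from the previous boot for
    # Machine Check Exception (MCE) reports.
    import subprocess

    MCE_PATTERNS = ("mce:", "machine check", "hardware error")

    def mce_lines_from_previous_boot():
        # -k: kernel messages only, -b -1: the boot before the current one
        out = subprocess.run(
            ["journalctl", "-k", "-b", "-1", "--no-pager"],
            capture_output=True, text=True, check=False,
        ).stdout
        return [line for line in out.splitlines()
                if any(pat in line.lower() for pat in MCE_PATTERNS)]

    if __name__ == "__main__":
        for line in mce_lines_from_previous_boot():
            print(line)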

    Impact

    Customers hosted on this node experienced service interruptions during the reboot loop. No data loss occurred.

    Root Cause (most likely)

    The most likely cause was a marginal CPU socket contact (pin pressure / seating issue) on one processor socket. This can occasionally occur even on new systems and may only surface after some time in production.

    When the CPUs were removed, inspected, reseated, and swapped between sockets, the errors stopped and have not recurred.
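
    As a tiny illustration of the reasoning behind a reseat-and-swap test (the function and its inputs below are hypothetical, not our tooling): where the errors reappear after the swap indicates whether the CPU or the socket is at fault, and errors that stop entirely point to a seating issue corrected by the reseat itself, as happened here.

    # Illustrative decision logic for a reseat-and-swap test (hypothetical helper).
    from typing import Optional

    def diagnose_after_swap(faulty_socket_before: int,
                            faulty_socket_after: Optional[int]) -> str:
        if faulty_socket_after is None:
            # Errors stopped entirely: consistent with marginal contact/seating
            # that the reseat itself corrected.
            return "marginal socket contact, cleared by reseat"
        if faulty_socket_after != faulty_socket_before:
            # The fault moved with the CPU package: the CPU itself is bad.
            return "faulty CPU (fault followed the part)"
        # The fault stayed on the same socket regardless of which CPU sits in it.
        return "socket or board fault (fault stayed with the socket)"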

    Other causes considered

    While investigating, we also evaluated and ruled out:

    • ECC memory failures (no memory errors were logged by firmware or iDRAC)

    • Operating system or kernel issues

    • Sustained thermal overload

    Other less likely contributors include transient socket power instability or inter-CPU fabric retraining issues, both of which can be cleared by a full power-off and reseat.

    Background

    The server was newly installed approximately 1½ months ago and successfully passed a 10-hour full system stress test before being placed into production. The issue developed later and was not present during initial burn-in.

    Resolution & current status

    • CPUs were reseated and swapped between sockets

    • System firmware counters were cleared

    • The server is now stable and operating normally under load

    • Ongoing monitoring has been increased

  • Resolved

    This incident has been resolved. Once again, sorry for the inconvenience.

  • Identified

    Our DC guys are looking into the issue. It looks like one of the CPUs has failed (dual-CPU setup). They are working to bring the host back up.

    Apologies for the inconvenience.

One of our AMD EPYC hosts needs a reboot
  • Resolved

    This incident has been resolved.

    Post-mortem: Dual NVMe Drive Failure on EPYC Host

    Here is our post-mortem for the incident today which caused extended downtime for approximately 317 EPYC-based customer instances.


    Summary

    Earlier today, a single EPYC hypervisor experienced a storage failure following a planned administrative restart. The restart itself was routine and performed to address degraded disk I/O performance that had been observed over the preceding days.

    Following the reboot, the system failed to come back online due to the unexpected loss of two NVMe drives, which together formed a complete ZFS top-level mirror vdev. The simultaneous loss of both members of a mirror rendered the ZFS pool unimportable and resulted in extended downtime while recovery operations were performed.


    Timeline and Detection

    Prior to the restart, we performed standard pre-maintenance checks:

    • The ZFS storage pool reported as ONLINE

    • No critical ZFS alerts were present

    • No hardware warnings or failures were reported by Dell iDRAC / IPMI

    • There was a single historical ZFS write error recorded on one device, but this was not accompanied by device faulting, checksum storms, or pool degradation

    This type of isolated write error is something we occasionally observe across large fleets and, based on long operational experience, does not normally indicate imminent or catastrophic failure. The expectation was therefore to proceed with a controlled reboot, followed by a scrub if necessary.

    At the time of the restart, there were no predictive indicators from either the storage layer or the hardware management layer that suggested an elevated risk of failure.
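
    For illustration, a minimal Python sketch (not our actual runbook; it assumes a Linux host with the zfsutils and smartmontools packages installed, and it covers only the local-OS side — the iDRAC/IPMI checks are done out of band) of the kind of pre-maintenance health checks described above:

    # Minimal sketch of local pre-maintenance storage health checks.
    # "zpool status -x" prints "all pools are healthy" when nothing is degraded;
    # "smartctl -H <dev>" reports overall SMART/NVMe health for each controller.
    import glob
    import subprocess

    def run(cmd):
        return subprocess.run(cmd, capture_output=True, text=True, check=False).stdout.strip()

    def pre_maintenance_report():
        report = {"zpool": run(["zpool", "status", "-x"])}
        for dev in sorted(glob.glob("/dev/nvme[0-9]")):
            report[dev] = run(["smartctl", "-H", dev])
        return report

    if __name__ == "__main__":
        for check, result in pre_maintenance_report().items():
            print(f"== {check} ==\n{result}\n")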


    Failure Event

    Upon reboot, the system failed to import its ZFS pool. Investigation revealed that two NVMe drives were no longer available to the system. These two drives together constituted an entire mirror vdev at the top level of the pool.

    In ZFS, the loss of a complete top-level vdev makes a pool unrecoverable by design, as data is striped across vdevs and cannot be reconstructed without at least one surviving replica.

    This immediately escalated the incident from a routine maintenance task to a full host recovery operation.
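
    To make the design rule above concrete, here is a simplified model in Python (schematic only, not ZFS code): because data is striped across all top-level vdevs, a pool can be imported only if every top-level vdev still has at least one healthy member.

    # Simplified model of ZFS pool importability.
    def pool_importable(top_level_vdevs):
        """top_level_vdevs: list of vdevs, each a list of booleans (True = healthy member)."""
        return all(any(members) for members in top_level_vdevs)

    # Schematic layout for this incident: healthy mirrors plus the one mirror
    # whose two NVMe members both failed.
    healthy_mirror = [True, True]
    failed_mirror = [False, False]   # both members lost => whole top-level vdev gone

    print(pool_importable([healthy_mirror, healthy_mirror, failed_mirror]))  # False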


    Hardware Investigation

    With assistance from on-site datacenter technicians, extensive hardware diagnostics were performed:

    • Drives were reseated and moved to known-good NVMe bays

    • Backplane, cabling, and PCIe connectivity were verified

    • BIOS and iDRAC inventory were reviewed

    • Power cycling and cold starts were attempted

    The results were conclusive:

    • One drive was no longer detected at all by the system

    • The second drive was detected but reported a capacity of 0 GB and failed initialization

    At this point, it was clear that both NVMe drives had suffered irreversible failure, likely at the controller or firmware level.


    Why This Was Exceptionally Unlikely

    This failure mode is statistically extreme:

    • Both drives were enterprise-grade NVMe devices

    • Both were members of a mirror specifically designed to tolerate single-device failure

    • There were no SMART, iDRAC, or ZFS indicators suggesting a pending fault

    • The failures occurred effectively simultaneously and only became fully visible after a reboot

    In many thousands of host-years of operation, we have not previously encountered a scenario where both members of a ZFS mirror failed in such close succession without advance warning.

    The absence of meaningful alerts meant that there was no operational signal that would normally justify preemptive action such as taking the host out of service prior to the reboot.


    Impact

    • Approximately 317 customer instances on the affected host experienced downtime

    • The host itself required full storage reinitialization

    • Customer instances were restored from backup snapshots via our Incus-based recovery infrastructure

    Because the incident occurred while the current daily backup cycle was still in progress, restore points varied:

    • Approximately 20% of instances were recovered from backups taken earlier the same morning

    • Approximately 80% of instances were recovered from the most recent completed weekly backup, taken the previous morning (CET)


    Recovery and Resolution

    Once it was clear that the local ZFS pool could not be recovered:

    • The affected storage pool was destroyed and recreated

    • The host was re-initialized cleanly

    • Customer instances were restored from the most recent available snapshots

    • All affected services were brought back online


    Lessons Learned and Preventive Measures

    Although this incident stemmed from an extremely improbable hardware failure, we are still taking concrete steps to reduce the blast radius of similar edge cases in the future:

    • More conservative handling and escalation of any ZFS device-level errors, even when isolated (see the sketch after this list)

    • Additional scrutiny around storage health prior to maintenance reboots on high-density hosts

    • Adjustments to maintenance timing relative to active backup windows

    • Review of power and firmware interactions specific to NVMe devices under sustained I/O load

    • Continued evaluation of pool layout and recovery strategies to further limit worst-case scenarios
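
    As an example of the first point in the list above, a minimal Python sketch (illustrative only; it shells out to "zpool status -p", which prints exact per-device READ/WRITE/CKSUM counters, and the pool name is a placeholder) that treats any nonzero device-level error counter as something to escalate, even when the pool itself still reports ONLINE:

    # Flag any vdev or device row in "zpool status -p" output with nonzero
    # READ/WRITE/CKSUM error counters.
    import subprocess

    def devices_with_errors(pool: str):
        out = subprocess.run(["zpool", "status", "-p", pool],
                             capture_output=True, text=True, check=False).stdout
        flagged = []
        for line in out.splitlines():
            fields = line.split()
            # Config-section rows look like: NAME STATE READ WRITE CKSUM [...]
            if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
                read, write, cksum = (int(fields[i]) for i in (2, 3, 4))
                if read or write or cksum:
                    flagged.append((fields[0], read, write, cksum))
        return flagged

    if __name__ == "__main__":
        for name, r, w, c in devices_with_errors("tank"):  # "tank" is a placeholder pool name
            print(f"escalate: {name} read={r} write={w} cksum={c}")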


    Closing Notes

    This incident was not caused by a single mistake, misconfiguration, or ignored alert. It was the result of a rare and unfortunate convergence of hardware failures that only became fully apparent at reboot time.

    While ZFS behaved exactly as designed — refusing to mount a pool whose integrity could not be proven — the lack of advance warning made the outcome both surprising and severe.

    We regret the disruption caused and appreciate the patience shown while recovery was underway. Incidents like this feed directly into improving our operational resilience and recovery procedures going forward.

  • Update

    Some bad news: unfortunately our DC guys found a rare, critical two-drive failure. We will reload all servers on the affected host from the latest snapshots we have. Our sincere apologies for this.

  • Monitoring

    There seems to be a problem with some cabling in the physical host. The DC guys are on this. Sorry for the inconvenience.

  • Identified

    The physical host needs a reboot. Instances running there will see 15-20 minutes of downtime.

Dec 2025
