Thanks to some diligent work by the guys from the LXD team at Canonical, we relatively quickly identified the SQL error that was being thrown when restarting containers on one of our hosts. This turned out to be due to some kernel parameters being left at their defaults, which were far too low for the system to function properly. In any case, this issue is resolved.
We have been hesitant to go out and say "The SSH issue is finally fixed!", as we have already done so twice in the past 48 hours, only to be proven wrong as the systems slowly crept back into a dysfunctional state. So today we have taken the approach of watching the systems very closely to see whether the same behavior happens again.
We can now happily report that after 6 hours of operation, SSH connection times are completely normal. We will let things run until tomorrow before giving this the official all-clear and throwing a party.
However, that leaves us with a fix that involves turning off a component of our virtualization: the one that ensures the load average reported inside your server is correct and reflects only your own load, not everybody else's. In most cases this is a cosmetic issue, but RunCloud users and users with external monitoring tools may see reports or get emails informing them of unusually high load, even though this is not really the case.
We will work with the LXD team and Canonical to solve this issue over the coming days as well, and hopefully be able to re-enable this feature of LXD.