Elevated Device URLs/Cloudlink Errors
Incident Report for balena.io
Postmortem

We recently experienced instability in our Cloudlink service, which affects webterminal, SSH, tunnels, actions, and public URLs.

We host our Cloudlink servers along with some other parts of the balena-cloud infrastructure on a Kubernetes cluster.

Kubernetes scales and reschedules those servers depending on the load.

When a Cloudlink server needs to be shut down, its connections are drained. This is expected and, in most cases, transparent, because devices automatically reconnect to another Cloudlink server.
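To illustrate the reconnect behaviour described above, here is a minimal sketch of a client-side reconnect loop with capped exponential backoff and jitter. It is not balena's actual device code: the function names (`backoff_delays`, `reconnect`), the `try_connect` callback, and the default timings are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence


def backoff_delays(base: float = 1.0, cap: float = 30.0, attempts: int = 5) -> List[float]:
    """Upper bound for each retry delay: base * 2**n, capped at `cap` seconds."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]


def reconnect(
    servers: Sequence[str],
    try_connect: Callable[[str], bool],  # hypothetical: returns True on success
    sleep: Callable[[float], None],      # e.g. time.sleep in real use
    rng: Callable[[], float] = random.random,
    base: float = 1.0,
    cap: float = 30.0,
    attempts: int = 5,
) -> Optional[str]:
    """Try servers in rotation; return the first that accepts, or None."""
    for n, limit in enumerate(backoff_delays(base, cap, attempts)):
        server = servers[n % len(servers)]
        if try_connect(server):
            return server
        # "Full jitter": wait a random fraction of the current limit so that
        # many drained devices do not all retry at the same instant.
        sleep(rng() * limit)
    return None
```

In a scheme like this, a device that is drained from one server normally lands on another within the first attempt or two. If the new server is itself moved shortly afterwards, the device has to go through the loop again, which lengthens the time it stays disconnected.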

The problem we have seen lately is that our Kubernetes cluster has been moving Cloudlink servers onto hosts that cannot sustain them at that moment. This forces another move shortly afterwards, before devices have had time to reconnect.

Fortunately, only a small number of devices are affected, because only a few servers are moved at any given time, but it is still a serious problem for the customers who own those devices.

We have already deployed remediation to reduce the time it takes to move one of those servers, and we are working on several solutions to further reduce both the recovery time and the occurrence of such issues.

Posted Aug 10, 2023 - 15:30 UTC

Resolved
This incident has been resolved.
Posted Aug 09, 2023 - 18:05 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 09, 2023 - 16:52 UTC
Investigating
We're experiencing an elevated level of errors in our Device URLs and Cloudlink infrastructure and are currently looking into the issue.
Posted Aug 09, 2023 - 14:55 UTC
This incident affected: Device URLs and Cloudlink (VPN).