Elevated Device URLs/VPN Errors

Incident Report for balena.io

Postmortem

We observed degraded Cloudlink (VPN) connections following several successive API release deployments. The deployments were spread out over the course of a day, and connections took some time to settle on their own, without any manual intervention. This pattern is generally referred to as a "thundering herd": thousands of devices attempt to connect to a new node at the same time and get rate limited.
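As a general illustration of how clients can soften a thundering herd, the sketch below computes a reconnect delay using exponential backoff with full jitter. It is a minimal example under stated assumptions, not a description of how balena devices or Cloudlink actually schedule reconnects; the function name `reconnect_delay` and the `base` and `cap` values are hypothetical.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return a randomized delay in seconds before reconnect attempt `attempt` (0-based).

    Hypothetical helper: `base` and `cap` are illustrative values only.
    """
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    exponent = min(max(attempt, 0), 32)
    # The upper bound doubles with each attempt until it reaches the cap.
    ceiling = min(cap, base * (2 ** exponent))
    # Full jitter: pick uniformly in [0, ceiling] so clients that were
    # disconnected together do not all retry at the same instant.
    return rng.uniform(0.0, ceiling)
```

Spreading retries over a randomized window means a rate-limited node sees a steadier trickle of connection attempts instead of repeated synchronized bursts.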

Upon investigation, we found that at peak usage the load balancing policies in place for our TCP Cloudlink connections were not optimized to avoid proxying through nodes that were scaling up or scaling down during deploys. Because a proxied TCP connection is dropped when the node it passes through goes away, these connections were being interrupted by the shuffle of other backend services even though our Cloudlink instances were largely unmoved.

We have since changed our load balancers to route TCP Cloudlink traffic only via nodes that have online and ready Cloudlink pods running. We are also in the early stages of enabling UDP connections for this endpoint and will announce more details in the future.
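The routing rule described above can be pictured as a simple filter over candidate nodes. The sketch below is illustrative only and does not reflect our actual load balancer configuration; the `Pod` and `Node` types, the `eligible_nodes` function, and the `"cloudlink"` service label are all hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pod:
    service: str  # hypothetical service label, e.g. "cloudlink" or "api"
    online: bool
    ready: bool


@dataclass(frozen=True)
class Node:
    name: str
    pods: Tuple[Pod, ...]


def eligible_nodes(nodes: List[Node], service: str = "cloudlink") -> List[Node]:
    """Keep only nodes that host at least one online and ready pod for `service`."""
    return [
        node
        for node in nodes
        if any(p.service == service and p.online and p.ready for p in node.pods)
    ]
```

Under this rule, a node that is only running other backend services, or whose Cloudlink pod is not yet ready, is never selected as a hop for Cloudlink traffic, so scaling those services up or down does not sit in the path of established connections.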

Posted Nov 22, 2024 - 14:06 UTC

Resolved

This incident has been resolved.

Posted Nov 20, 2024 - 14:01 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 19, 2024 - 21:24 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Nov 19, 2024 - 20:32 UTC

Investigating

We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.

Posted Nov 19, 2024 - 10:00 UTC

This incident affected: Cloudlink (VPN).