We observed degraded Cloudlink (VPN) connections following several consecutive API release deployments. The disruptions were spread out over the course of a day and took some time to settle, though they recovered without any manual intervention. This pattern, in which thousands of devices attempt to connect to a new node at the same time and get rate limited, is generally referred to as a "thundering herd."
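One common client-side mitigation for a thundering herd is to spread reconnect attempts out with exponential backoff and jitter, so that devices disconnected at the same moment do not all retry in lockstep. The sketch below is illustrative only and not part of Cloudlink; `backoff_delay` is a hypothetical helper using the "full jitter" strategy.

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff.

    Returns a random delay in [0, min(cap, base * 2**attempt)] seconds,
    so a fleet of clients reconnecting after the same event spreads its
    retries out instead of hitting the new node simultaneously.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))

# Delays a single client might wait before reconnect attempts 0..5;
# the ceiling doubles each attempt until it hits the 30-second cap.
delays = [backoff_delay(n) for n in range(6)]
```

A server-side rate limit still acts as the backstop; the jitter simply keeps well-behaved clients from tripping it all at once.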
Upon investigation, we found that at peak usage the load balancing policies for our TCP Cloudlink connections were not optimized to avoid proxying through nodes that were scaling up or down during deploys. Because TCP connections are long-lived and stateful, even though our Cloudlink instances themselves were largely unaffected, the proxied TCP connections were interrupted by the churn of other backend services on those nodes.
We have since updated our load balancers to route TCP Cloudlink traffic only via nodes that are running online, ready Cloudlink pods. We are also in the early stages of enabling UDP connections for this endpoint and will share more details in the future.
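In Kubernetes terms, one way to express this routing policy is a Service with `externalTrafficPolicy: Local`, which makes the external load balancer health-check each node and forward traffic only to nodes hosting a ready pod. This is a hedged sketch, not our actual configuration; the service name, selector, and ports are illustrative.

```yaml
# Illustrative sketch: keep external TCP traffic off nodes without a
# ready Cloudlink pod. With externalTrafficPolicy: Local, traffic is
# never proxied through a node that merely forwards to a pod elsewhere,
# so node churn during deploys does not interrupt unrelated connections.
apiVersion: v1
kind: Service
metadata:
  name: cloudlink-tcp        # hypothetical name
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: cloudlink           # hypothetical label
  ports:
    - name: tcp
      protocol: TCP
      port: 443              # illustrative port numbers
      targetPort: 8443
```

"Ready" here is gated by the pods' readiness probes, so a pod that is starting up or draining drops out of the load balancer before it can receive connections.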