On Sept. 10th, around 13:30 UTC, our alerting system reported intermittent elevated API errors. We quickly determined the cause of the incident to be an overly aggressive liveness probe that was repeatedly restarting our API pods. A fix was deployed immediately.
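For illustration, here is a minimal sketch of the kind of change involved, using the official Kubernetes Python client to relax a liveness probe so that transient slowness no longer gets healthy pods restarted. The deployment name, namespace, container name, health endpoint, and threshold values are placeholders, not our actual configuration.

```python
# Sketch: patch a Deployment with a more forgiving liveness probe.
# All names and numbers below are illustrative, not our real manifest.
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "api",  # hypothetical container name
                        "livenessProbe": {
                            "httpGet": {"path": "/healthz", "port": 8080},
                            "initialDelaySeconds": 30,
                            "periodSeconds": 10,
                            # Longer timeout: a slow response is no longer
                            # immediately counted as a failure.
                            "timeoutSeconds": 5,
                            # Several consecutive misses required before the
                            # kubelet restarts the container.
                            "failureThreshold": 6,
                        },
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(name="api", namespace="default", body=patch)
```

The general idea is to make the probe tolerant of brief latency spikes while still catching a genuinely hung process.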
While the API was recovering, an automatic update of the Cloudlink pods occurred. Because the API was still slower to respond, device reconnections and SSH authentication were slower than usual, though steadily improving.
This apparent recovery was masking a different issue in the Cloudlink update itself: one that severely degraded container performance under high concurrency. The problem only became apparent at the scale of the production environment and was completely invisible at the lower scale of our development and testing environments.
Once the concurrency issue had been properly identified, we quickly reverted the update and Cloudlink returned to its expected performance level.
In the aftermath of this incident, we are making a few important changes to our Cloudlink testing, validation, and deployment protocols to better detect and automatically revert issues that only appear in production.
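As one example of what "automatically revert" could look like, the sketch below watches the production error rate after a Cloudlink rollout and rolls the deployment back if it crosses a threshold. The metrics endpoint, query, deployment name, and threshold are illustrative assumptions, not our actual pipeline.

```python
# Sketch of an automated post-rollout guardrail: if the 5xx error ratio stays
# above a threshold during the observation window, undo the rollout.
# URLs, names, and numbers are placeholders.
import subprocess
import time

import requests

METRICS_URL = "http://prometheus.internal/api/v1/query"  # hypothetical endpoint
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{job="cloudlink",code=~"5.."}[5m]))'
    ' / sum(rate(http_requests_total{job="cloudlink"}[5m]))'
)
ERROR_RATE_THRESHOLD = 0.02      # 2% errors after a rollout triggers a revert
OBSERVATION_WINDOW_S = 15 * 60   # watch the rollout for 15 minutes
CHECK_INTERVAL_S = 60


def current_error_rate() -> float:
    """Fetch the current 5xx error ratio from the metrics backend."""
    resp = requests.get(METRICS_URL, params={"query": ERROR_RATE_QUERY}, timeout=10)
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    return float(results[0]["value"][1]) if results else 0.0


def watch_and_rollback(deployment: str = "cloudlink", namespace: str = "prod") -> None:
    """Watch the error rate after a rollout; revert the deployment if it regresses."""
    deadline = time.time() + OBSERVATION_WINDOW_S
    while time.time() < deadline:
        if current_error_rate() > ERROR_RATE_THRESHOLD:
            subprocess.run(
                ["kubectl", "rollout", "undo",
                 f"deployment/{deployment}", "-n", namespace],
                check=True,
            )
            return
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    watch_and_rollback()
```

The key point is that the rollback decision is driven by production metrics rather than by pre-deployment tests alone, which is exactly the class of issue this incident exposed.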