Elevated API Errors

Incident Report for balena.io

Postmortem

On Sept. 10th, around 13:30 UTC, our alerting system reported intermittent elevated API errors. We quickly determined the cause of the incident to be an overly aggressive liveness probe rotating our API pods. A fix was deployed immediately.
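
To illustrate how an overly aggressive liveness probe can rotate pods that are slow but still serving traffic, here is a minimal sketch. It is not our actual probe configuration: the function name `probe_restarts`, the latency samples, and the timeout and threshold values are all hypothetical and chosen only to show the effect.

```python
from typing import Iterable


def probe_restarts(latencies_s: Iterable[float], timeout_s: float,
                   failure_threshold: int) -> int:
    """Count the restarts a liveness probe would trigger.

    Hypothetical model: a probe fails when the response takes longer than
    timeout_s, and the pod is restarted after failure_threshold consecutive
    failures. Any success resets the failure counter.
    """
    restarts = 0
    consecutive_failures = 0
    for latency in latencies_s:
        if latency > timeout_s:
            consecutive_failures += 1
        else:
            consecutive_failures = 0
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # the replacement pod starts clean
    return restarts


# Illustrative probe latencies (seconds) for a pod that is slow under load
# but still answering every request.
latencies = [0.2, 1.5, 1.8, 0.3, 1.6, 1.7, 1.9, 0.4]

# Aggressive settings: short timeout, restart on the first failure.
aggressive = probe_restarts(latencies, timeout_s=1.0, failure_threshold=1)

# More tolerant settings: longer timeout, several failures required.
lenient = probe_restarts(latencies, timeout_s=2.0, failure_threshold=3)

print(aggressive, lenient)
```

With these sample values, the aggressive settings restart the pod on every slow response, while the tolerant settings leave it running. Each restart removes capacity, which in turn can slow the remaining pods further.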

While the API was recovering, an automatic update of the Cloudlink pods occurred. As the API was slower to respond, device reconnections and SSH authentication were slower than usual, but steadily improving.

This apparent recovery masked a separate issue in the Cloudlink update itself, one that severely degraded container performance under high concurrency. The problem appeared only at the scale of the production environment and was completely invisible at the lower scale of our development and testing environments.

Once the concurrency issue had been properly identified, we quickly reverted the update and Cloudlink returned to its expected performance level.

In the aftermath of this incident, we're making a few important changes in our Cloudlink testing, validation, and deployment protocols to better detect and automatically revert issues that would only happen in production.
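
As a rough illustration of what an automatic revert check could look like, the sketch below compares metrics from a post-deployment window against a pre-deployment baseline. It does not describe our actual tooling: the `Window` type, the `should_revert` function, and every threshold are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Hypothetical metrics aggregated over a fixed observation window."""
    requests: int
    errors: int
    p95_latency_s: float


def should_revert(baseline: Window, current: Window,
                  max_error_ratio: float = 2.0,
                  max_latency_ratio: float = 1.5,
                  min_error_rate: float = 0.01,
                  min_requests: int = 100) -> bool:
    """Return True when the new release looks worse than the baseline.

    All thresholds are illustrative assumptions, not tuned values.
    """
    # Too little traffic to judge: do not revert on noise.
    if current.requests < min_requests:
        return False

    base_rate = baseline.errors / max(baseline.requests, 1)
    cur_rate = current.errors / current.requests

    # Error rate must exceed both a relative and an absolute floor.
    error_regressed = cur_rate > max(base_rate * max_error_ratio,
                                     min_error_rate)
    latency_regressed = (current.p95_latency_s
                         > baseline.p95_latency_s * max_latency_ratio)
    return error_regressed or latency_regressed


baseline = Window(requests=10_000, errors=20, p95_latency_s=0.30)
healthy = Window(requests=10_000, errors=25, p95_latency_s=0.32)
degraded = Window(requests=10_000, errors=900, p95_latency_s=2.40)

print(should_revert(baseline, healthy), should_revert(baseline, degraded))
```

Because the check runs against production traffic, it can catch regressions that depend on production-level concurrency and never show up in smaller environments.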

Posted Sep 19, 2025 - 09:38 UTC

Resolved

This incident has been resolved.
Posted Sep 10, 2025 - 22:08 UTC

Update

All tests are passing and devices are stable. We'll keep monitoring.
Posted Sep 10, 2025 - 21:26 UTC

Update

We are continuing to monitor for any further issues.
Posted Sep 10, 2025 - 21:25 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 10, 2025 - 21:19 UTC

Identified

Most devices have reconnected, but we're still experiencing an elevated level of device SSH errors and are currently looking into the issue. Connections that do go through are slow. This is also affecting the public URL feature.
Posted Sep 10, 2025 - 19:29 UTC

Update

Most devices have reconnected, but we're still experiencing an elevated level of device SSH errors and are currently looking into the issue. Connections that do go through are slow. This is also affecting the public URL feature.
Posted Sep 10, 2025 - 18:55 UTC

Update

It appears that the Cloudlink status of many devices didn't update when they disconnected.
If your device has appeared connected to Cloudlink for more than 1 hour, its status is wrong and it has not yet reconnected (it will reconnect soon).
We're still monitoring the recovery.
Posted Sep 10, 2025 - 15:46 UTC

Update

During the incident, a lot of devices disconnected from Cloudlink.
They're slowly but steadily reconnecting.
We are monitoring the recovery.
Posted Sep 10, 2025 - 15:37 UTC

Update

We are continuing to monitor for any further issues.
Posted Sep 10, 2025 - 14:34 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 10, 2025 - 14:25 UTC

Investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Sep 10, 2025 - 13:47 UTC
This incident affected: API, BalenaOS Download, Dashboard, Device URLs, Cloudlink (VPN), and balenahub.