Elevated API Errors

Incident Report for balena.io

Postmortem

What Happened

On January 28, our API experienced degraded performance due to a cascading pod failure triggered by an infrastructure inconsistency.

Root Cause

One node in the cluster was unexpectedly running on an older-generation CPU, which led to uneven load distribution across the API pods. During a routine traffic spike, the pod on that slower node failed its health check and was removed from rotation; its traffic shifted onto the remaining pods, pushing them over their own health-check thresholds in turn and triggering a cascading failure (a "thundering herd" effect).
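
For operators running similar clusters, this class of drift can be guarded against declaratively. The fragment below is a hedged sketch of a pod-template nodeAffinity rule, not our actual manifest; the instance type shown is an assumption for illustration.

```yaml
# Illustrative only: assumptions, not balena's actual manifest.
# Constrain API pods to a single known instance type so a stray
# older-generation node cannot receive pods and skew load.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: node.kubernetes.io/instance-type  # well-known node label
                operator: In
                values:
                  - m5.2xlarge  # assumed instance type, for illustration
```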

Resolution

We resolved the incident by:

  • Scaling up the number of API pods to restore capacity during recovery
  • Removing the infrastructure inconsistency that caused the initial failure
  • Adjusting health check configurations to prevent similar cascading failures (see the probe sketch below)
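
For readers wondering what "adjusting health check configurations" can look like in practice, the sketch below shows Kubernetes probe settings tuned to tolerate brief latency spikes rather than evicting pods at the first slow response. The endpoint paths, port, and timing values are illustrative assumptions, not our production configuration.

```yaml
# Illustrative probe tuning: values are assumptions, not our production config.
livenessProbe:
  httpGet:
    path: /healthz       # assumed health endpoint
    port: 8080           # assumed container port
  periodSeconds: 10
  timeoutSeconds: 5      # tolerate slow responses under load rather than failing fast
  failureThreshold: 6    # require roughly a minute of sustained failures before restart
readinessProbe:
  httpGet:
    path: /ready         # assumed readiness endpoint
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3    # drop from load balancing sooner than restarting
```

Keeping the readiness probe stricter than the liveness probe lets an overloaded pod drop out of load balancing temporarily without being restarted; unnecessary restarts are what turn a brief overload into a cascade.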

We apologize for any disruption this caused and appreciate your patience as we continue improving our platform's resilience.

Posted Jan 29, 2026 - 21:49 UTC

Resolved

This incident has been resolved.
Posted Jan 28, 2026 - 22:06 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 28, 2026 - 20:08 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Jan 28, 2026 - 19:45 UTC

Investigating

We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Jan 28, 2026 - 18:54 UTC
This incident affected: API and Application Builder.