Builder partially degraded service

Incident Report for balena.io

Postmortem

Summary

On November 12, 2025, we experienced an incident that caused an elevated rate of build failures. We traced it to a significant network issue upstream of our hosting provider and outside their direct control. Build success rates returned to normal by 17:15 UTC, and we formally closed the incident at 20:00 UTC after a period of extended monitoring.

What Happened

A network issue between upstream service providers resulted in severe packet loss. This network degradation disrupted connections to our remote builder workers, causing an increase in builds failing with connection timeout errors (e.g., ETIMEDOUT).

Our Response & Mitigation

  • While the upstream network issue was being addressed, we brought additional builder workers online in unaffected regions so that builds could be routed to them and processed successfully.
  • We stayed in contact with our hosting provider, who was monitoring the external network problem.

Resolution

  • Recovery (17:15 UTC): The upstream network provider resolved the issue, and we observed build failure rates and connection quality return to normal levels.
  • Incident Closed (20:00 UTC): After an extended period of monitoring to ensure all systems remained stable, we formally closed the incident.

We apologize for the disruption this caused.

Posted Nov 12, 2025 - 20:09 UTC

Resolved

This incident has been resolved.
Posted Nov 12, 2025 - 19:58 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 12, 2025 - 17:14 UTC

Identified

We have observed packet loss in the connections to our builder workers. We are working with our upstream provider to fix the issue. We have also provisioned a few replacement workers in an alternative location while we mitigate the problem.
Posted Nov 12, 2025 - 15:01 UTC

Investigating

We are currently investigating this issue.
Posted Nov 12, 2025 - 09:13 UTC
This incident affected: Application Builder.