Builder partially degraded service

Incident Report for balena.io

Postmortem

The core problem stemmed from processing large files in the build context, which led to increased memory usage and, at times, network latency between regions.

Summary of Events

  • Investigation and Identification (September 16, 2025): We initially identified the issue as excessive memory consumption on our builder services.
  • Temporary Mitigation (September 23, 2025): We detected that network latency, particularly with larger build contexts, was contributing to the problem. We deployed temporary servers to mitigate this regional impact.
  • Resolution (September 25, 2025): The issue was resolved by raising memory allocations on the builder services so they could handle the larger build context files, which addressed the buffer memory issues.

Permanent Fixes and Structural Changes

To ensure long-term stability and prevent recurrence, our team implemented a series of comprehensive structural improvements focused on reducing memory usage, improving stream handling, and modernizing our codebase:

  • Improved Build Stream Handling: The most significant fix involved updating the system to buffer individual task build streams via the filesystem rather than memory. This leverages disk space, which scales much better and more cheaply than memory, to massively reduce resident set size (RSS) memory usage.
  • Streamlining and Modernization: We undertook a broad effort to update and optimize how data streams are processed, including switching to native Node.js stream implementations (stream.promises.pipeline, Transform, PassThrough) to replace older, archived dependencies, which reduces the likelihood of memory leaks and edge cases.
  • Other Enhancements: We introduced various performance tweaks, such as avoiding reprocessing contract headers, improving error forwarding for stream failures (fixing a memory leak), and using model info to improve typings across the system.
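
As an illustration of the first two points, here is a minimal sketch of spooling a stream to disk with the native Node.js pipeline helper instead of holding it in memory. It is not the builder's actual code: the names bufferToDisk and ByteCounter, the temporary directory prefix, and the file name are hypothetical and chosen for this example only.

```typescript
import { createReadStream, createWriteStream } from 'node:fs';
import { mkdtemp, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { Readable, Transform, TransformCallback } from 'node:stream';
import { pipeline } from 'node:stream/promises';

// Hypothetical pass-through Transform that counts the bytes flowing by.
class ByteCounter extends Transform {
  bytes = 0;

  _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback): void {
    this.bytes += chunk.length;
    callback(null, chunk); // forward the chunk unchanged
  }
}

interface DiskBuffer {
  stream: Readable;             // replays the spooled data from disk
  bytes: number;                // total bytes written to the spool file
  cleanup: () => Promise<void>; // removes the temporary directory
}

// Hypothetical helper: write the whole source stream to a temporary file,
// so only one chunk at a time is held in memory, then expose it for reading.
async function bufferToDisk(source: Readable): Promise<DiskBuffer> {
  const dir = await mkdtemp(join(tmpdir(), 'build-stream-'));
  const file = join(dir, 'task.buf');
  const counter = new ByteCounter();
  const cleanup = () => rm(dir, { recursive: true, force: true });

  try {
    // pipeline() applies backpressure and destroys every stream on failure,
    // so an error in any stage rejects here instead of leaking resources.
    await pipeline(source, counter, createWriteStream(file));
  } catch (err) {
    await cleanup();
    throw err;
  }

  return { stream: createReadStream(file), bytes: counter.bytes, cleanup };
}
```

A caller would read from the returned stream and call cleanup() once it has been fully consumed. In this sketch, memory use is bounded by the stream chunk size rather than by the size of the data, at the cost of an extra write to and read from disk.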

We realize the recent incidents can be very disruptive and understand your frustration. Each of these incidents as well as their cause and steps to resolution have been documented on our status page. You can view the recent ones here on the history page with links to the details about each event: https://status.balena.io/history.

As you may know, the recent frequency of incidents is historically unusual for us. Our DevOps team worked overtime to resolve the issues as they arose. Our general process is not only to resolve each incident, but also to involve the broader team in making careful structural changes as preventative measures. For instance, we are reducing our reliance on hosting partners that may suffer from regional network issues.

Thank you for your patience.

Posted Oct 07, 2025 - 16:39 UTC

Resolved

Large files in the build context caused increased memory usage on builder-services. Higher memory allocations have been implemented, resolving the buffer memory issues.
Posted Sep 25, 2025 - 08:57 UTC

Update

We have detected that network latency between different regions is causing issues on build context processing for build contexts containing larger files. We will deploy new servers temporarily to mitigate the impact while we work on a permanent fix.
Posted Sep 23, 2025 - 10:27 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 16, 2025 - 17:26 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Sep 16, 2025 - 17:01 UTC

Investigating

We are currently investigating this issue.
Posted Sep 16, 2025 - 02:33 UTC
This incident affected: Application Builder.