The core problem stemmed from processing large files in the build context, which led to increased memory usage and, at times, network latency between regions.
Summary of Events
- Investigation and Identification (September 16, 2025): We initially identified the issue as excessive memory consumption on our builder services.
- Temporary Mitigation (September 23, 2025): We detected that network latency, particularly with larger build contexts, was contributing to the problem. We deployed temporary servers to mitigate this regional impact.
- Resolution (September 25, 2025): The issue was resolved by increasing memory allocations on the builder services so they could handle larger build context files and the associated stream buffering overhead.
Permanent Fixes and Structural Changes
To ensure long-term stability and prevent recurrence, our team implemented a series of comprehensive structural improvements focused on reducing memory usage, improving stream handling, and modernizing our codebase:
- Improved Build Stream Handling: The most significant fix involved updating the system to buffer individual task build streams via the filesystem rather than in memory. Disk space scales far better and more cheaply than memory, so this change dramatically reduces the builders' RSS memory usage (see the first sketch after this list).
- Streamlining and Modernization: We undertook a broad effort to update and optimize how data streams are processed, including switching to native Node.js stream implementations (stream.promises.pipeline, Transform, PassThrough) in place of older, archived dependencies. This reduces the likelihood of memory leaks and stream-handling edge cases (see the second sketch after this list).
- Other Enhancements: We introduced various performance tweaks, such as avoiding reprocessing contract headers, improving error forwarding for stream failures (fixing a memory leak), and using model info to improve typings across the system.
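As a rough illustration of the filesystem-buffering approach described above (the function and file names here are hypothetical, not our actual implementation), a task's build output can be spooled to a temporary file and replayed later, so only a small fixed buffer ever sits in memory:

```ts
import { createReadStream, createWriteStream } from 'node:fs';
import { mkdtemp } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { pipeline } from 'node:stream/promises';
import type { Readable } from 'node:stream';

// Hypothetical helper: spool a task's build stream to disk instead of
// accumulating it in memory, then return a readable stream over the file.
async function bufferStreamToDisk(taskId: string, buildStream: Readable): Promise<Readable> {
  const dir = await mkdtemp(join(tmpdir(), 'build-'));
  const filePath = join(dir, `${taskId}.log`);

  // pipeline() forwards errors from either side and cleans up both streams,
  // avoiding the leaked listeners and buffers that manual .pipe() can cause.
  await pipeline(buildStream, createWriteStream(filePath));

  // Replay from disk on demand; RSS stays bounded regardless of stream size.
  return createReadStream(filePath);
}
```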
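In the same spirit, here is a minimal sketch of the native-stream pattern referred to above (again illustrative, not the actual code): a Transform stage processes chunks one at a time in constant memory, and pipeline() propagates a failure from any stage so errors are forwarded rather than silently swallowed:

```ts
import { Transform } from 'node:stream';
import { pipeline } from 'node:stream/promises';
import { createReadStream, createWriteStream } from 'node:fs';

// Illustrative Transform: tag each chunk with its task id as it passes
// through, processing chunk by chunk rather than holding the whole stream.
function tagChunks(taskId: string): Transform {
  return new Transform({
    transform(chunk, _encoding, callback) {
      callback(null, `[${taskId}] ${chunk.toString()}`);
    },
  });
}

async function processBuildLog(input: string, output: string): Promise<void> {
  try {
    await pipeline(
      createReadStream(input),
      tagChunks('task-1'),
      createWriteStream(output),
    );
  } catch (err) {
    // pipeline() rejects if any stage fails and destroys all streams,
    // so the failure surfaces here instead of leaving dangling buffers.
    console.error('build stream failed:', err);
    throw err;
  }
}
```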
We realize the recent incidents can be very disruptive and understand your frustration. Each of these incidents, along with its cause and the steps taken to resolve it, has been documented on our status page. You can review recent events, with links to the details of each one, on the history page: https://status.balena.io/history.
As you may know, the recent frequency of incidents is historically unusual for us. Our DevOps team worked overtime to resolve the issues as they arose. Our general process is not only to resolve incidents, but also to involve the broader team in making careful structural changes as preventative measures. For instance, we are reducing our reliance on hosting partners that may suffer from regional network issues.
Thank you for your patience.