Elevated GIT/Application Builder Errors
Incident Report for balena.io
Postmortem

The incident was discovered because builders were crash-looping, with error codes related to long strings from large buffer sizes. The crash loops were quickly linked to an accumulation of uncleaned images on the builder instances.

A manual volume prune was executed to remove images and temporarily recover builder performance.

A patch is awaiting review that lowers the maximum storage the builder workers may consume before automatic cleanup is triggered. Projects have also been proposed to patch the upstream libraries and/or move to ephemeral builder workers.
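To illustrate the idea of a storage ceiling that triggers cleanup, here is a minimal sketch in Python. It is not the builder's actual code or the patch under review: the `CachedImage` record, the `select_images_to_remove` function, and the choice to evict least recently used images first are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedImage:
    """Hypothetical record for an image stored on a builder worker."""
    name: str
    size_bytes: int
    last_used: float  # Unix timestamp of the last build that used the image


def select_images_to_remove(images, used_bytes, max_bytes):
    """Pick least recently used images until usage is at or below max_bytes.

    Returns the images to delete, oldest first. Nothing is deleted here;
    the caller decides how to remove them.
    """
    to_remove = []
    for image in sorted(images, key=lambda img: img.last_used):
        if used_bytes <= max_bytes:
            break
        to_remove.append(image)
        used_bytes -= image.size_bytes
    return to_remove
```

Lowering `max_bytes` makes cleanup start earlier, which leaves more headroom before a worker runs out of space.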

We also have a project in the queue to add host metrics monitoring to the builder-worker nodes to catch issues like this earlier.
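As a rough sketch of what such a host-level check could look like, the Python snippet below reads disk usage with the standard library's `shutil.disk_usage` and maps it to an alert level. The function names and the 80% and 90% thresholds are hypothetical and are not taken from the planned monitoring project.

```python
import shutil


def disk_usage_fraction(path="/"):
    """Return the fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def classify_disk_usage(fraction, warn_at=0.80, critical_at=0.90):
    """Map a usage fraction to an alert level; thresholds are illustrative only."""
    if fraction >= critical_at:
        return "critical"
    if fraction >= warn_at:
        return "warning"
    return "ok"
```

A monitoring agent could call `classify_disk_usage(disk_usage_fraction("/"))` on a schedule and raise an alert on anything other than `"ok"`, well before builders start failing.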

Posted Apr 17, 2024 - 20:28 UTC

Resolved
This incident has been resolved.
Posted Apr 17, 2024 - 20:23 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 17, 2024 - 20:17 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Apr 17, 2024 - 19:05 UTC
Investigating
We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.
Posted Apr 17, 2024 - 18:17 UTC
This incident affected: Application Builder.