Elevated GIT/Application Builder Errors

Incident Report for balena.io

Postmortem

Over the past few months, we’ve been focussed heavily on improving the performance of the balenaCloud backend in order to scale with the growing number of devices joining the platform.

As part of that work, we recently implemented cross-instance metrics throttling, so that the API instances in the cluster are aware of each other when throttling incoming device metrics. However, yesterday we discovered a bug in our original implementation that effectively negated the throttling gains and increased the load on the backend database by a factor of 3-4.
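To illustrate the general idea of cross-instance throttling, here is a minimal sketch. It is not balena's implementation and does not reproduce the actual bug. It assumes a hypothetical shared store with a "set if absent, with expiry" operation (simulated in memory here), and every name in it (`SharedStore`, `InMemoryStore`, `MetricsThrottle`, `countWrites`, the `metrics:` key prefix) is invented for the example.

```typescript
// Hypothetical shared store. In a real cluster this would be an external
// service reachable by every API instance; here it is an in-memory stand-in.
interface SharedStore {
  // Returns true only if the key was absent (or expired) and has now been set.
  setIfAbsent(key: string, ttlMs: number, nowMs: number): boolean;
}

class InMemoryStore implements SharedStore {
  private expiries = new Map<string, number>();

  setIfAbsent(key: string, ttlMs: number, nowMs: number): boolean {
    const expiresAt = this.expiries.get(key);
    if (expiresAt !== undefined && expiresAt > nowMs) {
      return false; // the current window has already been claimed
    }
    this.expiries.set(key, nowMs + ttlMs);
    return true;
  }
}

// One throttle per API instance (hypothetical). Instances coordinate only
// if they are given the same store.
class MetricsThrottle {
  private store: SharedStore;
  private windowMs: number;

  constructor(store: SharedStore, windowMs: number) {
    this.store = store;
    this.windowMs = windowMs;
  }

  // True if this instance should write the device's metrics to the database.
  shouldWrite(deviceId: string, nowMs: number): boolean {
    return this.store.setIfAbsent(`metrics:${deviceId}`, this.windowMs, nowMs);
  }
}

// Counts database writes when reports from one device alternate between
// two API instances, with or without a shared store.
function countWrites(
  shared: boolean,
  reports: number,
  intervalMs: number,
  windowMs: number,
): number {
  const storeA = new InMemoryStore();
  const storeB = shared ? storeA : new InMemoryStore();
  const instanceA = new MetricsThrottle(storeA, windowMs);
  const instanceB = new MetricsThrottle(storeB, windowMs);
  let writes = 0;
  for (let i = 0; i < reports; i++) {
    const instance = i % 2 === 0 ? instanceA : instanceB;
    if (instance.shouldWrite("device-1", i * intervalMs)) {
      writes++;
    }
  }
  return writes;
}

// 12 reports, 10 s apart, with a 60 s throttle window.
console.log("shared store:", countWrites(true, 12, 10_000, 60_000));
console.log("independent stores:", countWrites(false, 12, 10_000, 60_000));
```

In this sketch, throttles that each keep their own state admit one write per window per instance, so the number of database writes grows with the number of instances. Sharing the state keeps it at one write per window for the whole cluster.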

We’ve now deployed a fix that corrects and optimizes the handling of cross-instance metrics throttling across all API backends, and we have observed no further instances of unsustainably high backend load.

Posted Nov 10, 2021 - 15:37 UTC

Resolved

This incident has been resolved.

Posted Nov 10, 2021 - 00:26 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Nov 10, 2021 - 00:17 UTC

Update

We've deployed a fix to the backend ensuring better performance under increased load and are now re-enabling device metrics.

Posted Nov 09, 2021 - 23:00 UTC

Identified

The issue has been identified and a fix is being implemented.

Posted Nov 09, 2021 - 20:52 UTC

Update

We are continuing to investigate this issue.

Posted Nov 09, 2021 - 20:38 UTC

Update

While we continue to investigate, we've temporarily disabled device metrics.

Posted Nov 09, 2021 - 20:38 UTC

Investigating

We're experiencing an elevated level of errors in our application builder infrastructure and are currently looking into the issue.

Posted Nov 09, 2021 - 20:30 UTC

This incident affected: API and Application Builder.