Elevated error rate on the API, Builder and Delta server
Incident Report for Balena.io
Postmortem

On 21 March 2019, an outage occurred that affected our Delta server.

At approximately 15:48 UTC, database load rose to 100% and stayed there. As with the previous day’s incident, this was quickly traced to the Delta server making a large number of concurrent requests to the API.

We temporarily disabled the Delta servers to reduce load, and opened an incident. Unlike in the prior day’s incident, we saw no API request failures once the Delta servers had been disabled.

Because the work prompted by the previous incident was already complete, we decided to roll out those changes to the Delta server so that we could reinstate it quickly. The changes were deployed at 16:27 UTC, and we immediately saw load on the Delta server increase. Unlike in the previous incidents, however, this did not increase load on the API, confirming that the changes were mitigating the problem.

We continued to monitor the situation, noting that load levels were normal.

The incident was closed at 17:15 UTC, once we were satisfied that the situation had been rectified.

Unlike the previous day’s incident, only the Delta server was affected, and the problem was resolved within half an hour of its initial shutdown.

We are continuing work to further limit the potential for the API to be affected by a large number of requests.
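This report does not describe the specific changes made to the Delta server, so the following is only an illustrative sketch of one common way a client can cap the number of requests it has in flight against an API at any one time. The names `run_bounded`, `worker`, and `max_concurrent` are hypothetical and are not taken from the balena codebase.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: Iterable[T],
    worker: Callable[[T], Awaitable[R]],
    max_concurrent: int,
) -> List[R]:
    """Run `worker` over `items` with at most `max_concurrent` calls in flight.

    Hypothetical helper: `worker` stands in for whatever coroutine issues
    a single request to the API.
    """
    # The semaphore holds `max_concurrent` permits; each call takes one
    # before starting and returns it when it finishes or fails.
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

With a cap like this, a burst of work on the client side queues up locally instead of arriving at the API all at once. The right value for `max_concurrent` depends on how much load the API and its database can absorb.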

Posted Mar 22, 2019 - 09:39 UTC

Resolved
This incident has been resolved.
Posted Mar 21, 2019 - 17:15 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 21, 2019 - 16:36 UTC
Investigating
We are currently investigating this issue.
Posted Mar 21, 2019 - 16:12 UTC
This incident affected: API, Application Builder, and Delta Image Downloads.