Elevated error rate on the API, Builder and Delta server
Incident Report for Balena.io
Postmortem

On the 20th of March, an outage affected our API, which in turn affected some of the services we provide.

Around 14:14 UTC, database load increased to 100%. Whilst this can happen occasionally, in this case the load did not drop; by 14:34 UTC it was clear that the load was abnormal. This started to have knock-on effects in the rest of the system, as the API was unable to process requests in a timely manner. This meant:

* API response times started to rise, and in some cases began to time out

* Delta downloads started to queue up

* Other parts of the service requiring authorisation from the API, such as the builder, were not able to receive a response in a timely manner and themselves started timing out

We determined that load on the Delta servers had spiked, causing a large number of requests to be made to the API. Having identified the cause, at 14:46 UTC we decided to temporarily take the Delta servers offline to reduce API load and allow the API to recover. However, because of the number of backed-up transactions, including those from devices attempting to determine the current state, the load remained high. At this point, we created a new incident to alert customers to the issues we were seeing.

We then started to clear the pending queries by cycling through the API instances and restarting each one, which we completed around 15:14 UTC. The number of queries started to fall and by 15:35 UTC the system had stabilised. We updated the incident report accordingly. At around 15:55 UTC, database load had reduced significantly, and we re-enabled the Delta servers. We continued to monitor the system, but did not see a recurrence.

Much like a previous incident in January, the trigger for this outage has been traced to a large device fleet making delta requests for a new image release. After that incident we made a number of major changes to further decouple load on the Delta server from the API, including a limit on the number of requests made by the Delta server, but it appears that we misjudged the limit that we could impose.

In response to this incident, we have carried out work that further limits the number of concurrent requests made to the API by the Delta server, and added a delay mechanism that reduces the number of requests further.
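
As an illustration only, the general technique of capping concurrent requests and spacing them out can be sketched as follows. This is not the Delta server's actual code: the class name `ThrottledClient`, the parameters `max_concurrent` and `min_delay`, and the stand-in `fake_api_request` are all hypothetical, and the values used are arbitrary.

```python
import asyncio


class ThrottledClient:
    """Hypothetical wrapper: caps in-flight calls and spaces out their start times."""

    def __init__(self, max_concurrent: int, min_delay: float):
        # Construct inside a running event loop so the primitives bind to it.
        self._slots = asyncio.Semaphore(max_concurrent)  # concurrency cap
        self._pace_lock = asyncio.Lock()                 # serialises the pacing step
        self._min_delay = min_delay                      # seconds between request starts
        self._next_start = 0.0
        self.in_flight = 0
        self.peak_in_flight = 0

    async def call(self, request):
        """Run the zero-argument coroutine function `request` under both limits."""
        async with self._slots:
            async with self._pace_lock:
                loop = asyncio.get_running_loop()
                wait = self._next_start - loop.time()
                if wait > 0:
                    # Delay mechanism: hold this request until its start time.
                    await asyncio.sleep(wait)
                self._next_start = loop.time() + self._min_delay
            self.in_flight += 1
            self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
            try:
                return await request()
            finally:
                self.in_flight -= 1


async def fake_api_request(i: int) -> int:
    """Stand-in for a real API call (assumption for the demo)."""
    await asyncio.sleep(0.01)
    return i


async def main():
    client = ThrottledClient(max_concurrent=3, min_delay=0.005)
    results = await asyncio.gather(
        *(client.call(lambda i=i: fake_api_request(i)) for i in range(10))
    )
    return client, results


if __name__ == "__main__":
    client, results = asyncio.run(main())
    print(results, "peak in flight:", client.peak_in_flight)
```

The semaphore bounds how many requests can be outstanding at once, while the pacing step bounds how quickly new ones can start; a burst from a large fleet then queues on the caller's side instead of arriving at the API all at once.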

Posted Mar 22, 2019 - 09:38 UTC

Resolved
This incident has been resolved.
Posted Mar 20, 2019 - 16:24 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 20, 2019 - 16:05 UTC
Investigating
We are currently investigating this issue.
Posted Mar 20, 2019 - 14:50 UTC
This incident affected: API, Application Builder, and Delta Image Downloads.