Elevated error rate on the API
Incident Report for Balena.io
Postmortem

On the 10th of May at around 07:10 UTC we had an outage that affected our API for approximately ninety minutes. The outage was triggered by a large fleet update in which devices requested delta images for a multicontainer application.

Due to the fleet size, the Supervisors on the devices in that fleet made a considerable number of requests for delta updates, producing a huge number of token generation requests that queued up in our API instances and overloaded the database. As a result, the delta server was unable to create the deltas because its own API requests were failing. This created a vicious circle in which devices kept retrying the update, generating even more API requests.

We applied a fix that limited the number of token generation requests processed by the API at any one time, which significantly lowered the database load. We also restarted the delta and API services to flush the request queue, after which delta updates began to be generated successfully.
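
For illustration, here is a minimal sketch of the kind of concurrency limiting described above, written in TypeScript. The ConcurrencyLimiter class, the limit of 10, and the generateToken/handleTokenRequest functions are hypothetical placeholders rather than balena's actual implementation; the idea is simply that excess token requests wait in an in-memory queue instead of piling work onto the database.

```typescript
// Hypothetical sketch of capping concurrent token generation work.
class ConcurrencyLimiter {
  private active = 0;
  private queue: Array<() => void> = [];

  constructor(private readonly maxConcurrent: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // Wait in memory until a slot frees up.
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      // Wake the next queued caller, if any.
      this.queue.shift()?.();
    }
  }
}

// Assumed limit of 10 concurrent token generations (illustrative only).
const tokenLimiter = new ConcurrencyLimiter(10);

async function generateToken(deviceId: string): Promise<string> {
  // Placeholder for the real signing/database work.
  return `token-for-${deviceId}`;
}

export function handleTokenRequest(deviceId: string): Promise<string> {
  return tokenLimiter.run(() => generateToken(deviceId));
}
```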

The API fix has now been deployed to production and we do not expect further issues of this nature. Additionally, the device Supervisor has been improved so that generated tokens are cached for ten minutes and reused, rather than a new token being requested every few seconds.
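
As a rough sketch of the Supervisor-side change, the snippet below caches a token and reuses it while it is less than ten minutes old. The getToken and fetchNewToken names are hypothetical; only the ten-minute reuse window comes from this report.

```typescript
// Hypothetical device-side token cache with a ten-minute reuse window.
const TOKEN_TTL_MS = 10 * 60 * 1000;

let cachedToken: string | null = null;
let cachedAt = 0;

async function fetchNewToken(): Promise<string> {
  // Placeholder for the Supervisor's real request to the API.
  return `fresh-token-${Date.now()}`;
}

export async function getToken(): Promise<string> {
  const now = Date.now();
  if (cachedToken !== null && now - cachedAt < TOKEN_TTL_MS) {
    // Token is still fresh: reuse it instead of hitting the API again.
    return cachedToken;
  }
  cachedToken = await fetchNewToken();
  cachedAt = now;
  return cachedToken;
}
```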

Posted May 14, 2019 - 19:18 UTC

Resolved
This incident has been resolved.
Posted May 10, 2019 - 09:45 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 10, 2019 - 08:57 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted May 10, 2019 - 08:40 UTC
Investigating
We are currently investigating this issue.
Posted May 10, 2019 - 08:32 UTC
This incident affected: API and Dashboard.