On the 10th of May at around 07:10 UTC we had an outage that affected our API for a period of approximately ninety minutes. This occurred due to a large fleet update requesting delta images for a multicontainer application.
Due to the fleet size, the Supervisors on the respective devices made a considerable number of requests for delta updates, resulting in a huge number of token generation requests, which became queued up in our API instances and overloaded the database. Unfortunately, this meant that the delta server was unable to create the delta due to failing API requests. This resulted in a 'vicious circle' where devices continued to retry the update, causing more API requests.
We applied a fix which limited the number of token generation requests made to the API at any one time, which significantly lowered the database load. We also restarted the delta and API services to flush the request queue, which immediately caused the delta updates to be generated successfully.
The API fix has now been officially been deployed to Production and we do not expect further issues of this nature. Additionally, the device Supervisor has been improved to ensure that generated tokens are cached for a period of ten minutes and reused, rather than requesting one every few seconds.