Partial API outage
Incident Report for Balena.io
Postmortem

Overview

During a large fleet update requesting delta images for a multicontainer application, our service was disrupted by high database load. Our system currently computes deltas lazily, as devices request them. As a result, devices often need to make multiple requests before a delta is ready, and during this time they keep changing their reported status, which can cause a high volume of database write traffic.

In this case, the high write traffic prevented the deltas from being marked as successful, which in turn prevented devices from starting their downloads. This led to further status update requests from the devices, compounding the problem.
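
For illustration, the request pattern looks roughly like the sketch below. This is not the actual supervisor or API code; the endpoint paths, payload fields and retry interval are assumptions:

    // Hypothetical device-side loop during an update. Each poll also reports
    // device state, so while a delta is still being generated every waiting
    // device keeps producing database writes.
    async function waitForDelta(apiUrl: string, srcImage: string, destImage: string): Promise<string> {
      for (;;) {
        // Report current status -- one database write per device per poll.
        await fetch(`${apiUrl}/device/state`, {
          method: 'PATCH',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({ status: 'Downloading', progress: 0 }),
        });

        // Ask for the delta; the server starts computing it lazily on the first request.
        const res = await fetch(
          `${apiUrl}/delta?src=${encodeURIComponent(srcImage)}&dest=${encodeURIComponent(destImage)}`,
        );
        if (res.status === 200) {
          const { location } = await res.json();
          return location; // Delta ready: the device can start downloading.
        }

        // Delta not ready yet; retry after a delay.
        await new Promise((resolve) => setTimeout(resolve, 30_000));
      }
    }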

Resolution

The system eventually recovered after the deltas successfully completed and devices started downloading.

Mitigations

In light of this issue, we are immediately implementing two short-term optimizations that will alleviate the database load during downloads.

1. Optimize the amount of data we write to the database on every update. Specifically, only write the columns that have changed instead of issuing redundant updates (see the first sketch after this list). This will reduce the write load.

2. Optimize the permission resolution performed when a device requests to download an image, so the required data is fetched from the database in a single query (see the second sketch after this list). This will reduce overall database load.
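
A minimal sketch of the first optimization, assuming a relational device table and a generic SQL client; the table and column names are illustrative, not our actual schema:

    // Hypothetical helper: build an UPDATE that only touches the columns whose
    // values differ from what the device last reported. Column names come from
    // a fixed set of known state fields, never from raw user input.
    type DeviceState = Record<string, string | number | null>;

    function buildPartialUpdate(
      deviceId: number,
      previous: DeviceState,
      reported: DeviceState,
    ): { sql: string; params: unknown[] } | null {
      const changed = Object.keys(reported).filter((col) => reported[col] !== previous[col]);
      if (changed.length === 0) {
        return null; // Nothing changed: skip the write entirely.
      }
      const assignments = changed.map((col, i) => `${col} = $${i + 1}`).join(', ');
      return {
        sql: `UPDATE device SET ${assignments} WHERE id = $${changed.length + 1}`,
        params: [...changed.map((col) => reported[col]), deviceId],
      };
    }

Skipping the write entirely when nothing changed removes most of the redundant load, since many status reports simply repeat the previous values.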
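
And a sketch of the second optimization, collapsing the separate device, application and image lookups into one round trip; the join structure and table names below are assumptions for illustration:

    // Hypothetical single query that resolves whether a device may download a
    // given image, and where the image lives, in one database round trip.
    const permissionQuery = `
      SELECT d.id AS device_id, r.id AS release_id, i.location AS image_location
      FROM device d
      JOIN application a ON a.id = d.belongs_to_application
      JOIN release r ON r.belongs_to_application = a.id
      JOIN image i ON i.is_part_of_release = r.id
      WHERE d.uuid = $1 AND i.id = $2
    `;

    async function resolveDownload(
      db: { query(sql: string, params: unknown[]): Promise<{ rows: any[] }> },
      deviceUuid: string,
      imageId: number,
    ) {
      const { rows } = await db.query(permissionQuery, [deviceUuid, imageId]);
      return rows[0] ?? null; // null: the device is not allowed to fetch this image.
    }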

On top of these, we are also making deeper changes to the system so that it eagerly computes results we know we are going to need. Specifically, for applications that have deltas enabled, we will move delta generation into the build process. This way, by the time devices are notified of the update, the delta will already be available.
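
As a rough illustration of where this fits, here is a hypothetical post-build hook; the types and the generateDelta function are assumptions, not our builder's actual interface:

    // Hypothetical post-build step: as soon as a new release is built, generate
    // deltas against the releases currently running in the fleet, so the deltas
    // exist before any device asks for them.
    interface Release {
      id: number;
      images: Record<string, string>; // service name -> image reference
    }

    async function onReleaseBuilt(
      newRelease: Release,
      runningReleases: Release[],
      generateDelta: (src: string, dest: string) => Promise<void>,
    ): Promise<void> {
      for (const oldRelease of runningReleases) {
        for (const [service, dest] of Object.entries(newRelease.images)) {
          const src = oldRelease.images[service];
          if (src && src !== dest) {
            await generateDelta(src, dest); // Pre-compute instead of waiting for device requests.
          }
        }
      }
    }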

Timeline

17:24 UTC Fleet update started

17:28 UTC Deltas started being processed

17:30 UTC Devices started requesting the deltas from our API, causing the database load to go up. At this point there was no notable service degradation apart from slightly increased latency.

17:49 UTC The write load spiked even higher, causing database queries to stall or time out

18:01 UTC First deltas successfully computed

18:27 UTC Final deltas successfully completed

18:35 UTC The request queues had cleared and the system was operational

Posted Jun 12, 2019 - 18:20 UTC

Resolved
This incident has been resolved.
Posted Jun 11, 2019 - 18:56 UTC
Investigating
We are currently investigating this issue.
Posted Jun 11, 2019 - 18:15 UTC
This incident affected: API and Cloudlink (VPN).