Issue with device pinning
Incident Report for Balena.io
Postmortem

Overview

On Monday, September 23rd, our system experienced data corruption as the result of a faulty database change. This caused a large number of previously pinned devices to become unpinned and track the release targeted by their application, which for most applications meant the latest release. Of those devices, 20% “successfully” updated and ran code that was not intended for them. The remaining 80% were unaffected, although they may still have caused increased bandwidth usage.

Our team restored the database from backup, and to our knowledge no data loss has occurred. Shortly after the database was recovered, the affected devices re-synced with the appropriate releases. A very small number of devices experienced local data loss due to a Docker volume mismatch between the pinned and latest releases. We will be contacting the affected customers directly about this issue.

We understand that this incident is very painful for our customers. Our testing procedures were unable to catch this issue in our staging environment. We are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.

Detailed analysis

We regularly deploy modifications to our database schema, ensuring that all changes are backwards compatible. Traditionally, we have migrated all of the affected data in a single transaction. While this simple approach has served us well, as our system grows, locking tables for long periods results in unacceptable downtime.
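
For illustration, the single-transaction pattern looks roughly like the sketch below. The table and column names (device, new_field, legacy_field) and the use of psycopg2 are assumptions for the example, not our actual migration code; the point is that one statement rewrites every row, so locks are held until the whole table has been processed.

    import psycopg2  # assumption: a PostgreSQL client; any SQL database behaves similarly here


    def migrate_all_at_once(dsn):
        """Traditional approach: one big UPDATE inside a single transaction.
        The rows being rewritten stay locked until the transaction commits."""
        conn = psycopg2.connect(dsn)
        try:
            with conn:                      # one transaction: commit on success, rollback on error
                with conn.cursor() as cur:
                    # Hypothetical backfill of a new column from an old one.
                    # Every row is rewritten in a single statement, so locks
                    # are held for the full duration of the table scan.
                    cur.execute("UPDATE device SET new_field = legacy_field")
        finally:
            conn.close()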

This time we deployed the data migration with a different strategy: the schema passed through an intermediate state during which we migrated the data in small batches. Unfortunately, the data change was incorrect and pinned every device to its current release. On its own this is not a problem, because freezing devices to their currently running release causes no service disruption and triggers no updates. After noticing the error, an operator started a custom rollback to undo the changes. Due to the urgency of the situation, the rollback did not go through the regular review process, and it unpinned every device on the platform. To make matters worse, the incorrect unpinning went unnoticed until we became aware of high load on our system.
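
The batched strategy we used this time looks, in simplified form, like the following sketch, under the same hypothetical schema as above. Each batch runs in its own short transaction, so locks are held only briefly and the table stays available throughout the migration.

    import psycopg2  # assumption: a PostgreSQL client, as in the previous sketch


    def migrate_in_batches(dsn, batch_size=1000):
        """Intermediate-state strategy: migrate a bounded number of rows per
        transaction so that locks are only held briefly."""
        conn = psycopg2.connect(dsn)
        try:
            while True:
                with conn:                  # one short transaction per batch
                    with conn.cursor() as cur:
                        cur.execute(
                            """
                            UPDATE device
                               SET new_field = legacy_field
                             WHERE id IN (
                                   SELECT id FROM device
                                    WHERE new_field IS NULL
                                    LIMIT %s)
                            """,
                            (batch_size,),
                        )
                        if cur.rowcount == 0:   # nothing left to migrate
                            return
        finally:
            conn.close()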

Our database backup system can restore the database with sub-minute granularity, which allowed our operators to pick snapshots from immediately before and immediately after the faulty changes. Using these two datasets along with the current state of the system, we were able to perform a “3-way merge”: undoing the faulty records while preserving any legitimate device pins made during the incident window.
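
Conceptually, the recovery is a per-device three-way merge over three states: the snapshot from just before the faulty changes, the snapshot from just after them, and the live data at recovery time. The sketch below shows the decision rule over plain dictionaries rather than the real tables, and glosses over edge cases the actual recovery had to handle: values still carrying the faulty change are rolled back to the pre-incident snapshot, while pins that changed legitimately during the window are kept.

    def three_way_merge(before, after_faulty, current):
        """Decide the correct pin for each device.

        before       -- {device_id: pinned_release} just before the faulty changes
        after_faulty -- {device_id: pinned_release} just after the faulty changes
        current      -- {device_id: pinned_release} at recovery time

        Returns the state to write back: faulty values are rolled back to the
        pre-incident snapshot, while pins legitimately changed during the
        incident window are preserved.
        """
        restored = {}
        for device_id, now in current.items():
            was = before.get(device_id)
            faulty = after_faulty.get(device_id)
            if now == faulty and now != was:
                # Still carrying the bad value introduced by the faulty change:
                # restore the pre-incident pin.
                restored[device_id] = was
            else:
                # Either untouched, or changed legitimately during the window: keep it.
                restored[device_id] = now
        return restored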

Mitigation

It is clear that our current testing process and staging deployments are not able to catch errors like this. In light of this incident, we will invest in better tooling for running migrations, including running every migration against a snapshot of the production database and checking that a set of invariants still holds after both the forward and the backward step. This incident was particularly worrisome because the forward migration, although faulty, did not manifest any issues: a sanity check of the database afterwards would have passed. It was the rollback of the migration that was destructive.
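
As a sketch of the kind of check we have in mind, the following runs a migration forward and backward against a disposable copy of production data and asserts a pin-preservation invariant after each step; the function names, schema, and invariant are illustrative rather than our actual tooling.

    def snapshot_pins(conn):
        """Read the current device pins (hypothetical table/column names)."""
        with conn.cursor() as cur:
            cur.execute("SELECT id, pinned_release_id FROM device")
            return dict(cur.fetchall())


    def check_migration(conn, migrate_up, migrate_down):
        """Run a migration forward and then backward against a disposable copy
        of production data, asserting that device pins are untouched at every
        step (for a migration that is not supposed to change them)."""
        baseline = snapshot_pins(conn)

        migrate_up(conn)
        # A pin-preservation invariant would have flagged the faulty forward
        # step, which silently pinned every device to its running release.
        assert snapshot_pins(conn) == baseline, "forward migration altered device pins"

        migrate_down(conn)
        # It would also have caught the destructive rollback, which unpinned
        # every device on the platform.
        assert snapshot_pins(conn) == baseline, "rollback altered device pins"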

Additionally, we are making changes in our communication procedures to make handling of incidents like this more timely.

Timeline

All times are in UTC.

2019-09-23

  • 22:12 Started the data migration
  • 22:14 All devices were pinned to their currently running release
  • 22:16 An overly aggressive rollback, intended to undo the previous pinning, unpinned all devices
  • 22:20 High load alerts started coming in, initially pointing to delta generation issues

2019-09-24

  • 00:42 Realised high load was due to incorrectly unpinned devices
  • 01:00 Restored multiple snapshots of the database at 4-minute intervals around the incident
  • 01:28 Started working on data recovery from snapshots
  • 03:00 Deployed data recovery plan and all devices were restored to their correct state
Posted Sep 24, 2019 - 23:24 UTC

Resolved
This incident has been resolved.
Posted Sep 24, 2019 - 06:43 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 24, 2019 - 03:22 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 24, 2019 - 03:21 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 24, 2019 - 01:22 UTC
Investigating
We are currently investigating this issue.
Posted Sep 24, 2019 - 01:21 UTC
This incident affected: Delta Image Downloads.