On Monday, September 23rd, our system experienced data corruption as a result of a faulty database change. A large number of previously pinned devices were unpinned and began tracking the release targeted by their application, which for most applications meant the latest release. Of those devices, 20% “successfully” updated and ran code that was not intended for them. The remaining 80% were otherwise unaffected, though they may still have generated increased bandwidth usage.
Our team restored the database from a backup and, to our knowledge, no data loss occurred. Shortly after the database was recovered, the affected devices re-synced with the appropriate releases. A very small number of devices experienced local data loss due to a Docker volume mismatch between the pinned and the latest release. We will be contacting the affected customers directly about this issue.
We understand that this incident is very painful for our customers. Our testing procedures were unable to catch this issue in our staging environment. We are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future.
We regularly deploy modifications to our database schema, ensuring that all changes are backwards compatible. Traditionally, we have migrated all the affected data in a single transaction. This simple approach has served us well, but as our system grows, locking tables for long periods of time results in unacceptable downtime.
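For illustration, a single-transaction migration of this kind might look roughly like the sketch below. The table and column names are hypothetical, and psycopg2 against PostgreSQL is assumed purely as an example; the point is that the data back-fill holds its locks until the transaction commits.

```python
import psycopg2

# Hypothetical schema: a `device` table gaining a `pinned_release_id` column.
# The single UPDATE touches every matching row, so those rows stay locked
# for writes until the transaction commits.
conn = psycopg2.connect("dbname=example")
try:
    with conn:  # commits on success, rolls back on exception
        with conn.cursor() as cur:
            cur.execute("ALTER TABLE device ADD COLUMN pinned_release_id integer")
            cur.execute(
                """
                UPDATE device
                SET pinned_release_id = running_release_id
                WHERE is_pinned = true
                """
            )
finally:
    conn.close()
```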
This time we used a different strategy to deploy the data migration: the schema passed through an intermediate state during which we migrated the data in small batches. Unfortunately, the change to the data was incorrect and pinned every device to its current release. By itself this was not an issue, because freezing devices to their currently running release causes no service disruption and does not trigger any updates. After noticing the error, an operator started a custom rollback to undo all the changes. Due to the urgency of the situation, the rollback did not go through our regular review process, and it ended up unpinning every device on the platform. To make matters worse, the fact that devices had been incorrectly unpinned went unnoticed until we were alerted to high load on our system.
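As a rough sketch of the batched strategy described above (again with hypothetical table and column names, and psycopg2/PostgreSQL assumed as an example), the schema first gains the new column as an intermediate state, and the data is then copied over in small, independently committed batches so that no single statement holds locks for long:

```python
import psycopg2

BATCH_SIZE = 1000

conn = psycopg2.connect("dbname=example")

# Intermediate state: add the new column without touching any data, so the
# schema can hold both the old and the new representation at once.
with conn, conn.cursor() as cur:
    cur.execute(
        "ALTER TABLE device ADD COLUMN IF NOT EXISTS pinned_release_id integer"
    )

# Copy the data over in small batches, each in its own short transaction.
while True:
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE device
            SET pinned_release_id = running_release_id
            WHERE id IN (
                SELECT id FROM device
                WHERE pinned_release_id IS NULL AND is_pinned = true
                ORDER BY id
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            )
            """,
            (BATCH_SIZE,),
        )
        if cur.rowcount == 0:  # nothing left to migrate
            break

conn.close()
```

The benefit is that each batch commits quickly, but the trade-off is that the migration is no longer a single atomic step, which is why a carefully reviewed rollback path matters so much with this approach.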
Our database backup system can restore the database with sub-minute granularity, which allowed our operators to narrow down snapshots to immediately before and after the faulty changes. Using these two datasets along with the current state of the system, we were able to perform a “3-way merge”: undoing the faulty records while preserving any legitimate device pins made during the incident window.
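Conceptually, the merge works as in the sketch below. This is a simplified in-memory illustration, assuming each input is a mapping from device id to pinned release id (None meaning unpinned); the real restore operated on full database snapshots rather than Python dicts.

```python
# Sketch of the "3-way merge" used during recovery (simplified).

def three_way_merge(before, after_fault, current):
    """Return the repaired device -> pinned release mapping.

    before      -- snapshot taken just before the faulty changes
    after_fault -- snapshot taken just after the faulty changes
    current     -- live state of the system at recovery time
    """
    repaired = {}
    for device_id, live_pin in current.items():
        if live_pin == after_fault.get(device_id):
            # Untouched since the faulty change: restore the pre-incident pin.
            repaired[device_id] = before.get(device_id)
        else:
            # Changed during the incident window (e.g. a legitimate pin made
            # by a user): keep the live value.
            repaired[device_id] = live_pin
    return repaired


if __name__ == "__main__":
    before = {1: 10, 2: None, 3: 30}            # pre-incident pins
    after_fault = {1: None, 2: None, 3: None}   # faulty rollback unpinned everything
    current = {1: None, 2: 20, 3: None}         # device 2 was pinned during the window
    print(three_way_merge(before, after_fault, current))
    # -> {1: 10, 2: 20, 3: 30}
```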
It is clear that our current testing process and staging deployments are not able to catch errors like this. In light of this incident we will invest in better tooling for running migrations, including running every migration against a snapshot of the production database and checking that a set of invariants still holds after both the forward and the backward step. What made this incident particularly worrisome is that even though the forward migration was faulty, it did not manifest any issues: a sanity check of the database afterwards would have passed. It was the rollback of the migration that was destructive.
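As a sketch of what such tooling could look like (the column name, helper functions, and use of psycopg2 here are illustrative assumptions, not our actual implementation), the rehearsal restores a production snapshot, runs the forward and then the backward step, and asserts after each one that the release every device is expected to run has not changed:

```python
import psycopg2

def target_releases(conn):
    """Map each device to the release it is expected to run (hypothetical column)."""
    with conn.cursor() as cur:
        cur.execute("SELECT id, should_be_running_release_id FROM device")
        return dict(cur.fetchall())

def rehearse_migration(dsn, migrate_forward, migrate_backward):
    """Run both migration steps against a restored production snapshot at `dsn`."""
    conn = psycopg2.connect(dsn)
    try:
        baseline = target_releases(conn)

        migrate_forward(conn)
        # A forward step that silently changes what devices should run is
        # caught here; in this incident it would still have passed, since
        # pinning a device to its current release changes nothing effective.
        assert target_releases(conn) == baseline, "forward step changed device targets"

        migrate_backward(conn)
        # The destructive rollback would have failed this check: unpinning a
        # previously pinned device changes its target to the latest release.
        assert target_releases(conn) == baseline, "backward step changed device targets"
    finally:
        conn.close()
```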
Additionally, we are making changes to our communication procedures so that incidents like this are handled and communicated in a more timely manner.
Timeline (all times in UTC): 2019-09-23 to 2019-09-24