Elevated API Errors
Incident Report for Balena.io
Postmortem

tl;dr an operations error lead to inadvertently targeting the live cluster instead of test and causing full re-provisioning of all backend assets, including load balancers and DNS records‌

balenaCloud services are deployed into Kubernetes, using the Flux GitOps framework with configuration historically split between test and production in different Git repositories. There is quite a lot of configuration duplication between these repositories, leading to operational drift between environments. One of the improvements we are currently working on is to combine them into a single source of truth using Kustomize bases/overlays approach.

In order to begin with the refactoring of the configuration, we deployed a new flux instance into the test cluster. Unfortunately, the operator ran the CLI, with the assumption that the tooling respected the KUBECTL_CONTEXT environment variable, which at the time was pointing to the test cluster. The specific tooling did not respect the environment variable and instead used the default Kubernetes context set in ~/.kube/config, which was pointing at the live production cluster. Once this new instance of flux, pointing to an empty GitHub configuration repository, was installed into the production cluster, it deleted all of the existing production assets, including replica-sets, deployments and their corresponding services and ingress(s) (i.e AWS ELB/NLB/ALBs).

Recovery effort involved four engineers and took just under an hour to restore the service. It involved manually applying the configuration back to the production cluster using Kubernetes CLI tooling, addressing some configuration inconsistencies, which were blocking creation of some cloud assets, and recreating/repointing DNS records at the new load balancers. Lastly, Flux was re-configured/re-enabled against production GitHub repositories.

To prevent these sorts of event in the future, we’ve made a note of updating the default context to point to the test cluster as well as double and triple checking assumptions made when using less familiar tool chains (e.g.):

$ grep current-context ~/.kube/config
current-context: arn:aws:eks:us-east-1:1234567890:cluster/test
Posted Sep 12, 2023 - 15:26 UTC

Resolved
This incident has been resolved.
Posted Sep 12, 2023 - 01:01 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 12, 2023 - 01:00 UTC
Update
We are continuing to monitor for any further issues.
Posted Sep 12, 2023 - 00:12 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Sep 12, 2023 - 00:12 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Sep 11, 2023 - 23:58 UTC
Investigating
We're experiencing an elevated level of API errors and are currently looking into the issue.
Posted Sep 11, 2023 - 23:34 UTC
This incident affected: API, Application Builder, Application Registry, BalenaOS Download, Dashboard, Delta Image Downloads, Device URLs, Git, Cloudlink (VPN), and balenahub.