Over the past few months, we’ve been focussed heavily on improving the performance of the balenaCloud backend in order to scale with the growing number of devices joining the platform.
As part of that work, we recently implemented cross-instance metrics throttling, so that the API instances in the cluster are aware of each other when throttling incoming device metrics. However, yesterday we discovered a bug in our original implementation that effectively invalidated the throttling gains and increased the load on the backend database by a factor of 3-4.
We’ve now deployed a fix to correct and optimize the handling of cross-instance metrics throttling across all of the API backends, and have observed no further periods of unsustainably high backend load.
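
For readers unfamiliar with the idea, here is a minimal sketch of what cross-instance throttling can look like. It is an illustration only, not the balenaCloud implementation: `SharedStore`, `InMemorySharedStore`, `ApiInstance`, the `metrics:` key prefix and the 60-second window are all hypothetical names and values, and the in-memory map stands in for whatever external store a real cluster would share.

```typescript
// Hypothetical shared store. In a real cluster this would be an external
// service reachable by every API instance, not an in-process Map.
interface SharedStore {
  // Records `key` with an expiry unless an unexpired entry already exists.
  // Returns true only when this call created the entry.
  setIfAbsent(key: string, nowMs: number, ttlMs: number): boolean;
}

class InMemorySharedStore implements SharedStore {
  private readonly expiries = new Map<string, number>();

  setIfAbsent(key: string, nowMs: number, ttlMs: number): boolean {
    const expiresAt = this.expiries.get(key);
    if (expiresAt !== undefined && expiresAt > nowMs) {
      return false; // another report was accepted inside the window
    }
    this.expiries.set(key, nowMs + ttlMs);
    return true;
  }
}

// Hypothetical API instance: it only writes to the database when the shared
// store says no instance has written for this device inside the window.
class ApiInstance {
  dbWrites = 0;
  private readonly store: SharedStore;
  private readonly windowMs: number;

  constructor(store: SharedStore, windowMs: number) {
    this.store = store;
    this.windowMs = windowMs;
  }

  handleMetrics(deviceId: string, nowMs: number): boolean {
    const allowed = this.store.setIfAbsent(
      `metrics:${deviceId}`,
      nowMs,
      this.windowMs,
    );
    if (allowed) {
      this.dbWrites += 1; // stand-in for the real database write
    }
    return allowed;
  }
}

// Three instances share one store; the same device reports to each in turn.
const sharedStore = new InMemorySharedStore();
const cluster = [0, 1, 2].map(() => new ApiInstance(sharedStore, 60_000));
const accepted = cluster.map((api, i) => api.handleMetrics("device-1", i * 1_000));
const totalWrites = cluster.reduce((sum, api) => sum + api.dbWrites, 0);
console.log(accepted, totalWrites); // [ true, false, false ] 1
```

Because every instance consults the same store, the three reports result in a single write. If each instance kept its own private store instead, the same three reports would each be accepted, which is why throttling has to be coordinated across instances to reduce database load.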