At 19:20 UTC on the 11th of November, we were alerted that our internal end-to-end tests were failing. On closer inspection, the failures appeared to be caused by a large ramp-up in the number of connection attempts by devices to the VPN, which initially looked like a Denial of Service attempt. However, further investigation showed that many of the connections came from customer device fleets, and we began to receive support queries from several customers.
As the investigation continued, our testing determined that no new connections to the VPN could be made at all. We discovered that this was due to expired VPN certificates, and quickly regenerated and deployed them. Around 21:56 UTC, device connections stabilized and we continued to monitor for further incidents. There were none.
This incident has exposed an issue with our certificate renewal process. balena’s service certificates are automatically rotated by our cloud infrastructure vendor. However, the VPN certificates are separate from this infrastructure, and during the move to the new system the alerts we would normally have received prior to expiration were removed by accident. Going forward, we have put a plan in place to ensure that we are alerted far in advance of certificate expiry. We are also discussing other plans to fully automate the checking and renewal of these certificates.
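As an illustration only, an advance-expiry alert of the kind described above could be sketched as follows. This is not our actual monitoring code: the function names, the alert levels, and the 30-day and 7-day thresholds are all hypothetical, and the sketch assumes the certificate's expiry timestamp has already been read from the certificate.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds: how far in advance alerts fire is an assumption.
WARN_DAYS = 30
CRITICAL_DAYS = 7


def days_until_expiry(not_after: datetime, now: datetime) -> float:
    """Return the number of days between `now` and the certificate's expiry."""
    return (not_after - now).total_seconds() / 86400


def expiry_alert_level(
    not_after: datetime,
    now: datetime,
    warn_days: int = WARN_DAYS,
    critical_days: int = CRITICAL_DAYS,
) -> str:
    """Classify a certificate as 'ok', 'warning', 'critical' or 'expired'."""
    remaining = days_until_expiry(not_after, now)
    if remaining <= 0:
        return "expired"
    if remaining <= critical_days:
        return "critical"
    if remaining <= warn_days:
        return "warning"
    return "ok"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Example: a certificate that expires in 20 days falls in the warning window.
    print(expiry_alert_level(now + timedelta(days=20), now))
```

A check like this would typically run on a schedule and page someone at the "warning" level, so that a lost alert in one system is still caught well before the certificate lapses.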
Additionally, we are discussing modifying our end-to-end tests to ensure that situations such as this one are reported as verbosely as possible.