Elevated Device URLs/VPN Errors
Incident Report for Balena.io
Postmortem

At 19:20 UTC on the 11th of November, we were alerted that our internal end-to-end tests were failing. On closer inspection, the failures appeared to be caused by a sharp increase in the number of connection attempts by devices to the VPN, which at first looked like a denial-of-service attempt. However, further investigation showed that many of the connections came from customer device fleets, and we began to receive support queries from several customers.

Investigation continued, and our testing determined that new connections to the VPN could not be made at all. We discovered that this was due to expired VPN certificates, and quickly regenerated and deployed them. At around 21:56 UTC device connections stabilized, and we continued to monitor for further incidents. There were none.

This incident has exposed an issue with our certificate renewal process. balena’s service certificates are automatically rotated by our cloud infrastructure vendor. The VPN certificates, however, are separate from this infrastructure, and during the move to the new system the alerts we would normally have received prior to their expiration were removed by accident. Going forward, we have put a plan in place to ensure that we are alerted far in advance of certificate expiry. We are also discussing other plans to fully automate the checking and renewal of these certificates.
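
As an illustration only, an advance-warning check of the kind described above could look like the sketch below. It is not balena's actual alerting. The function names, the 30-day and 7-day thresholds, and the example expiry date are all hypothetical. The sketch assumes the certificate's expiry is available as a "notAfter" string in the format printed by `openssl x509 -enddate` or returned by Python's `ssl` module (for example "Nov 11 00:00:00 2019 GMT").

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days between `now` and the certificate's notAfter time.

    `not_after` must use the "%b %d %H:%M:%S %Y GMT" format, which is what
    ssl.cert_time_to_seconds() parses. `now` must be timezone-aware.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch (UTC)
    return (expires_at - now.timestamp()) / 86400


def alert_level(days_left: float, warn_days: int = 30, critical_days: int = 7) -> str:
    """Map the remaining lifetime to an alert level (thresholds are assumptions)."""
    if days_left <= 0:
        return "expired"
    if days_left <= critical_days:
        return "critical"
    if days_left <= warn_days:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Hypothetical expiry date, used only to exercise the functions.
    remaining = days_until_expiry(
        "Nov 11 00:00:00 2019 GMT",
        datetime.now(timezone.utc),
    )
    print(alert_level(remaining))
```

Run on a schedule, a check like this would raise a warning well before the expiry date rather than at the moment connections start failing. Taking the current time as a parameter keeps the logic easy to test.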

Additionally, we are discussing modifying our end-to-end tests so that situations such as this one are reported as verbosely as possible.

Posted Nov 14, 2019 - 21:24 UTC

Resolved
This incident has been resolved.
Posted Nov 12, 2019 - 00:10 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 11, 2019 - 21:49 UTC
Update
We are continuing to work on a fix for this issue.
Posted Nov 11, 2019 - 21:22 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Nov 11, 2019 - 21:22 UTC
Investigating
We're experiencing an elevated level of errors in our Device URLs and VPN infrastructure and are currently looking into the issue.
Posted Nov 11, 2019 - 20:02 UTC
This incident affected: Cloudlink (VPN).