Small percentage of devices incorrectly marked as offline

Incident Report for balena.io

Postmortem

We discovered that device status could become desynchronized between our VPN servers and our main database over time. The root cause was a fault in the application logic that marks a device's online status when it disconnects and reconnects. The fault was accidentally introduced in a change that improved our VPN service's resource utilization (and thus speed).

We run multiple VPN servers, and when a device disconnects and reconnects, it may end up connecting to any of them. If a device's network connection is interrupted in a way that does not explicitly terminate the VPN connection, the server keeps the connection alive for a while before considering it closed (timing out due to the lack of keep-alive communication). If the device reconnected to another VPN instance before that timeout, it kept its online status in our database, but the original connection's timeout then marked it offline. The situation was only resolved for a device when it reconnected again (due to network changes, a reboot, or a service restart, for example).
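The sequence of events can be illustrated with a minimal sketch. This is not our production code: the shared `status` table, the handler names, and the server names `vpn-a` and `vpn-b` are hypothetical and only model the ordering problem described above.

```python
# Hypothetical model: one shared status table written to by every VPN server.
status = {}  # device_id -> "online" | "offline"


def on_connect(device_id, server):
    # Any server that accepts a connection marks the device online.
    status[device_id] = "online"


def on_disconnect(device_id, server):
    # Faulty logic: marks the device offline without checking whether
    # it has since connected to a different server.
    status[device_id] = "offline"


on_connect("dev1", "vpn-a")     # initial connection
# The network drops silently here; vpn-a has not noticed yet.
on_connect("dev1", "vpn-b")     # device reconnects to another instance
on_disconnect("dev1", "vpn-a")  # vpn-a's keep-alive timeout fires late

print(status["dev1"])  # offline, although dev1 is still connected to vpn-b
```

The late timeout from the old connection is the last write, so it wins, and the device stays marked offline until its next connect event.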

Triaging this issue took us longer than expected. As a side effect, we now have more effective tooling in place for troubleshooting any future issues in our VPN.

We have implemented correct accounting of device connection events in scenarios such as the one above, and the fix has been deployed for more than a week. We have not seen the desynchronization issue recur since.
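One way such accounting could work is to record which servers currently hold a connection for each device, so that a late timeout from an old connection cannot override a newer one. The sketch below shows the idea only and is not necessarily what was deployed; the `StatusTable` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class StatusTable:
    # Hypothetical store: device_id -> set of servers holding a live connection.
    connections: dict = field(default_factory=dict)

    def on_connect(self, device_id, server):
        self.connections.setdefault(device_id, set()).add(server)

    def on_disconnect(self, device_id, server):
        # Remove only the connection that actually closed.
        self.connections.get(device_id, set()).discard(server)

    def is_online(self, device_id):
        # A device is online while at least one server still holds a connection.
        return bool(self.connections.get(device_id))


table = StatusTable()
table.on_connect("dev1", "vpn-a")
table.on_connect("dev1", "vpn-b")     # reconnect to another instance
table.on_disconnect("dev1", "vpn-a")  # late timeout from the old connection
print(table.is_online("dev1"))  # True
```

Because a disconnect only removes the connection it belongs to, the order in which the reconnect and the timeout arrive no longer affects the final status.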

Posted Oct 16, 2019 - 14:05 UTC

Resolved

This incident has been resolved.

Posted Oct 07, 2019 - 18:39 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Oct 07, 2019 - 15:05 UTC

Update

The difference between the device status reported by the VPN servers and the status in our database is currently ~1%.

Posted Oct 05, 2019 - 16:56 UTC

Investigating

We have discovered that device status syncing between our VPN server and our main database can get desynchronized over time. This started happening as a result of performance optimisations we made recently. The effect is that a device is shown as offline in the dashboard when it is actually connected.

We are continuously monitoring the difference and are rotating instances to keep the difference small while we investigate the underlying issue.
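As an illustration of the kind of metric being tracked, the difference could be computed as the fraction of devices whose database status disagrees with the VPN servers' view. This definition, the function name `status_difference`, and its parameters are assumptions made for this example, not a description of our monitoring.

```python
def status_difference(vpn_connected, db_online, all_devices):
    """Fraction of devices whose database status disagrees with the VPN view.

    All three arguments are sets of device IDs (hypothetical inputs).
    """
    if not all_devices:
        return 0.0
    mismatched = {
        d for d in all_devices
        if (d in vpn_connected) != (d in db_online)
    }
    return len(mismatched) / len(all_devices)


devices = {f"dev{i}" for i in range(100)}
vpn_view = set(devices)        # every device holds a VPN connection
db_view = devices - {"dev0"}   # one device is wrongly marked offline
print(status_difference(vpn_view, db_view, devices))  # 0.01
```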

Posted Oct 05, 2019 - 16:38 UTC

This incident affected: Device URLs and Cloudlink (VPN).