We have discovered that device status can become desynchronized over time between our VPN servers and our main database. The root cause turned out to be a fault in the application logic that marks a device's online status when it disconnects and reconnects. The error was accidentally introduced in a change that improved our VPN services' resource utilization (and thus speed).
We run multiple VPN servers, and when a device disconnects and reconnects, it may end up connecting to any of them. If a device's network connection is interrupted in such a way that the VPN connection is not explicitly terminated, the server keeps the connection alive for a while before considering it closed (it times out due to the lack of keep-alive communication). If the device reconnected to another VPN instance before that timeout, it kept its online status in our database, but the original connection's timeout then marked it offline, even though it was still connected. This was only resolved for a device when it reconnected again (due to network changes, a reboot, or a service restart, for example).
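The sequence of events can be illustrated with a minimal sketch. This is not our actual code: the dictionary stands in for the main database, and the handler names, device ID, and server names are hypothetical.

```python
# Hypothetical stand-in for the online-status column in the main database.
device_online = {}

def on_connect(device_id, server):
    # Any server that accepts a connection marks the device online.
    device_online[device_id] = True

def on_disconnect_naive(device_id, server):
    # Faulty logic: marks the device offline without checking whether
    # it still has a live connection on another server.
    device_online[device_id] = False

on_connect("device-1", "vpn-a")           # initial connection
# The network drops without an explicit VPN teardown;
# vpn-a keeps the stale session until its keep-alive timeout.
on_connect("device-1", "vpn-b")           # device reconnects to another server
on_disconnect_naive("device-1", "vpn-a")  # vpn-a's timeout fires afterwards

print(device_online["device-1"])  # False, although vpn-b still has the device
```

The late timeout event from the first server overwrites the status set by the newer connection, which matches the desynchronization described above.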
Triaging this issue took us longer than expected. As a side effect, we now have more effective tooling in place for troubleshooting any future issues in our VPN.
We've implemented correct accounting for device connection events in scenarios such as the one above, and the fix has been deployed for more than a week. We have not seen the desynchronization issue recur since.
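One possible way to account for connection events correctly is to track which servers currently hold a connection for each device, and to treat the device as offline only when that set is empty. The sketch below is an assumption for illustration, not a description of the deployed fix; all names in it are hypothetical.

```python
from collections import defaultdict

# Hypothetical bookkeeping: device ID -> set of servers holding a live connection.
active_servers = defaultdict(set)

def record_connect(device_id, server):
    active_servers[device_id].add(server)

def record_disconnect(device_id, server):
    # Only the connection on this specific server is removed.
    active_servers[device_id].discard(server)

def is_online(device_id):
    # Online as long as at least one server still holds a connection.
    return bool(active_servers.get(device_id))

record_connect("device-1", "vpn-a")
record_connect("device-1", "vpn-b")     # reconnect lands on another server
record_disconnect("device-1", "vpn-a")  # stale session on vpn-a times out late

print(is_online("device-1"))  # True: vpn-b still holds the connection
```

With this kind of bookkeeping, a late timeout on the old server only removes that server's entry, so it cannot override the status established by the newer connection.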