Elevated Device VPN Tunnel Errors

Incident Report for balena.io

Postmortem

On September 30th, following a production deployment of a core component, a critical authorization failure affected the balena device tunnel command, preventing users from establishing port tunnel connections.

Summary of the Incident

A subtle bug was introduced several weeks ago in a small database query change within a component. This change passed our standard review process and all pre-deployment testing, including unit tests utilizing mocked API endpoints.

Because of other pending changes, this component update was not immediately deployed. When it was finally released to production on September 30th, the query exhibited an unexpected incompatibility with the live production API environment.

The failure was not immediately apparent through our primary monitoring, but once the authorization issue was identified, our team quickly found the flawed query, deployed a patched component, and restored full functionality.

Root Cause and Timeline

  • Change Introduction: A small change to a database query was merged several weeks ago.
  • Failed Validation: The change passed unit tests and code review but failed to correctly interact with the real-world production API due to a subtle environmental or data-specific condition.
  • Deployment & Failure: On September 30th, the component was deployed. The authorization failure for balena device tunnel was subsequently observed.
  • Resolution: The bug was quickly diagnosed, the query was patched, and a fixed component was deployed to production, resolving the incident.

Corrective Actions

We are taking immediate steps to prevent this type of failure from recurring:

  1. Unit Test Realism: We have updated our unit test mocks to more accurately reflect invalid or non-standard production responses, ensuring future query changes are validated against real-world failure modes.
  2. End-to-End Test Scheduling: We are prioritizing the development and scheduling of a new end-to-end test specifically dedicated to validating the full functionality of the device tunnel command in a production-like environment. This will catch integration errors sooner.

We apologize for the interruption this caused to your workflow. We are committed to using this incident to improve the robustness and reliability of our continuous deployment process.

Posted Oct 01, 2025 - 12:52 UTC

Resolved

This incident has been resolved.
Posted Sep 30, 2025 - 21:56 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Sep 30, 2025 - 21:34 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Sep 30, 2025 - 20:35 UTC

Investigating

We're experiencing an elevated level of errors in our Device VPN Tunnel infrastructure and are currently looking into the issue.
Posted Sep 30, 2025 - 19:47 UTC
This incident affected: Cloudlink (VPN).