On January 28, our API experienced degraded performance due to a cascading pod failure triggered by infrastructure inconsistency.
We unexpectedly had one node running on an older generation CPU, which caused uneven load distribution. When that node's pod failed a health check under normal traffic spikes, it triggered a cascading failure across other pods (a "thundering herd" effect).
We resolved the incident by:
We apologize for any disruption this caused and appreciate your patience as we continue improving our platform's resilience.