Below you will find the Root Cause Analysis for the incident that occurred on Thursday, October 3rd, 2019.
Original Reported Subject: TeleSign SMS Delivery Delay
Date: Thursday, October 3rd, 2019
Start Time: 13:01 UTC
End Time: 13:04 UTC
On Thursday, October 3rd, customers experienced increased API response latency or HTTP errors for all products. The incident started at 13:01:45 UTC and lasted until 13:04:05 UTC. During this time, impacted customers may have received any of the following HTTP errors: 404, 500, 503, and 504.
Root Cause Analysis:
Under the previous load-balancing configuration of TeleSign's Content Delivery Network, activating a new configuration could trigger an outage during convergence if the old and new configurations did not share a common data center. When traffic was moved off an existing configuration and at least one of its data centers did not exist in the new version being activated, the endpoint became unavailable for a short period (approximately 1 to 3 minutes).
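The condition described above can be illustrated with a small pre-activation check. This is only a sketch and not TeleSign's or the CDN provider's actual tooling: the function names and the data center identifiers ("dc-a", "dc-b", "dc-c") are hypothetical, and a configuration is assumed to be representable as a plain list of data center names. The check is deliberately conservative and flags a transition if the two configurations share no data center or if any previous data center is missing from the new one.

```python
def compare_configs(old_dcs, new_dcs):
    """Compare the data centers listed in two load-balancing configurations.

    Both arguments are iterables of data center names (hypothetical format).
    """
    old_set, new_set = set(old_dcs), set(new_dcs)
    return {
        "shared": old_set & new_set,   # present in both configurations
        "dropped": old_set - new_set,  # present before, missing afterwards
    }


def is_safe_transition(old_dcs, new_dcs):
    """Return True only if the configurations overlap and nothing is dropped."""
    result = compare_configs(old_dcs, new_dcs)
    return bool(result["shared"]) and not result["dropped"]


# Hypothetical example: the new configuration drops "dc-a", so it is flagged.
old_config = ["dc-a", "dc-b"]
new_config = ["dc-b", "dc-c"]
if not is_safe_transition(old_config, new_config):
    print("Unsafe transition:", compare_configs(old_config, new_config))
```

A check of this kind could run before a configuration is activated, so that a risky change is caught before traffic is moved.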
To prevent this issue from recurring, or to minimize the risk of it doing so, TeleSign’s Tech OPS team has taken the following actions:
· A fix was deployed to all endpoints that ensures all data centers are available in all configurations and that changes are made only to the routing algorithms for each data center. This keeps a common data center present in all configurations at all times and prevents unavailability during configuration convergence.
· TeleSign followed up with our server host to update their documentation and best practices.
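The idea behind the deployed fix (every data center appears in every configuration, and only routing changes) can be sketched as follows. This is an illustration under stated assumptions, not the actual fix: the data center names, the `build_config` helper, and the use of a numeric weight as a stand-in for each data center's routing settings are all hypothetical.

```python
# Hypothetical fixed list of data centers that every configuration must include.
ALL_DATA_CENTERS = ("dc-a", "dc-b", "dc-c")


def build_config(weights):
    """Build a configuration that always lists every data center.

    `weights` maps data center names to a routing weight (an assumed stand-in
    for routing settings). Data centers not mentioned get weight 0: they stay
    in the configuration but are assigned no traffic.
    """
    unknown = set(weights) - set(ALL_DATA_CENTERS)
    if unknown:
        raise ValueError(f"Unknown data centers: {sorted(unknown)}")
    return {dc: weights.get(dc, 0) for dc in ALL_DATA_CENTERS}


# Two configurations with different routing but the same set of data centers.
before = build_config({"dc-a": 50, "dc-b": 50})
after = build_config({"dc-c": 100})
assert set(before) == set(after)
```

Because both configurations are built from the same fixed list, any two of them share every data center, so the transition condition that caused the outage cannot arise.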
We apologize for the inconvenience this may have caused you. Should you have any questions, please don’t hesitate to contact us at firstname.lastname@example.org.