On September 1, 2022, at 9:32 AM PT, a large number of simultaneous voice broadcasts (automatically delivered interactive voice messages) triggered two defects in the DialogTech platform: one in the third-party database proxy and one in the voice broadcast system.
These two independent, but related, defects effectively rendered our database system unusable, which prevented the DialogTech platform from operating normally.
As a result of these defects, calls intermittently failed to connect and the administrator portal was inaccessible for a total of 4 hours and 6 minutes across September 1st and 2nd. During the incident, dynamic number insertion (DNI) was disabled for a little more than an hour to help mitigate the impact on customers. In addition, scheduled voice broadcasts were delayed by up to 12.5 hours, after which all were delivered.
September 1st (9:00 AM - 3:17 PM PT)
At 9:00 AM PT, a large number of voice broadcasts were scheduled for delivery.
Between 9:00 AM and 10:26 AM PT, the database’s performance slowly degraded.
At 10:26 AM PT, the database proxy – a service that balances the load between DialogTech databases – determined the primary database was not responsive and shifted call traffic to the secondary database.
Between 11:13 AM and 12:30 PM PT, because database errors were still occurring, voice broadcast delivery was intentionally stopped and the primary database was completely removed from service.
At 12:33 PM PT, database system performance had degraded to the point that no calls could be processed.
Between 12:33 PM and 3:17 PM PT, the teams worked on restoring call processing by getting both the primary and secondary databases back into production.
At 3:17 PM PT, calls were being successfully processed. Investigation of the root cause began, and around-the-clock monitoring of the system was put in place.
At 5:06 PM PT, voice broadcasts were re-enabled.
September 2nd (9:30 AM - 12:58 PM PT)
At 9:30 AM PT, another large batch of voice broadcasts was scheduled for delivery and was closely monitored by a number of technical staff.
At 10:01 AM PT, the database performance was again degraded.
At 10:17 AM PT, in order to reduce the impact on customers, the dynamic number insertion system was disabled.
Between 10:14 AM PT and 12:58 PM PT, the system was unstable and calls were not processed for approximately 42 minutes while the team actively investigated the issue.
At 12:58 PM PT, the system was restored to full operation and the root cause investigation continued, with the focus on two systems – the database proxy and the voice broadcast process.
Three factors contributed to the incident:
First, in June 2022, when the DialogTech platform was migrated from an on-premises data center to the cloud, an error was introduced into the configuration file for the third-party database proxy.
Second, the configuration error, combined with the heavier-than-usual load from the large number of simultaneous voice broadcasts, triggered a defect in the third-party database proxy that caused it to intermittently fail to route database traffic.
Third, the database connectivity issues caused by the proxy triggered a defect in the voice broadcast system that caused it to place calls at a much faster rate than intended. The increased rate of calls put a strain on the database, causing its performance to degrade, which in turn caused the voice broadcast system to place calls at an even faster rate, thereby creating a feedback loop that ultimately overwhelmed the database.
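The feedback loop described in the third factor can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual voice broadcast code: the function name, the parameters (intended_rate, db_capacity, proxy_error_fraction, gain) and the arithmetic are all hypothetical and were chosen purely to show how a sender that speeds up when the database struggles can run away.

```python
def simulate_feedback_loop(intended_rate, db_capacity, proxy_error_fraction,
                           gain=1.0, steps=6):
    """Toy model of a call sender that speeds up as the database degrades.

    All names and formulas here are illustrative assumptions, not the
    real system's behavior.
    """
    rate = float(intended_rate)
    # The initial trouble comes from proxy routing errors, not from load.
    degradation = proxy_error_fraction
    history = []
    for _ in range(steps):
        # Hypothetical defect: the call rate grows in proportion to the
        # database trouble the sender observes, instead of backing off.
        rate = rate * (1.0 + gain * degradation)
        # Once the call rate exceeds what the database can handle, the
        # overload itself becomes an additional source of degradation.
        overload = max(0.0, rate / db_capacity - 1.0)
        degradation = max(proxy_error_fraction, overload)
        history.append(round(rate, 2))
    return history


if __name__ == "__main__":
    # No proxy errors: the rate stays at the intended value.
    print(simulate_feedback_loop(100, 150, proxy_error_fraction=0.0))
    # Intermittent proxy errors: the rate climbs on every step.
    print(simulate_feedback_loop(100, 150, proxy_error_fraction=0.2))
```

In this model a healthy proxy leaves the call rate unchanged, while even a modest fraction of proxy errors makes the rate climb on every step, and the climb accelerates once the rate passes the assumed database capacity.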
We have already:
In the coming weeks we will:
We apologize for the impact on our customers. We know that customer calls are a critical part of your business. You can be assured that we will do everything in our power to prevent incidents like this in the future.