DialogTech Service Disruption
Incident Report for DialogTech
Postmortem

What happened?

Summary

On September 1, 2022, at 9:32 AM PT, a large number of simultaneous voice broadcasts (automatically delivered interactive voice messages) triggered two defects in the DialogTech platform: 

  1. A defect in a third-party software system used for balancing our database traffic.
  2. A defect in our software that controls the rate at which DialogTech sends voice broadcasts.

These two independent, but related, defects effectively rendered our database system unusable, which prevented the DialogTech platform from operating normally.

As a result of these defects, calls did not connect and the administrator portal was inaccessible intermittently, for 4 hours and 6 minutes over the course of September 1st and 2nd. To help mitigate the impact on customers during the incident, dynamic number insertion (DNI) was disabled for a little more than an hour. In addition, scheduled voice broadcasts were delayed by up to 12.5 hours; all were ultimately delivered.

Technical Sequence of Events

September 1st (9:00 AM - 3:17 PM PT) 

At 9:00 AM PT, a large number of voice broadcasts were scheduled for delivery.

Between 9:00 AM and 10:26 AM PT, the database’s performance slowly degraded.

At 10:26 AM PT, the database proxy – a service that balances the load between DialogTech databases – determined the primary database was not responsive and shifted call traffic to the secondary database.
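
For readers unfamiliar with this kind of failover, the decision can be sketched roughly as follows. This is an illustration only and not the third-party proxy's actual logic: the `Backend` class, the `choose_backend` function, and the threshold of three failed probes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A database the proxy can route to (hypothetical model)."""
    name: str
    consecutive_failures: int = 0

    def record_probe(self, ok: bool) -> None:
        # A successful health probe resets the failure streak;
        # a failed probe extends it.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def choose_backend(primary: Backend, secondary: Backend,
                   max_failures: int = 3) -> Backend:
    """Route to the primary unless it failed max_failures probes in a row."""
    if primary.consecutive_failures >= max_failures:
        return secondary
    return primary
```

In this sketch, a primary that stops answering health probes is bypassed after three consecutive failures, and traffic returns to it once a probe succeeds again.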

Between 11:13 AM and 12:30 PM PT, because database errors were still occurring, voice broadcast delivery was intentionally stopped, and the primary database was completely removed from service.

At 12:33 PM PT, database system performance had degraded to the point that no calls could be processed.

Between 12:33 PM and 3:17 PM PT, the teams worked on restoring call processing by getting both the primary and secondary databases back into production.

At 3:17 PM PT, calls were being successfully processed. The investigation of the root cause of the incident began, and around-the-clock monitoring of the system was put in place.

At 5:06 PM PT, voice broadcasts were re-enabled.

September 2nd (9:30 AM - 12:58 PM PT)

At 9:30 AM PT, another large batch of voice broadcasts was scheduled for delivery and was closely monitored by a number of technical staff.

At 10:01 AM PT, the database performance was again degraded.

At 10:17 AM PT, in order to reduce the impact on customers, the dynamic number insertion system was disabled. 

Between 10:14 AM PT and 12:58 PM PT, the system was unstable and calls were not processed for approximately 42 minutes while the team actively investigated the issue.

At 12:58 PM PT, the system was restored to full operation and the root cause investigation continued, with the focus on two systems – the database proxy and the voice broadcast process. 

Why did it happen?

There were three factors that contributed to the incident:

First, in June 2022, during the migration of the DialogTech platform from an on-premises data center to the cloud, an error was introduced into the configuration file for the third-party database proxy.

Second, the configuration error, combined with the heavier-than-usual load from the large number of simultaneous voice broadcasts, triggered a defect in the third-party database proxy that caused it to intermittently fail to route database traffic.

Third, the database connectivity issues caused by the proxy triggered a defect in the voice broadcast system that caused it to place calls at a much faster rate than intended. The increased rate of calls put a strain on the database, causing its performance to degrade, which in turn caused the voice broadcast system to place calls at an even faster rate, thereby creating a feedback loop that ultimately overwhelmed the database.
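
The feedback loop described above is the opposite of how a pacing mechanism is normally expected to behave: a sender should slow down, not speed up, when the system it depends on is struggling. A minimal sketch of that expected behavior follows, using an additive-increase, multiplicative-decrease rule. It is not the actual DialogTech fix; the function name `next_call_rate` and every parameter and default value are assumptions chosen for illustration.

```python
def next_call_rate(current_rate: float, db_latency_ms: float, db_error: bool,
                   target_latency_ms: float = 50.0, min_rate: float = 1.0,
                   max_rate: float = 100.0, step: float = 1.0,
                   backoff: float = 0.5) -> float:
    """Return the calls-per-second rate to use for the next interval.

    All names and defaults are hypothetical, for illustration only.
    """
    if db_error or db_latency_ms > target_latency_ms:
        # The database is struggling: cut the rate multiplicatively,
        # but never drop below the configured floor.
        return max(min_rate, current_rate * backoff)
    # The database looks healthy: increase gently, never above the cap.
    return min(max_rate, current_rate + step)
```

With these example defaults, a database error or a latency reading above the target halves the rate, while a healthy reading raises it by one call per second. Degraded database performance therefore reduces load instead of adding to it.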

What are we doing to ensure it does not happen again?

We have already:

  • Fixed the configuration error in the configuration file of the database proxy.
  • Validated that the configuration fix prevents the proxy defect that caused database traffic to be load-balanced incorrectly from being triggered.
  • Fixed the defect in the voice broadcast system to prevent it from placing calls at faster rates when the database performance is degraded.

In the coming weeks we will:

  • Update the database proxy to the latest version that has a fix for the defect that caused the proxy to intermittently fail to properly route the database traffic.
  • Improve and increase our monitoring and alerting of our database infrastructure to aid in faster detection and mitigation should issues arise.
  • Update our incident management protocols to engage mitigation measures (disabling DNI) earlier during an incident.
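
As a rough illustration of the kind of check that could support earlier detection and earlier mitigation, the sketch below tracks the database error rate over a sliding window and signals when it crosses a threshold. It does not describe DialogTech's monitoring stack; the `ErrorRateMonitor` class, the window size, and the threshold are all hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Sliding-window error-rate check (hypothetical, for illustration)."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        # Only the most recent `window` results are kept.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, ok: bool) -> None:
        """Record the outcome of one database request."""
        self.samples.append(ok)

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        return self.samples.count(False) / len(self.samples)

    def should_mitigate(self) -> bool:
        # Require a full window so a few early failures do not
        # trigger mitigation on their own.
        full = len(self.samples) == self.samples.maxlen
        return full and self.error_rate() > self.threshold
```

A signal like `should_mitigate()` could page on-call staff or prompt a mitigation step such as disabling DNI; whether to automate that step or leave it to a person is a separate operational decision.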

We apologize for the impact on our customers. We know that customer calls are a critical part of your business. You can be assured that we will do everything in our power to prevent incidents like this in the future.

Posted Sep 12, 2022 - 08:34 CDT

Resolved
This incident is now resolved. A Root Cause Analysis (RCA) will be added to the post-mortem section later this week. We know that calls are mission-critical to your business, and appreciate your partnership as we worked to resolve this issue.
Posted Sep 06, 2022 - 15:11 CDT
Update
Inbound call volume continues to operate normally, as do outbound phone calls via the DialogTech voice broadcast application. The team continues to monitor system performance closely.
Posted Sep 04, 2022 - 17:07 CDT
Update
Inbound call volume continues to operate normally, as do outbound phone calls via the DialogTech voice broadcast application. We have introduced additional instrumentation to assist with ongoing monitoring and investigation.
Posted Sep 03, 2022 - 12:19 CDT
Monitoring
Inbound call volume has been operating normally as of 4:15 PM EDT. Outbound phone calls via the DialogTech Voice Broadcast application are currently paused. We will continue to monitor the stability of the platform.
Posted Sep 02, 2022 - 18:02 CDT
Update
We are continuing to work on a fix for this issue.
Posted Sep 02, 2022 - 17:13 CDT
Identified
We have re-enabled dynamic number insertion. Calls should be flowing through the platform again. We're still troubleshooting the issue and are monitoring the platform's stability.
Posted Sep 02, 2022 - 15:23 CDT
Update
We have disabled dynamic number insertion on customer websites so that hard-coded numbers are shown to web visitors.
Posted Sep 02, 2022 - 14:46 CDT
Investigating
We are continuing to experience intermittent issues with call processing.
Posted Sep 02, 2022 - 14:37 CDT
Identified
We are experiencing intermittent issues which can impact call processing, reporting, and API requests.
Posted Sep 02, 2022 - 14:25 CDT
Monitoring
Dynamic number insertion is currently operational. The issue has been mitigated and we're currently monitoring the stability of the platform.
Posted Sep 02, 2022 - 12:54 CDT
Update
We're beginning to reprocess calls through the system. User Portal is now available.
Posted Sep 02, 2022 - 12:41 CDT
Identified
The issue has been identified. Access to the User Portal is impacted and there are intermittent call failures. The rotation of dynamic numbers on customer websites has been disabled.
Posted Sep 02, 2022 - 12:35 CDT
Investigating
We're investigating a system outage on our User Portal and call processing.
Posted Sep 02, 2022 - 12:14 CDT
Update
Call processing has been upgraded from Degraded Performance to Operational. User Portal is also fully operational. We will continue to monitor the platform to ensure stability continues.
Posted Sep 01, 2022 - 18:05 CDT
Monitoring
Call processing has been reinstated and the DialogTech UI is operational. We are continuing to monitor the system's stability.
Posted Sep 01, 2022 - 17:46 CDT
Identified
The issue has been identified and we are in the process of mitigating the root cause. Thank you for your patience.
Posted Sep 01, 2022 - 17:20 CDT
Update
We are still investigating issues with user portal access and call processing. We will continue to update this incident as we work towards a resolution.
Posted Sep 01, 2022 - 16:19 CDT
Investigating
We are investigating a system outage involving the User Portal being unavailable as well as other application issues including call processing.
Posted Sep 01, 2022 - 15:33 CDT
This incident affected: DialogTech Platform - Call Processing and DialogTech Platform - User Portal.