Download the Spanish version of this report: https://drive.google.com/file/d/1IiEQArry8AjBU4M-g6TPHQ95j-iK7EXo/view?usp=sharing
Incident Report
Incident ID: CYMSLP-126-241105
Date: 11/05/2024
INCIDENT DESCRIPTION
Incident reported by: DevOps
Date and time of the incident: 11/05/2024 - 12:30 PM (GMT-4)
Time elapsed from detection by the DevOps team to resolution: 12:30 PM - 4:00 PM (GMT-4), 3 hours 30 minutes
Details about the services affected by the incident:
- Severe degradation in communication delivery speed. Communications queued for prolonged periods across all channels, including email, SMS, and push notifications.
- Both mass campaigns and transactional notifications were affected.
- A group of platform users reported that communication processing was halted.
Severity of the incident (Critical, High, Medium, Low): Critical. Duration: 3 hours 30 minutes from the initial incident report to resolution.
Frequency of this type of incident: Very low. First occurrence in 15 years.
CAUSE OF THE INCIDENT
Details about the vulnerability that caused the incident: A very high number of write accesses to the tables managing the error logs of the communication orchestrator was detected.
An error-log write occurs whenever a message cannot be processed. The most common reasons are an invalid recipient address or an undefined processing route.
Although logging errors is a normal process within the platform, the massive write volume caused a bottleneck on the error-log table, halting message processing.
Example of the query involved:
Degradation of insertion times in the error log:
It was determined that the failure originated from the simultaneous activation of 8 conversations, totaling approximately 8 million messages with destination addresses in an invalid or empty format.
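As an illustration of the failure mode described above (this is a simplified sketch, not the platform's actual code), one error-log INSERT per failed message becomes a write bottleneck when millions of invalid-address messages fail at once; batching the inserts is one common way to relieve that contention:

```python
import sqlite3

def log_errors_batched(conn, failures, batch_size=1000):
    """Write error-log rows in batches instead of one INSERT per message.

    `failures` is an iterable of (message_id, reason) tuples. Batching
    reduces per-row transaction overhead, which is the kind of pressure
    that overwhelmed the error-log table when ~8 million messages with
    invalid or empty destination addresses failed simultaneously.
    """
    cur = conn.cursor()
    batch = []
    for row in failures:
        batch.append(row)
        if len(batch) >= batch_size:
            cur.executemany(
                "INSERT INTO error_log (message_id, reason) VALUES (?, ?)",
                batch)
            conn.commit()
            batch.clear()
    if batch:  # flush the final partial batch
        cur.executemany(
            "INSERT INTO error_log (message_id, reason) VALUES (?, ?)", batch)
        conn.commit()

# Demo: 10,000 simulated failures using the two reasons named in the report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE error_log (message_id INTEGER, reason TEXT)")
failures = ((i, "invalid recipient address" if i % 2 else "undefined processing route")
            for i in range(10_000))
log_errors_batched(conn, failures)
print(conn.execute("SELECT COUNT(*) FROM error_log").fetchone()[0])  # 10000
```

The table schema, function name, and batch size here are illustrative assumptions; the point is only that grouping writes amortizes per-transaction cost on the hot table.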
CORRECTIVE ACTIONS
- DevOps verified the volume of transactions pending processing in the orchestrator - 12:30 PM (GMT-4)
- DevOps verified transaction slowness and wait times in the database - 12:35 PM (GMT-4)
- The DANAconnect team was notified via Slack on the Incident channel - 12:45 PM (GMT-4)
- Injection of new activations that could be affected was paused - 12:50 PM (GMT-4)
- All clients were notified of the platform incident, and a notice was posted on https://status.danaconnect.com - 1:06 PM (GMT-4)
- A security snapshot of the database cluster was generated - 1:10 PM (GMT-4)
- Write access to the error log was determined to be the origin of the platform degradation - 1:56 PM (GMT-4)
- The conversations causing the massive generation of error logs were located - 2:51 PM (GMT-4)
- The conversations that initiated the incident were stopped to prevent further writes to the error log - 3:00 PM (GMT-4)
- The corrective action did not improve platform performance to the expected level - 3:10 PM (GMT-4)
- A general truncation/cleanup was initiated on the queue accumulating the messages that caused the incident - 3:30 PM (GMT-4)
- All API/WS and orchestrator services were restarted - 3:50 PM (GMT-4)
- The DevOps team began an internal testing cycle - 3:55 PM (GMT-4)
- Resolution of the incident was confirmed internally to the entire DANAconnect team via Slack (Incident channel) - about 4:00 PM (GMT-4)
Details about the solution/patch implemented to resolve the incident:
- The messages that caused the failure were cleaned up at the processing queue level.
- Test reports confirm that the implemented solution is functioning.
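The queue-level cleanup can be pictured with a small in-memory sketch (the orchestrator's real queue, message schema, and channel-specific validation are not shown here; field names and the email regex are illustrative assumptions):

```python
import re
from collections import deque

# Deliberately loose address check for illustration only; real validation
# would be channel-specific (email, SMS, push).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def purge_invalid(queue):
    """Drop queued messages whose destination address is empty or malformed,
    returning the number removed. This mirrors the kind of truncation/cleanup
    that restored processing during the incident."""
    kept, removed = deque(), 0
    while queue:
        msg = queue.popleft()
        addr = (msg.get("to") or "").strip()
        if addr and EMAIL_RE.match(addr):
            kept.append(msg)
        else:
            removed += 1
    queue.extend(kept)
    return removed

# Demo: one valid message, one empty address, one malformed address.
q = deque([{"to": "user@example.com"}, {"to": ""}, {"to": "not-an-address"}])
print(purge_invalid(q), len(q))  # 2 removed, 1 message left
```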
Database during transaction processing / incident errors - 12:30 PM (GMT-4). Note: the time on the graphs is AWS local time.
Database after queue cleanup: processing was fully restored - 4:00 PM (GMT-4). Note: the time on the graphs is AWS local time.
ACTIONS TO PREVENT THE INCIDENT FROM RECURRING
Specialized Monitoring:
- Configure a special dashboard and alerts for excessive access to the error log table. This will alert the DevOps team to analyze the need for preventive or corrective action.
- Monitoring dashboard for conversations generating write errors: this will quickly identify which conversations may be producing massive bursts of writes to the error log.
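One way such an alert rule could be expressed, assuming a metric of error-log writes per minute is available (the function name and thresholds below are placeholders, to be tuned against the platform's normal baseline, not values from the report):

```python
def check_error_log_rate(writes_per_minute, threshold=50_000):
    """Classify the error-log write rate for alerting.

    The threshold is an illustrative assumption; in practice it would be
    calibrated so that a sustained burst (like the ~8M invalid-address
    messages in this incident) trips "critical" well before the database
    write path saturates.
    """
    if writes_per_minute >= threshold:
        return "critical"        # page DevOps immediately
    if writes_per_minute >= threshold // 2:
        return "warning"         # investigate per-conversation write volume
    return "ok"

print(check_error_log_rate(1_200))    # ok
print(check_error_log_rate(30_000))   # warning
print(check_error_log_rate(90_000))   # critical
```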
Solution Changes:
- We are considering configuring a parameter that can be adjusted on-demand to disable writing to the error log.
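The parameter under consideration could behave like a runtime kill switch around the logging call. A minimal thread-safe sketch (class and method names are hypothetical, not the platform's API):

```python
import threading

class ErrorLogSwitch:
    """Runtime toggle that lets operators disable error-log writes on demand,
    so a burst of failed messages cannot stall the orchestrator again."""

    def __init__(self):
        self._enabled = True
        self._lock = threading.Lock()
        self.dropped = 0  # rows skipped while logging was disabled

    def set_enabled(self, value: bool):
        with self._lock:
            self._enabled = value

    def log(self, sink, message_id, reason):
        """Append an error row to `sink` unless the switch is off."""
        with self._lock:
            if not self._enabled:
                self.dropped += 1
                return False
        sink.append((message_id, reason))
        return True

switch, sink = ErrorLogSwitch(), []
switch.log(sink, 1, "invalid recipient address")
switch.set_enabled(False)          # operator flips the parameter mid-incident
switch.log(sink, 2, "undefined processing route")
print(len(sink), switch.dropped)   # 1 1
```

Counting dropped rows while the switch is off preserves visibility into how much error volume was suppressed during an incident.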
Documentation:
- This new type of case and its solution were added to the Incident Response Plan to enable a quick and effective response.