
Incident Report for DANAConnect

Postmortem

Download the Spanish-language version of this report: https://drive.google.com/file/d/1IiEQArry8AjBU4M-g6TPHQ95j-iK7EXo/view?usp=sharing

Incident Report

Incident ID: CYMSLP-126-241105

Date: 11/05/2024

INCIDENT DESCRIPTION

Incident reported by: DevOps

Date and time of the incident: 11/05/2024 - 12:30 PM (GMT-4)

Time elapsed from detection by the DevOps team to resolution: 12:30 PM - 4:00 PM (GMT-4), 3 hours 30 minutes.

Details about the services affected by the incident:

  • High degradation in communication delivery speed. There was prolonged queuing in the delivery of communications across various channels, including email, SMS, and push notifications.
  • Both mass campaigns and transactional notifications were affected.
  • A group of platform users reported that communication processing was halted.

Severity of the incident (Critical, High, Medium, Low): Critical. Duration: 3 hours 30 minutes from the initial incident report.

Frequency of this type of incident: Very low; first occurrence in 15 years.

CAUSE OF THE INCIDENT

Details about the vulnerability that caused the incident: An abnormally high number of write accesses was detected on the tables that manage the error logs of the communication orchestrator.

An error-log write occurs when a message cannot be processed for some reason; the most common reasons are an invalid recipient address or an undefined processing route.

Although logging errors is a normal process within the platform, the massive write volume caused a bottleneck on the table, halting message processing.

Example of the query involved (screenshot omitted):

Degradation of insertion times in the error log (chart omitted):

It was determined that the failure originated from the simultaneous activation of 8 conversations, totaling approximately 8 million messages with destination addresses in an invalid or empty format.
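One way to keep a batch like this from ever reaching the orchestrator's error log is to validate destination addresses at activation time and set aside the invalid ones. The sketch below is illustrative only; the function name and the email pattern are assumptions, not part of the DANAconnect platform:

```python
import re

# Hypothetical pre-activation validator. The regex is a deliberately
# simple email shape check, not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def partition_recipients(addresses):
    """Split a recipient list into deliverable and invalid addresses,
    so invalid or empty ones never reach the orchestrator (and therefore
    never trigger error-log writes)."""
    valid, invalid = [], []
    for addr in addresses:
        # Empty strings and None are treated as invalid up front.
        if addr and EMAIL_RE.match(addr.strip()):
            valid.append(addr.strip())
        else:
            invalid.append(addr)
    return valid, invalid
```

With a gate like this, the 8 million malformed addresses would have been rejected in bulk at activation instead of generating 8 million error-log inserts.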

CORRECTIVE ACTIONS

  1. Volume of transactions to process within the orchestrator verified by DevOps 12:30 PM (GMT-4)
  2. Transaction slowness and wait times in the Database verified by DevOps 12:35 PM (GMT-4)
  3. DANAconnect team was notified via Slack on the Incident channel 12:45 PM (GMT-4)
  4. The injection of new activations that could be affected was paused 12:50 PM (GMT-4)
  5. All clients were notified about the platform incident and posted on https://status.danaconnect.com 1:06 PM (GMT-4)
  6. A Security Snapshot of the Database Cluster was generated 1:10 PM (GMT-4)
  7. It was determined that the origin of the platform degradation was the write access to the error log 1:56 PM (GMT-4)
  8. The conversations causing the massive generation of error logs were located 2:51 PM (GMT-4)
  9. The conversations that initiated the incident were stopped to prevent further writing to the error log 3:00 PM (GMT-4)
  10. The corrective action did not improve platform performance to the expected level 3:10 PM (GMT-4)
  11. A general truncation/cleanup was initiated on the queue accumulating the messages that caused the incident 3:30 PM (GMT-4)
  12. All API/WS and Orchestrator services were restarted 3:50 PM (GMT-4)
  13. An internal testing cycle by the DevOps team was initiated 3:55 PM (GMT-4)
  14. The resolution of the incident was confirmed to the entire DANAconnect team internally via Slack - Incident Channel about 4:00 PM (GMT-4)

Details about the solution/patch implemented to resolve the incident:

  • Cleaned up the messages that caused the failure at the processing queue level.
  • Test reports clearly indicate that the implemented solution is functioning.
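The queue-level cleanup described above amounts to filtering out every message that belongs to a faulty conversation. This is a minimal in-memory sketch, assuming each queued message carries a conversation ID; the real orchestrator queue is a database-backed system, so the names and structure here are illustrative:

```python
from collections import deque

def purge_conversations(queue, bad_conversation_ids):
    """Drop every queued message belonging to a faulty conversation,
    returning how many messages were removed."""
    kept = deque(
        m for m in queue
        if m["conversation_id"] not in bad_conversation_ids
    )
    removed = len(queue) - len(kept)
    # Rebuild the queue in place with only the healthy messages.
    queue.clear()
    queue.extend(kept)
    return removed
```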

Database during transaction processing / incident errors, 12:30 PM (GMT-4). Note: the time on the graphs is AWS local time. (chart omitted)

Database after queue cleanup: processing was fully restored, 4:00 PM (GMT-4). Note: the time on the graphs is AWS local time. (chart omitted)

ACTIONS TO PREVENT THE INCIDENT FROM RECURRING

Specialized Monitoring:

  • Configuring a special dashboard and alerts for excessive access to the error log table. This will alert the DevOps team to analyze the need for preventive or corrective action.
  • Monitoring dashboard for conversations writing errors: This will quickly identify which conversations may be generating massive bursts of writes to the error log.
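The alerting described above amounts to watching the error-log write rate over a sliding time window and firing when it exceeds a threshold. A minimal sketch, with class name, threshold, and window values that are assumptions rather than production figures:

```python
import time
from collections import deque

class ErrorLogWriteAlert:
    """Sliding-window counter for error-log writes: fires when more
    than max_writes occur within window_seconds."""

    def __init__(self, max_writes, window_seconds):
        self.max_writes = max_writes
        self.window = window_seconds
        self.events = deque()  # timestamps of recent writes

    def record_write(self, now=None):
        """Record one error-log write; return True when the rate
        within the window exceeds the alert threshold."""
        now = time.monotonic() if now is None else now
        self.events.append(now)
        # Evict timestamps that have fallen out of the window.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events) > self.max_writes
```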

Solution Changes:

  • We are considering configuring a parameter that can be adjusted on-demand to disable writing to the error log.
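Such a parameter would act as a kill switch on the error-logging path: during an incident, DevOps could disable the writes and shed load on the database instead of queuing behind it. A hypothetical sketch, with all names assumed:

```python
class ErrorLogger:
    """Error-log writer guarded by an on-demand enable/disable flag."""

    def __init__(self):
        self.enabled = True
        self.dropped = 0  # entries skipped while logging is disabled

    def log_error(self, sink, entry):
        """Write to the error-log sink unless logging is disabled,
        in which case count the drop instead of touching the database."""
        if self.enabled:
            sink.append(entry)
            return True
        self.dropped += 1
        return False
```

Counting the dropped entries preserves visibility into how much error traffic occurred while the switch was off.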

Documentation:

  • Added this new type of case and its solution to the Incident Response Plan to provide a quick and effective response.
Posted Nov 06, 2024 - 11:03 EST

Resolved

This incident has been resolved.
Posted Nov 05, 2024 - 15:16 EST

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 05, 2024 - 15:03 EST

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 05, 2024 - 14:30 EST

Investigating

We are currently investigating this issue.
Posted Nov 05, 2024 - 13:06 EST
This incident affected: Outbound Dispatchers (Email platform, Director - Workflow Orchestration, Webhooks (API Request Node)) and API (Do Not Contact List API, One Time Password API, Conversation API, SMTP Service, Bulk Contacts Load API).