Download the Spanish version of this report: https://drive.google.com/file/d/1IiEQArry8AjBU4M-g6TPHQ95j-iK7EXo/view?usp=sharing
Incident Report
Incident ID: CYMSLP-126-241105
Date: 11/05/2024
INCIDENT DESCRIPTION
Incident reported by: DevOps
Date and time of the incident: 11/05/2024 - 12:30 PM (GMT-4)
Time elapsed from detection by the DevOps team to resolution: 12:30 PM - 4:00 PM (GMT-4), 3 hours 30 minutes
Details about the services affected by the incident:
- Severe degradation in communication delivery speed. Communications queued for prolonged periods across all channels, including email, SMS, and push notifications.
- Both mass campaigns and transactional notifications were affected.
- A group of platform users reported that communication processing was halted.
Severity of the incident (Critical, High, Medium, Low): Critical. Duration: 3 hours 30 minutes from the initial incident report to resolution.
Frequency of this type of incident: Very low. First occurrence in 15 years.
CAUSE OF THE INCIDENT
Details about the vulnerability that caused the incident: A very high number of write accesses to the tables managing the error logs of the communication orchestrator was detected.
An error-log write occurs whenever a message cannot be processed. The most common reasons are an invalid recipient address or an undefined processing route.
Although logging errors is a normal process within the platform, the massive write volume caused a bottleneck on the error-log table, halting message processing.
Example of the query involved:
Degradation of insertion times in the error log:
It was determined that the failure originated from the simultaneous activation of 8 conversations, totaling approximately 8 million messages with destination addresses in an invalid or empty format.
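As an illustration of the failure mode described above (this is a simplified sketch, not the platform's actual code), one error-log INSERT per failed message becomes a write bottleneck when millions of invalid-address messages fail at once; batching the inserts is one common way to relieve that contention:

```python
import sqlite3

def log_errors_batched(conn, failures, batch_size=1000):
    """Write error-log rows in batches instead of one INSERT per message.

    `failures` is an iterable of (message_id, reason) tuples. Batching
    reduces per-row transaction overhead, which is the kind of pressure
    that overwhelmed the error-log table when ~8 million messages with
    invalid or empty destination addresses failed simultaneously.
    """
    cur = conn.cursor()
    batch = []
    for row in failures:
        batch.append(row)
        if len(batch) >= batch_size:
            cur.executemany(
                "INSERT INTO error_log (message_id, reason) VALUES (?, ?)",
                batch)
            conn.commit()
            batch.clear()
    if batch:  # flush the final partial batch
        cur.executemany(
            "INSERT INTO error_log (message_id, reason) VALUES (?, ?)", batch)
        conn.commit()

# Demo: 10,000 simulated failures using the two reasons named in the report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE error_log (message_id INTEGER, reason TEXT)")
failures = ((i, "invalid recipient address" if i % 2 else "undefined processing route")
            for i in range(10_000))
log_errors_batched(conn, failures)
print(conn.execute("SELECT COUNT(*) FROM error_log").fetchone()[0])  # 10000
```

The table schema, function name, and batch size here are illustrative assumptions; the point is only that grouping writes amortizes per-transaction cost on the hot table.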
CORRECTIVE ACTIONS
- DevOps verified the volume of transactions pending processing in the orchestrator - 12:30 PM (GMT-4)
- DevOps verified transaction slowness and wait times in the database - 12:35 PM (GMT-4)
- The DANAconnect team was notified via Slack on the Incident channel - 12:45 PM (GMT-4)
- Injection of new activations that could be affected was paused - 12:50 PM (GMT-4)
- All clients were notified of the platform incident, and a notice was posted on https://status.danaconnect.com - 1:06 PM (GMT-4)
- A security snapshot of the database cluster was generated - 1:10 PM (GMT-4)
- Write access to the error log was determined to be the origin of the platform degradation - 1:56 PM (GMT-4)
- The conversations causing the massive generation of error logs were located - 2:51 PM (GMT-4)
- The conversations that initiated the incident were stopped to prevent further writes to the error log - 3:00 PM (GMT-4)
- The corrective action did not improve platform performance to the expected level - 3:10 PM (GMT-4)
- A general truncation/cleanup was initiated on the queue accumulating the messages that caused the incident - 3:30 PM (GMT-4)
- All API/WS and orchestrator services were restarted - 3:50 PM (GMT-4)
- The DevOps team began an internal testing cycle - 3:55 PM (GMT-4)
- Resolution of the incident was confirmed internally to the entire DANAconnect team via Slack (Incident channel) - about 4:00 PM (GMT-4)
Details about the solution/patch implemented to resolve the incident:
- The messages that caused the failure were cleaned up at the processing queue level.
- Test reports confirm that the implemented solution is functioning.
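The queue-level cleanup can be pictured with a small in-memory sketch (the orchestrator's real queue, message schema, and channel-specific validation are not shown here; field names and the email regex are illustrative assumptions):

```python
import re
from collections import deque

# Deliberately loose address check for illustration only; real validation
# would be channel-specific (email, SMS, push).
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def purge_invalid(queue):
    """Drop queued messages whose destination address is empty or malformed,
    returning the number removed. This mirrors the kind of truncation/cleanup
    that restored processing during the incident."""
    kept, removed = deque(), 0
    while queue:
        msg = queue.popleft()
        addr = (msg.get("to") or "").strip()
        if addr and EMAIL_RE.match(addr):
            kept.append(msg)
        else:
            removed += 1
    queue.extend(kept)
    return removed

# Demo: one valid message, one empty address, one malformed address.
q = deque([{"to": "user@example.com"}, {"to": ""}, {"to": "not-an-address"}])
print(purge_invalid(q), len(q))  # 2 removed, 1 message left
```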
Database during transaction processing / incident errors - 12:30 PM (GMT-4). Note: the time on the graphs is AWS local time.
Database after queue cleanup: processing was fully restored - 4:00 PM (GMT-4). Note: the time on the graphs is AWS local time.
ACTIONS TO PREVENT THE INCIDENT FROM RECURRING
Specialized Monitoring:
- Configure a special dashboard and alerts for excessive access to the error log table. This will alert the DevOps team to analyze the need for preventive or corrective action.
- Monitoring dashboard for conversations generating write errors: this will quickly identify which conversations may be producing massive bursts of writes to the error log.
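One way such an alert rule could be expressed, assuming a metric of error-log writes per minute is available (the function name and thresholds below are placeholders, to be tuned against the platform's normal baseline, not values from the report):

```python
def check_error_log_rate(writes_per_minute, threshold=50_000):
    """Classify the error-log write rate for alerting.

    The threshold is an illustrative assumption; in practice it would be
    calibrated so that a sustained burst (like the ~8M invalid-address
    messages in this incident) trips "critical" well before the database
    write path saturates.
    """
    if writes_per_minute >= threshold:
        return "critical"        # page DevOps immediately
    if writes_per_minute >= threshold // 2:
        return "warning"         # investigate per-conversation write volume
    return "ok"

print(check_error_log_rate(1_200))    # ok
print(check_error_log_rate(30_000))   # warning
print(check_error_log_rate(90_000))   # critical
```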
Solution Changes:
- We are considering configuring a parameter that can be adjusted on-demand to disable writing to the error log.
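The parameter under consideration could behave like a runtime kill switch around the logging call. A minimal thread-safe sketch (class and method names are hypothetical, not the platform's API):

```python
import threading

class ErrorLogSwitch:
    """Runtime toggle that lets operators disable error-log writes on demand,
    so a burst of failed messages cannot stall the orchestrator again."""

    def __init__(self):
        self._enabled = True
        self._lock = threading.Lock()
        self.dropped = 0  # rows skipped while logging was disabled

    def set_enabled(self, value: bool):
        with self._lock:
            self._enabled = value

    def log(self, sink, message_id, reason):
        """Append an error row to `sink` unless the switch is off."""
        with self._lock:
            if not self._enabled:
                self.dropped += 1
                return False
        sink.append((message_id, reason))
        return True

switch, sink = ErrorLogSwitch(), []
switch.log(sink, 1, "invalid recipient address")
switch.set_enabled(False)          # operator flips the parameter mid-incident
switch.log(sink, 2, "undefined processing route")
print(len(sink), switch.dropped)   # 1 1
```

Counting dropped rows while the switch is off preserves visibility into how much error volume was suppressed during an incident.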
Documentation:
- This new type of case and its solution were added to the Incident Response Plan to enable a quick and effective response.