We are currently experiencing an issue processing bulk conversations
Incident Report for DANAConnect
Postmortem

Incident Overview:

At 1:07 AM on April 17th, our team encountered a significant issue that impacted our ability to process bulk conversations efficiently. The root cause was identified as a failure in the master database, which disrupted normal operations and led to delays in query execution and processor failures.

Logs related to the incident:

watchdog: BUG: soft lockup - CPU#31 stuck for 17964s! [scp:3499982]
[6791981.871733] Modules linked in: binfmt_misc raid0 xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay tls sunrpc nls_ascii nls_cp437 vfat fat ena ghash_clmulni_intel ptp aesni_intel i8042 serio pps_core crypto_simd cryptd button sch_fq_codel dm_mod fuse configfs loop dax dmi_sysfs crc32_pclmul crc32c_intel efivarfs
[6837474.767511] watchdog: BUG: soft lockup - CPU#32 stuck for 544s! [xtrabackup:3496469]
[6837474.798684] Modules linked in: binfmt_misc raid0 xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay tls sunrpc nls_ascii nls_cp437 vfat fat ena ghash_clmulni_intel ptp aesni_intel i8042 serio pps_core crypto_simd cryptd button sch_fq_codel dm_mod fuse configfs loop dax dmi_sysfs crc32_pclmul crc32c_intel efivarfs
[6837474.999592] CPU: 32 PID: 3496469 Comm: xtrabackup Tainted: G             L     6.1.72-96.166.amzn2023.x86_64 #1
[6837475.048435] Hardware name: Amazon EC2 m6idn.12xlarge/, BIOS 1.0 10/16/2017
[6837475.082238] RIP: 0010:xas_descend+0x16/0x80
[6837475.103266] Code: 07 48 c1 e8 20 48 89 57 08 c3 cc cc cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 <48> 8b 44 c6 08 48 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0e 88
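
The kernel messages above show CPU soft lockups on the database host while an scp transfer and an xtrabackup run were in progress. As a minimal illustration, and not a description of our actual tooling, the following Python sketch polls the kernel ring buffer for soft-lockup warnings so that this class of failure can be surfaced before it cascades into query delays. It assumes a Linux host where dmesg is readable by the monitoring user; the print call stands in for whatever alerting channel is used.

#!/usr/bin/env python3
"""Minimal sketch: poll the kernel ring buffer for soft-lockup warnings.

Assumes a Linux host where `dmesg` is readable by this process.
The print call stands in for a real alerting channel.
"""
import re
import subprocess
import time

# Matches lines such as:
#   watchdog: BUG: soft lockup - CPU#32 stuck for 544s! [xtrabackup:3496469]
PATTERN = re.compile(r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+?)\]")
POLL_SECONDS = 60


def read_kernel_log() -> str:
    # `dmesg` prints the kernel ring buffer; it may require elevated privileges.
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=False)
    return result.stdout


def main() -> None:
    seen = set()
    while True:
        for line in read_kernel_log().splitlines():
            match = PATTERN.search(line)
            if match and line not in seen:
                seen.add(line)
                cpu, seconds, task = match.groups()
                # Replace this print with a webhook, pager, or e-mail notification.
                print(f"ALERT: CPU#{cpu} soft lockup for {seconds}s in task [{task}]")
        time.sleep(POLL_SECONDS)


if __name__ == "__main__":
    main()

A check like this would run on a short interval on the database host itself, alongside external checks, since a fully locked-up host may stop reporting altogether.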

Impact: This incident affected multiple aspects of our operations:

  • Delays in query execution: The failure in the master database resulted in significant delays in executing queries, slowing down our overall processing speed.
  • Processor failures: Due to the disruption in the database, several processors failed to function as expected, further exacerbating the processing delays.

Timeline:

  • April 17th, 1:07 AM: Incident detected, delays in query execution and failures observed.
  • April 17th, 1:20 AM: Immediate investigation initiated to identify the root cause.
  • April 17th, 2:00 AM: Root cause identified as a failure in the master database.
  • April 17th, 2:30 AM: Recovery efforts initiated to restore normal database functionality.
  • April 17th, 8:42 AM: Database functionality restored, bulk conversations processing resumed.

Resolution: Upon identifying the root cause, our team immediately mobilized to address the situation. We initiated a series of troubleshooting steps to restore functionality to the master database and implemented temporary workarounds to minimize the impact on our processing capabilities.

Maintenance Window Scheduled:

To ensure that the failure is not repeated, an emergency maintenance window was scheduled for April 17th at 9:00 PM EDT to perform a full recovery of the Master Database Server.

During this maintenance window, recovery efforts to restore normal database functionality continued; this involved implementing backup systems and rerouting traffic to ensure minimal disruption to our services. The full recovery was completed by April 18th at 12:06 AM.

Lessons Learned: This incident has provided us with valuable insights and lessons that will guide our future actions:

  1. Database Monitoring and Redundancy: We recognize the need to enhance our database monitoring systems to detect issues proactively and implement redundancy measures to ensure continuity of operations in the event of a failure (an illustrative health-check sketch follows this list).
  2. Communication Protocols: Clear and timely communication is essential during incidents to keep all stakeholders informed about the situation, the steps being taken to address it, and the expected timelines for resolution.
  3. Resilience Testing: Regular testing of our systems' resilience and failover mechanisms will help us identify potential weaknesses and ensure that we are adequately prepared to handle similar incidents in the future.
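
To make the first lesson concrete, the Python sketch below illustrates one possible proactive health check against a MySQL-compatible master: it opens a connection with a short timeout, times a trivial query, and flags slow responses or connection failures. This is a hedged example rather than our production monitoring; the host, credentials, threshold, and the use of the PyMySQL client library are all assumptions for illustration.

#!/usr/bin/env python3
"""Minimal sketch of a proactive master-database health check.

Assumes a MySQL-compatible master and the PyMySQL client library.
Host, credentials, and threshold below are hypothetical placeholders.
"""
import time

import pymysql

DB_HOST = "master-db.example.internal"   # hypothetical host
DB_USER = "healthcheck"                  # hypothetical user
DB_PASSWORD = "change-me"                # hypothetical credential
LATENCY_THRESHOLD_SECONDS = 2.0          # alert if a trivial query is slower than this


def check_master() -> None:
    start = time.monotonic()
    try:
        connection = pymysql.connect(
            host=DB_HOST,
            user=DB_USER,
            password=DB_PASSWORD,
            connect_timeout=5,
        )
        try:
            with connection.cursor() as cursor:
                cursor.execute("SELECT 1")
                cursor.fetchone()
        finally:
            connection.close()
    except pymysql.MySQLError as error:
        # A failed connection or query is exactly the condition to catch early.
        print(f"ALERT: master database check failed: {error}")
        return

    elapsed = time.monotonic() - start
    if elapsed > LATENCY_THRESHOLD_SECONDS:
        print(f"ALERT: master database responded in {elapsed:.2f}s (threshold {LATENCY_THRESHOLD_SECONDS}s)")
    else:
        print(f"OK: master database responded in {elapsed:.2f}s")


if __name__ == "__main__":
    check_master()

Running such a check on a frequent schedule from a host other than the database server helps ensure the check itself keeps reporting even if the database host locks up.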

Next Steps: Moving forward, we are committed to implementing the necessary improvements to strengthen our infrastructure and processes, minimizing the likelihood of similar incidents occurring in the future. We will also conduct a thorough review of our incident response procedures to identify areas for refinement and enhancement.

We want to extend our sincere appreciation to everyone who was involved in responding to this incident; your dedication and expertise were instrumental in minimizing the impact on our operations and restoring functionality. If you have any further questions or concerns regarding this incident or our response efforts, please don't hesitate to reach out.

Thank you for your understanding and continued support as we work together to ensure the reliability and resilience of our systems.

Posted Apr 18, 2024 - 09:53 EDT

Resolved
This incident has been resolved.
Posted Apr 17, 2024 - 13:00 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 17, 2024 - 09:33 EDT
Investigating
We are currently investigating this issue.
Posted Apr 17, 2024 - 06:49 EDT
This incident affected: ETL, Audit & Reports (Audit Reports & Delivery Logs) and Outbound Dispatchers (Director - Workflow Orchestration).