Incident Overview:
At 1:07 AM on April 17th, our team encountered a significant issue impacting our ability to process bulk conversations. The root cause was identified as a failure of the master database server, which disrupted normal operations, delayed query execution, and caused processor failures.
Logs related to the incident:
watchdog: BUG: soft lockup - CPU#31 stuck for 17964s! [scp:3499982]
[6791981.871733] Modules linked in: binfmt_misc raid0 xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay tls sunrpc nls_ascii nls_cp437 vfat fat ena ghash_clmulni_intel ptp aesni_intel i8042 serio pps_core crypto_simd cryptd button sch_fq_codel dm_mod fuse configfs loop dax dmi_sysfs crc32_pclmul crc32c_intel efivarfs
[6837474.767511] watchdog: BUG: soft lockup - CPU#32 stuck for 544s! [xtrabackup:3496469]
[6837474.798684] Modules linked in: binfmt_misc raid0 xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables nfnetlink br_netfilter bridge stp llc overlay tls sunrpc nls_ascii nls_cp437 vfat fat ena ghash_clmulni_intel ptp aesni_intel i8042 serio pps_core crypto_simd cryptd button sch_fq_codel dm_mod fuse configfs loop dax dmi_sysfs crc32_pclmul crc32c_intel efivarfs
[6837474.999592] CPU: 32 PID: 3496469 Comm: xtrabackup Tainted: G L 6.1.72-96.166.amzn2023.x86_64 #1
[6837475.048435] Hardware name: Amazon EC2 m6idn.12xlarge/, BIOS 1.0 10/16/2017
[6837475.082238] RIP: 0010:xas_descend+0x16/0x80
[6837475.103266] Code: 07 48 c1 e8 20 48 89 57 08 c3 cc cc cc cc cc cc cc cc cc cc 0f b6 0e 48 8b 47 08 48 d3 e8 48 89 c1 83 e1 3f 89 c8 48 83 c0 04 <48> 8b 44 c6 08 48 89 77 18 48 89 c2 83 e2 03 48 83 fa 02 74 0e 88
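The lines above follow the kernel's fixed soft-lockup format, so the affected CPU, stall duration, and offending process can be extracted mechanically. A minimal sketch (the sed expression and output field names are illustrative, not part of our tooling):

```shell
# Parse one watchdog soft-lockup line into cpu / stall / comm / pid fields.
line='[6837474.767511] watchdog: BUG: soft lockup - CPU#32 stuck for 544s! [xtrabackup:3496469]'
echo "$line" | sed -E 's/.*CPU#([0-9]+) stuck for ([0-9]+)s!.*\[([^:]+):([0-9]+)\]/cpu=\1 stall=\2s comm=\3 pid=\4/'
# prints: cpu=32 stall=544s comm=xtrabackup pid=3496469
```

On a live host, the same lines can usually be pulled from the kernel ring buffer with `journalctl -k | grep 'soft lockup'` or `dmesg | grep 'soft lockup'` before piping them into a filter like this.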
Impact: The incident was felt across multiple aspects of our operations: bulk conversation processing was delayed, query execution slowed, and downstream processors failed.
Timeline:
Resolution: Upon identifying the root cause, our team immediately mobilized to address the situation. We began troubleshooting to restore functionality to the master database and implemented temporary workarounds to minimize the impact on our processing capabilities.
Maintenance Window Scheduled:
To ensure that the failure is not repeated, an emergency maintenance window was scheduled for April 17th at 9:00 PM EDT to perform a full recovery of the master database server.
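The stuck process in the trace above was xtrabackup, which suggests a Percona XtraBackup backup was in flight; a full recovery from such a backup typically follows a prepare-then-copy-back sequence. A sketch of that sequence, assuming the backup lives at /data/backup and the MySQL datadir is /var/lib/mysql (both paths, and the use of XtraBackup for this recovery, are assumptions, not confirmed details of this incident):

```shell
# Stop the server before touching the datadir.
sudo systemctl stop mysqld

# Apply the redo log so the backup is transactionally consistent,
# then copy the prepared files into the (empty) datadir.
xtrabackup --prepare --target-dir=/data/backup
xtrabackup --copy-back --target-dir=/data/backup --datadir=/var/lib/mysql

# Restore ownership and bring the server back up.
sudo chown -R mysql:mysql /var/lib/mysql
sudo systemctl start mysqld
```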
Our team then carried out the recovery, bringing backup systems online and rerouting traffic to minimize disruption to our services. Database functionality was fully restored by 12:06 AM on April 18th.
Lessons Learned: This incident has provided us with valuable insights that will guide our future actions.
Next Steps: Moving forward, we are committed to implementing the necessary improvements to strengthen our infrastructure and processes, minimizing the likelihood of similar incidents occurring in the future. We will also conduct a thorough review of our incident response procedures to identify areas for refinement and enhancement.
We want to extend our sincere appreciation to everyone involved in responding to this incident; your dedication and expertise were instrumental in minimizing the impact on our operations and restoring functionality. If you have any further questions or concerns regarding this incident or our response, please don't hesitate to reach out.
Thank you for your understanding and continued support as we work together to ensure the reliability and resilience of our systems.