MariaDB Server / MDEV-34834

cluster hang because of conflict between DDL from wsrep and DML executed from trigger


Details

    • Type: Bug
    • Status: Needs Feedback
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.6.18
    • Fix Version/s: 10.6
    • Component/s: Galera
    • Labels: None

    Description

      After 10 years of successfully operating the script, a customer upgraded to a new version and encountered a failure in the following scenario:
      ------------------------
      We have an internal tool named live_alter. In short, it is used to perform live DDLs without blocking the database nodes and clusters. Its working scenario is very close to that of the Percona tool with the same purpose.

      The new table structure is copied from the original one, with the suffix _LIVE_ALTER appended to the name.
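      Using the table names that appear in the log excerpt further down, this step amounts to something like the following (a sketch; the actual tool may build the copy differently):

      ```sql
      -- Create a shadow copy of the original table's structure; the
      -- desired schema changes are then applied to this copy.
      CREATE TABLE game_sessions_LIVE_ALTER LIKE game_sessions;
      ```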

      Usually, the necessary DDLs are executed on the _LIVE_ALTER table upfront, before the following steps. This time I missed one index and executed it while the live_alter process was fully running, as described below.

      Triggers for INSERT, UPDATE and DELETE DMLs are attached to the original table to replicate those queries from the original table to the new one.
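      A minimal sketch of one such trigger; the column names are illustrative, inferred from the log excerpt below, and the real tool's triggers may differ:

      ```sql
      -- Hypothetical example: mirror INSERTs on the original table into
      -- the shadow _LIVE_ALTER copy while the migration runs.
      DELIMITER //
      CREATE TRIGGER game_sessions_la_ins AFTER INSERT ON game_sessions
      FOR EACH ROW
      BEGIN
        REPLACE INTO game_sessions_LIVE_ALTER (recno, create_ts, expire_ts)
        VALUES (NEW.recno, NEW.create_ts, NEW.expire_ts);
      END //
      DELIMITER ;
      ```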

      Then the script starts moving groups of rows from the original table to the _LIVE_ALTER table with INSERT ... SELECT statements.

      The difference this time was that I ran the CREATE INDEX statement on the new _LIVE_ALTER table while the whole migration process was running: the triggers were applying fresh DMLs to the table and the script was moving old rows into it.
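      The two concurrent operations can be sketched as follows; the batch boundaries are hypothetical, while the CREATE INDEX statement is taken verbatim from the error log below:

      ```sql
      -- Background copy loop run by the script (batch range illustrative):
      INSERT INTO game_sessions_LIVE_ALTER
      SELECT * FROM game_sessions
      WHERE recno BETWEEN 1 AND 10000;
      -- ... repeated for subsequent ranges while the triggers keep
      -- applying fresh DMLs to the shadow table ...

      -- Meanwhile, the forgotten index was created on the shadow table:
      CREATE INDEX `create_ts` ON game_sessions_LIVE_ALTER (`create_ts`);
      ```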

      I believe this is the reason 2 of the 3 cluster nodes hung. Only the node on which the CREATE INDEX was executed remained alive.
      ------------------------
      Excerpt from the log:

      2024-08-06 13:34:56 9 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)a4c20bad 78e73f51: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414095382 trx_id: 83290372165 tstamp: 4097012309090758; state:  seqnos (l: 1850571743, g: 52074153223, s: 52074153220, d: 52074153132) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097012308867404; state:  seqnos (l: 1850571742, g: 52074153222, s: 52074153220, d: 52074153221) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490
      2024-08-06 13:34:56 18 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)a4c20bad 78e73f51: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414095369 trx_id: 83290371134 tstamp: 4097012309316348; state:  seqnos (l: 1850571744, g: 52074153224, s: 52074153220, d: 52074153132) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097012308867404; state:  seqnos (l: 1850571742, g: 52074153222, s: 52074153220, d: 52074153221) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490
      2024-08-06 13:34:56 36 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)a4c20bad 78e73f51: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414095390 trx_id: 83290372799 tstamp: 4097012310113431; state:  seqnos (l: 1850571746, g: 52074153226, s: 52074153220, d: 52074153132) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097012308867404; state:  seqnos (l: 1850571742, g: 52074153222, s: 52074153220, d: 52074153221) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490
      2024-08-06 13:34:56 33 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)a4c20bad 78e73f51: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414095370 trx_id: 83290371214 tstamp: 4097012310890468; state:  seqnos (l: 1850571748, g: 52074153227, s: 52074153221, d: 52074153179) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097012308867404; state:  seqnos (l: 1850571742, g: 52074153222, s: 52074153220, d: 52074153221) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490
      2024-08-06 13:34:56 28 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)a4c20bad 78e73f51: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414095372 trx_id: 83290371351 tstamp: 4097012311479950; state:  seqnos (l: 1850571749, g: 52074153228, s: 52074153221, d: 52074153179) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097012308867404; state:  seqnos (l: 1850571742, g: 52074153222, s: 52074153220, d: 52074153221) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490
      2024-08-06 13:34:56 25 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)a4c20bad 78e73f51: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414095373 trx_id: 83290371492 tstamp: 4097012314550639; state:  seqnos (l: 1850571751, g: 52074153230, s: 52074153221, d: 52074153179) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097012308867404; state:  seqnos (l: 1850571742, g: 52074153222, s: 52074153220, d: 52074153221) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490->COMMITTING:1301
      2024-08-06 13:34:56 57 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)a4c20bad 78e73f51: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414095381 trx_id: 83290371683 tstamp: 4097012322639331; state:  seqnos (l: 1850571754, g: 52074153233, s: 52074153221, d: 52074153179) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097012308867404; state:  seqnos (l: 1850571742, g: 52074153222, s: 52074153220, d: 52074153221) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490->COMMITTING:1301
      2024-08-06 13:38:36 27 [Note] WSREP: SH-EX trx conflict for key (0,FLAT8)244902f1 d9c0c8dd: source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 65 conn_id: 1414210759 trx_id: 83297369612 tstamp: 4097231736502658; state:  seqnos (l: 1850696001, g: 52074267978, s: 52074267975, d: 52074267951) WS pa_range: 65535; state history: REPLICATING:0->CERTIFYING:3224 <---> source: 59f93a4e-2e9d-11ef-ab8a-8a4efc00aac2 version: 5 local: 0 flags: 69 conn_id: 1409957861 trx_id: -1 tstamp: 4097231734057415; state:  seqnos (l: 1850695999, g: 52074267976, s: 52074267975, d: 52074267975) WS pa_range: 1; state history: REPLICATING:0->CERTIFYING:3224->APPLYING:490->COMMITTING:1301
      2024-08-06 13:38:36 65 [Note] WSREP: MDL BF-BF conflict
      schema:  nl_game_providers
      request: (65    seqno 52074267979       wsrep (high priority, exec, executing) cmd 0 161        UPDATE nl_game_providers.game_sessions SET expire_ts = ( UNIX_TIMESTAMP() + '5400' ) WHERE recno = '1923530980',ý±f^SÇ`^A)
      granted: (59    seqno 52074267976       wsrep (toi, exec, committed) cmd 0 2    create index `create_ts` ON game_sessions_LIVE_ALTER(`create_ts`))
      2024-08-06 13:38:36 65 [ERROR] Aborting
      2024-08-06 13:48:41 0 [ERROR] [FATAL] InnoDB: innodb_fatal_semaphore_wait_threshold was exceeded for dict_sys.latch. Please refer to https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/
      240806 13:48:41 [ERROR] mysqld got signal 6 ;
      Sorry, we probably made a mistake, and this is a bug.
       
      Your assistance in bug reporting will enable us to fix this for the next release.
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.6.18-MariaDB-log source revision: 887bb3f73555ff8a50138a580ca8308b9b5c069c
      key_buffer_size=5242880
      read_buffer_size=131072
      max_used_connections=1285
      max_threads=65537
      thread_count=981
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 75555347 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      0x133c60c <my_print_stacktrace+0x3c> at /usr/local/libexec/mariadbd
      0xccb3cf <handle_fatal_signal+0x27f> at /usr/local/libexec/mariadbd
      0x828b4d4af <pthread_sigmask+0x53f> at /lib/libthr.so.3
      0x828b4ca6b <pthread_setschedparam+0x83b> at /lib/libthr.so.3
      0x7ffffffff2d3 <???> at ???
      0x82d04c41a <__sys_thr_kill+0xa> at /lib/libc.so.7
      0x82cfc5e64 <__raise+0x34> at /lib/libc.so.7
      0x82d0766f9 <abort+0x49> at /lib/libc.so.7
      0x12dcf50 <wsrep_thd_is_local_transaction+0x1d0300> at /usr/local/libexec/mariadbd
      0x12b6ee5 <wsrep_thd_is_local_transaction+0x1aa295> at /usr/local/libexec/mariadbd
      0x12dfb8d <_ZN5tpool19thread_pool_generic13timer_generic3runEv+0x3d> at /usr/local/libexec/mariadbd
      0x12e0497 <_ZN5tpool4task7executeEv+0x27> at /usr/local/libexec/mariadbd
      0x12de0d6 <_ZN5tpool19thread_pool_generic11worker_mainEPNS_11worker_dataE+0x76> at /usr/local/libexec/mariadbd
      0x12dfc66 <_ZN5tpool19thread_pool_generic13timer_generic3runEv+0x116> at /usr/local/libexec/mariadbd
      The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/ contains
      information that should help you find out what is causing the crash.
      Core pattern: %N.core

      Attachments

        1. gdb_trace.log
          111 kB
        2. gdb_aggr_201774.txt
          9 kB
        3. gdb_aggr_201774.svg
          50 kB
        4. db5_send.tar
          123 kB


          People

            Assignee: janlindstrom Jan Lindström
            Reporter: sysprg Julius Goryavsky
