MariaDB Server / MDEV-23528

Galera: simultaneous DDL and DML lead to entire cluster crash

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.4.12
    • Fix Version/s: 10.4
    • Component/s: Galera, Server
    • Labels: None

    Description

      Scenario:

      • a large number of tables per schema and a large number of foreign keys (essentially every table references at least one other table, often several)
      • normal user traffic flowing in (mainly updates to a session table, which nevertheless lock rows in more than 500 tables because of FK constraints). At the same time, a thread performs a series of schema changes, altering and dropping tables
      • a 3-node Galera cluster, 10.4.12 Enterprise, with all traffic going to node 1 (no read/write split)
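
      The shape of the workload above can be sketched as a small statement generator. This is only an illustration of the scenario, not the reporter's schema and not a confirmed reproducer: the table names (t0, t1, ...), the column names, the table count, and the helper functions are all hypothetical.

```python
# Hypothetical sketch of the workload shape described in the scenario.
# Table names, columns and counts are illustrative assumptions; nothing
# here is taken from the affected system, and it is not known to trigger
# the bug.

def build_schema(n_tables=5, schema="someschema"):
    """Return CREATE TABLE statements; each table references the previous one."""
    statements = []
    for i in range(n_tables):
        columns = ["id INT PRIMARY KEY"]
        if i > 0:
            # Chain the tables together with a foreign key.
            columns.append("parent_id INT")
            columns.append(
                f"FOREIGN KEY (parent_id) REFERENCES {schema}.t{i - 1}(id)"
            )
        statements.append(
            f"CREATE TABLE {schema}.t{i} ({', '.join(columns)}) ENGINE=InnoDB"
        )
    return statements


def dml_statement(n_tables=5, schema="someschema"):
    """An update on the last table in the chain (stand-in for user traffic)."""
    return f"UPDATE {schema}.t{n_tables - 1} SET parent_id = parent_id WHERE id = 1"


def ddl_statements(n_tables=5, schema="someschema"):
    """DROP statements, child tables first so FK constraints do not block them."""
    return [f"DROP TABLE {schema}.t{i}" for i in reversed(range(n_tables))]


if __name__ == "__main__":
    for stmt in build_schema() + [dml_statement()] + ddl_statements():
        print(stmt + ";")
```

      In a test, the DML statement would be run in a loop from one connection while another connection runs the DDL statements, with both connections pointed at node 1.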

      Suddenly, node 1 errors out because it cannot find a row to update. Note that the error appears to come from the applier thread, even though we only write to this node:

      2020-08-20 15:55:55 1 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 12 seqno: 4957340)
      2020-08-20 15:55:55 1 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
      2020-08-20 15:55:55 1 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
      

      The node then asserts and shuts down:

      mysqld: /__w/1/s/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/wsrep-lib/src/transaction.cpp:632: int wsrep::transaction::before_rollback(): Assertion `state() == s_executing || state() == s_preparing || state() == s_prepared || state() == s_must_abort || state() == s_aborting || state() == s_cert_failed || state() == s_must_replay' failed.
      

      At the same time, almost identical errors are logged on nodes 2 and 3:

      Node 2:

      2020-08-20 15:55:55 17 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 4957340)
      2020-08-20 15:55:55 17 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
      2020-08-20 15:55:55 17 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
      2020-08-20 15:55:55 17 [ERROR] WSREP: Failed to apply write set: gtid: 83eac97a-dd74-11ea-80ac-ce4a396b2662:4957340 server_id: aad2eaa6-dd7c-11ea-a233-ca1d2c022f17 client_id: 5187420 trx_id: 411828617 flags: 3 (start_transaction | commit)
      

      Node 3:

      2020-08-20 15:55:55 19 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 4957340)
      2020-08-20 15:55:55 19 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
      2020-08-20 15:55:55 19 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
      2020-08-20 15:55:55 19 [ERROR] WSREP: Failed to apply write set: gtid: 83eac97a-dd74-11ea-80ac-ce4a396b2662:4957340 server_id: aad2eaa6-dd7c-11ea-a233-ca1d2c022f17 client_id: 5187420 trx_id: 411828617 flags: 3 (start_transaction | commit)
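
      The three excerpts above all report the same seqno (4957340). A minimal sketch of how that correlation could be checked across error logs follows. The regular expression is written against the "apply failed" lines quoted in this report only; the function names and the node labels are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Matches lines such as:
# 2020-08-20 15:55:55 1 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
# The pattern is derived from the lines quoted in this report; other server
# versions may format the message differently.
APPLY_FAILED = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<thread>\d+) "
    r"\[Warning\] WSREP: Event (?P<event>\d+) (?P<type>\w+) apply failed: "
    r"(?P<errno>\d+), seqno (?P<seqno>\d+)"
)


def failed_seqnos(lines):
    """Return (seqno, error code, event type) for every 'apply failed' line."""
    results = []
    for line in lines:
        match = APPLY_FAILED.match(line.strip())
        if match:
            results.append(
                (int(match["seqno"]), int(match["errno"]), match["type"])
            )
    return results


def group_by_seqno(logs):
    """Map each failed seqno to the sorted list of nodes that reported it.

    `logs` is a dict of node label -> iterable of log lines.
    """
    nodes_by_seqno = defaultdict(set)
    for node, lines in logs.items():
        for seqno, _errno, _event_type in failed_seqnos(lines):
            nodes_by_seqno[seqno].add(node)
    return {seqno: sorted(nodes) for seqno, nodes in nodes_by_seqno.items()}
```

      Feeding it the error log of each node, keyed by a node label, gives one entry per failed seqno, which makes it easy to confirm that every node failed on the same write set.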
      

      Nodes 2 and 3 should shut down, but they do not; they go into wsrep-disabled mode instead.

      The result is a completely down cluster. kill -9 is needed to stop nodes 2 and 3, and the only way to recover is to re-bootstrap the cluster, with nodes 2 and 3 undergoing an SST.

      By checking the binary logs of node 1, I verified that writes only go to node 1 (no multi-master). It still puzzles me that the error comes from the applier thread!

      The last statement written to the binary log before the crash is the DROP TABLE for the table named in the error. However, the update that precedes it is on another schema. The last reference to the dropped table is in an update 40 seconds before the drop.

          People

            Assignee: seppo Seppo Jaakola
            Reporter: rpizzi Rick Pizzi