Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 10.4.12
- Fix Version/s: None
Description
Scenario:
- large number of tables per schema and a large number of foreign keys (essentially every table references at least one other table, often several)
- normal user traffic flowing in (mainly updates to a session table, which nevertheless involve locking rows in more than 500 tables due to FK constraints), while at the same time a thread performs a series of schema changes, altering and dropping tables
- 3-node Galera cluster, 10.4.12 Enterprise, with all traffic going to node 1 (no read/write split)
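The racing workload can be sketched as follows. This is a hypothetical minimal repro, not the actual application schema: the table names are taken from the (anonymized) log lines below, and the FK layout is invented to match the description above.

```sql
-- Hypothetical repro sketch (schema/table names invented/anonymized):
-- FK-linked tables, as in the scenario where every table references others.
CREATE TABLE someschema.session (
    id INT PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE someschema.sometable (
    id INT PRIMARY KEY,
    session_id INT,
    FOREIGN KEY (session_id) REFERENCES session (id)
) ENGINE=InnoDB;

-- Connection 1 on node 1 (application traffic):
-- an update whose FK checks touch rows in many dependent tables.
UPDATE someschema.sometable SET session_id = 1 WHERE id = 42;

-- Connection 2 on node 1 (schema-change thread, racing with the update):
DROP TABLE someschema.sometable;
```

The failure reported below occurs when a write set referencing the table is certified and applied after the DROP TABLE has already removed it.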
Suddenly, node 1 errors out because the applier cannot open the table for a row update. Note that the error seems to come from the applier thread even though we are only writing on this node:
2020-08-20 15:55:55 1 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 12 seqno: 4957340)
2020-08-20 15:55:55 1 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
2020-08-20 15:55:55 1 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
The node then asserts and shuts down:
mysqld: /__w/1/s/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/wsrep-lib/src/transaction.cpp:632: int wsrep::transaction::before_rollback(): Assertion `state() == s_executing || state() == s_preparing || state() == s_prepared || state() == s_must_abort || state() == s_aborting || state() == s_cert_failed || state() == s_must_replay' failed.
At the same time, almost identical errors are logged on nodes 2 and 3:
Node 2:
2020-08-20 15:55:55 17 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 4957340)
2020-08-20 15:55:55 17 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
2020-08-20 15:55:55 17 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
2020-08-20 15:55:55 17 [ERROR] WSREP: Failed to apply write set: gtid: 83eac97a-dd74-11ea-80ac-ce4a396b2662:4957340 server_id: aad2eaa6-dd7c-11ea-a233-ca1d2c022f17 client_id: 5187420 trx_id: 411828617 flags: 3 (start_transaction | commit)
Node 3:
2020-08-20 15:55:55 19 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 4957340)
2020-08-20 15:55:55 19 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
2020-08-20 15:55:55 19 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
2020-08-20 15:55:55 19 [ERROR] WSREP: Failed to apply write set: gtid: 83eac97a-dd74-11ea-80ac-ce4a396b2662:4957340 server_id: aad2eaa6-dd7c-11ea-a233-ca1d2c022f17 client_id: 5187420 trx_id: 411828617 flags: 3 (start_transaction | commit)
Nodes 2 and 3 should shut down, but they don't; they go into wsrep-disabled mode instead.
The result of this scenario is a completely down cluster. A kill -9 is needed to shut down nodes 2 and 3, and the only way to recover is to re-bootstrap the cluster, with nodes 2 and 3 undergoing an SST.
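The recovery described above roughly corresponds to the following operator steps. This is a sketch of the standard MariaDB Galera re-bootstrap procedure, assuming systemd-managed services; exact unit names and paths may differ on the affected hosts.

```shell
# On nodes 2 and 3: the mysqld processes ignore normal shutdown,
# so they must be killed forcibly.
pkill -9 mysqld

# On node 1 (the node with the most advanced state, per grastate.dat):
# bootstrap a new cluster from this node.
galera_new_cluster

# On nodes 2 and 3: a normal start rejoins the cluster; since their
# state diverged, they undergo a full SST from node 1.
systemctl start mariadb
```

`galera_new_cluster` is MariaDB's wrapper that starts the service with `--wsrep-new-cluster`; it must be run on exactly one node.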
I verified, by checking the binary logs of node 1, that writes go only to node 1 (no multi-master). Still, it puzzles me that the error comes from the applier thread!
The last statement logged to the binary log before the crash is the DROP TABLE of the table referenced by the error, but the update immediately preceding it is on another schema. The last reference to the dropped table is an update from 40 seconds before the drop.