Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 10.4.12
- Fix Version/s: None
Description
Scenario:
- large number of tables per schema and a large number of foreign keys (essentially every table references at least one other table, often several)
- normal user traffic flowing in (mainly updates to a session table, which nevertheless involve locking rows in more than 500 tables due to FK constraints), while at the same time a thread performs a series of schema changes, altering and dropping tables
- 3-node Galera cluster, 10.4.12 Enterprise, with all traffic going to node 1 (no read/write split)
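The racing workload can be sketched as follows. This is a hypothetical minimal repro, not the actual application schema: the table names are taken from the (anonymized) log lines below, and the FK layout is invented to match the description above.

```sql
-- Hypothetical repro sketch (schema/table names invented/anonymized):
-- FK-linked tables, as in the scenario where every table references others.
CREATE TABLE someschema.session (
    id INT PRIMARY KEY
) ENGINE=InnoDB;

CREATE TABLE someschema.sometable (
    id INT PRIMARY KEY,
    session_id INT,
    FOREIGN KEY (session_id) REFERENCES session (id)
) ENGINE=InnoDB;

-- Connection 1 on node 1 (application traffic):
-- an update whose FK checks touch rows in many dependent tables.
UPDATE someschema.sometable SET session_id = 1 WHERE id = 42;

-- Connection 2 on node 1 (schema-change thread, racing with the update):
DROP TABLE someschema.sometable;
```

The failure reported below occurs when a write set referencing the table is certified and applied after the DROP TABLE has already removed it.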
Suddenly, node 1 errors out because the applier cannot open the table for a row update. Note that the error seems to come from the applier thread even though we are only writing on this node:
2020-08-20 15:55:55 1 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 12 seqno: 4957340)
2020-08-20 15:55:55 1 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
2020-08-20 15:55:55 1 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
The node then asserts and shuts down:
mysqld: /__w/1/s/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX_ON_ES_BACKUP_DEBUGSOURCE/wsrep-lib/src/transaction.cpp:632: int wsrep::transaction::before_rollback(): Assertion `state() == s_executing || state() == s_preparing || state() == s_prepared || state() == s_must_abort || state() == s_aborting || state() == s_cert_failed || state() == s_must_replay' failed.
At the same time, almost identical errors are logged on nodes 2 and 3:
Node 2:
2020-08-20 15:55:55 17 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 4957340)
2020-08-20 15:55:55 17 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
2020-08-20 15:55:55 17 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
2020-08-20 15:55:55 17 [ERROR] WSREP: Failed to apply write set: gtid: 83eac97a-dd74-11ea-80ac-ce4a396b2662:4957340 server_id: aad2eaa6-dd7c-11ea-a233-ca1d2c022f17 client_id: 5187420 trx_id: 411828617 flags: 3 (start_transaction | commit)
Node 3:
2020-08-20 15:55:55 19 [Warning] WSREP: BF applier failed to open_and_lock_tables: 1146, fatal: 0 wsrep = (exec_mode: 2 conflict_state: 0 seqno: 4957340)
2020-08-20 15:55:55 19 [ERROR] Slave SQL: Error executing row event: 'Table 'someschema.sometable' doesn't exist', Internal MariaDB error code: 1146
2020-08-20 15:55:55 19 [Warning] WSREP: Event 577 Update_rows_v1 apply failed: 1146, seqno 4957340
2020-08-20 15:55:55 19 [ERROR] WSREP: Failed to apply write set: gtid: 83eac97a-dd74-11ea-80ac-ce4a396b2662:4957340 server_id: aad2eaa6-dd7c-11ea-a233-ca1d2c022f17 client_id: 5187420 trx_id: 411828617 flags: 3 (start_transaction | commit)
Nodes 2 and 3 should shut down, but they don't; they go into wsrep-disabled mode instead.
The result of this scenario is a completely down cluster. A kill -9 is needed to shut down nodes 2 and 3, and the only way to recover is to re-bootstrap the cluster, with nodes 2 and 3 undergoing an SST.
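The recovery described above roughly corresponds to the following operator steps. This is a sketch of the standard MariaDB Galera re-bootstrap procedure, assuming systemd-managed services; exact unit names and paths may differ on the affected hosts.

```shell
# On nodes 2 and 3: the mysqld processes ignore normal shutdown,
# so they must be killed forcibly.
pkill -9 mysqld

# On node 1 (the node with the most advanced state, per grastate.dat):
# bootstrap a new cluster from this node.
galera_new_cluster

# On nodes 2 and 3: a normal start rejoins the cluster; since their
# state diverged, they undergo a full SST from node 1.
systemctl start mariadb
```

`galera_new_cluster` is MariaDB's wrapper that starts the service with `--wsrep-new-cluster`; it must be run on exactly one node.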
I verified, by checking the binary logs of node 1, that writes go only to node 1 (no multi-master). Still, it puzzles me that the error comes from the applier thread!
The last statement logged to the binary log before the crash is the DROP TABLE of the table referenced by the error, but the update immediately preceding it is on another schema. The last reference to the dropped table is an update from 40 seconds before the drop.