[MDEV-26177] WSREP assertion failure in bf_abort when replicating certain DDL against in-memory tables
Created: 2021-07-19 | Updated: 2022-10-03 | Resolved: 2022-10-03 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.5.11 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | Kent Hoover | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Environment: | CentOS7 / RH7 |
| Description |
|
The bug is triggered by truncating ENGINE=MEMORY tables. The TRUNCATE TABLE statement is (erroneously?) replicated and fails to apply because in-memory tables are node-local. The transaction is then brute-force aborted, the “local or streaming tx” assertion fails in WSREP, and the node crashes with an error similar to the one below. Because the same transaction is replicated again and again, the node crashes repeatedly, which may eventually lock up the whole cluster.

Perhaps mode==local should evaluate to true?

2021-07-17 10:26:43 10 [ERROR] Slave SQL: Error 'Table 'lsa.ttm$etl_dimension_discovery_etl06_lsa_001' doesn't exist' on query. Default database: 'lsa'. Query: '/* DATABASE_EXECUTE_DDL*/ TRUNCATE TABLE `ttm$etl_dimension_discovery_etl06_lsa_001` /* PROCESS(LSA [ETLs] <lsa@etl06>, metadata_background|abandoned[1h]|time_to_live[4h]|maximum_reuse[15m])*/', Internal MariaDB error code: 1146

We will try our best to gather more information that will hopefully help.

Server version: 10.5.11-MariaDB
Thread pointer: 0x0 |
| Comments |
| Comment by Kent Hoover [ 2021-09-13 ] |
|
Hello? Anything? |
| Comment by Gabor Orosz [ 2021-09-22 ] |
|
Hi,

I think the problem is not the DDL itself, but the fact that schema changes are non-transactional and require special care. By default, Galera uses Total Order Isolation to execute such changes. This means that the cluster members have to reach a synchronization point where the application of ordinary DML transactions is suspended, and then resumed after the DDL has completed on all nodes. The crash happens in this resume phase: the applier threads try to process transactions in parallel, and two of them end up in lock contention because of their replay order.

I suspect that this is the same issue as the one that is reported in

Best regards, |
| Comment by Kent Hoover [ 2021-12-28 ] |
|
Have you concluded that this issue was indeed the same as

Cheers, |
| Comment by Jan Lindström (Inactive) [ 2022-10-03 ] |
|
I did not see any problem using TRUNCATE on MEMORY tables. Note that DDL is replicated but DML is not. |