[MDEV-21153] Galera: Replica nodes crash with signal 6 after indexed virtual columns and FK cascading deletes: WSREP has not yet prepared node for application use Created: 2019-11-26 Updated: 2021-01-26 Resolved: 2021-01-20 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Manipulation - Delete, Dynamic Columns, Galera |
| Affects Version/s: | 10.3.21, 10.4.11 |
| Fix Version/s: | 10.3.28, 10.4.18, 10.5.9, 10.6.0 |
| Type: | Bug | Priority: | Major |
| Reporter: | Stepan Patryshev (Inactive) | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | galera |
| Environment: | OS: CentOS Linux release 7.6.1810 (Core). |
| Description |
|
This was discovered while trying to reproduce MDEV-19601. Two of the three Galera nodes (the replica nodes) crash. MariaDB Version 10.4.11-MariaDB-debug: Repository: MariaDB/server; branch 10.4; Revision ae72205e31e7665238c1757ede352d9ca53d5327. Galera lib 26.4.3(r4548): Repository: MariaDB/galera; branch mariadb-4.x; Revision a5431753a3f6bdd348adcbca00e3450ba0ef9045. Client output:
Node 2 log
It is also reproduced on 10.3, but with an additional error in the client:
MariaDB Version 10.3.21-MariaDB-debug: Repository: MariaDB/server; branch 10.3; Revision a14544260c33dcdb057d2f62c4aab33cb09ebcb1. Galera lib 25.3.28(r3879): Repository: MariaDB/galera; branch mariadb-3.x; Revision fa9f6d0127a060cf12031858bedbd766fc6cdb61. Client output:
|
| Comments |
| Comment by Jan Lindström (Inactive) [ 2019-11-27 ] |
|
branch: 10.3 I can't repeat the crash without Galera, but in my opinion it does not work correctly with InnoDB alone either: since the foreign key is defined as ON DELETE CASCADE, the row with pid=1 should be deleted when we execute DELETE FROM testMain WHERE primid=1; which deletes its only parent. This inconsistency could be the reason for the crash seen on Galera, or there could be more than one bug. |
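The expected ON DELETE CASCADE behaviour described in this comment can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than MariaDB/InnoDB, so it only shows the SQL semantics and cannot reproduce the reported bug. The table and column names (testMain, testRef, primid, pid) come from the comment; the rest of the schema, including the id column, is an assumption, because the original table definitions are not shown in this report.

```python
import sqlite3

# In-memory SQLite database; this is NOT MariaDB and only demonstrates
# the cascade semantics the comment expects.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Hypothetical schema: only the table names and the primid/pid columns
# are taken from the report.
conn.executescript("""
    CREATE TABLE testMain (primid INTEGER PRIMARY KEY);
    CREATE TABLE testRef (
        id  INTEGER PRIMARY KEY,
        pid INTEGER REFERENCES testMain(primid) ON DELETE CASCADE
    );
    INSERT INTO testMain VALUES (1), (2);
    INSERT INTO testRef VALUES (10, 1), (20, 2);
""")

# Deleting the parent row should also delete the child row with pid = 1.
conn.execute("DELETE FROM testMain WHERE primid = 1")
remaining = [row[0] for row in conn.execute("SELECT pid FROM testRef ORDER BY pid")]
print(remaining)
```

If a child row with pid=1 were still present after the DELETE, that would be the kind of inconsistency the comment describes.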
| Comment by Marko Mäkelä [ 2020-04-03 ] |
|
jplindst, I just accidentally found out you had assigned this bug to me. Can you please provide an mtr test case that repeats the problem without Galera? |
| Comment by Andrei Elkin [ 2021-01-01 ] |
|
There was a chance that The above bug is specific to the legacy replication and is fixed to correct the slave applier. However, WSREP replication does not support the base:
ends up with the stack below, which most probably indicates that prelocking does not work correctly;
When the stack is sorted out, the FK-related prelocking of |
| Comment by Jan Lindström (Inactive) [ 2021-01-13 ] |
|
seppo, can you have a look? In my understanding the first problem is that when executing the second DELETE on the master, the foreign key is not added to the write set (see jan.test). Now I tried the test case on |
| Comment by Seppo Jaakola [ 2021-01-13 ] |
|
Reproduced this replica node crash with version 10.3. As Andrei pointed out, the issue seems to be the same as with Debugger shows that, when the replica node is applying the cascaded delete in the testRef table, row_upd_del_mark_clust_rec() skips setting the mysql table reference for the row_upd_store_row() call:
This leads to an assertion failure three stack levels lower, in innobase_get_computed_value():
|
| Comment by Seppo Jaakola [ 2021-01-14 ] |
|
Further debugging shows that slave SQL applying happens a bit differently between async and galera replication with regard to table prelocking.
slave_fk_event_map is set only for async replication. As a result, a galera replication slave skips extending the table list in open_tables(), which eventually leads to the assert. |
| Comment by Jan Lindström (Inactive) [ 2021-01-14 ] |
|
sujatha.sivakumar, why is slave_fk_event_map set only for async replication? |
| Comment by Sujatha Sivakumar (Inactive) [ 2021-01-14 ] |
|
Hello jplindst, at present 'slave_fk_event_map' is set only for async replication, as there is an issue in the galera code.
The stack trace is provided by Elkin as part of the below-mentioned comment Once the galera issue is fixed, this 'if (!WSREP_ON)' condition can be removed and |
| Comment by Jan Lindström (Inactive) [ 2021-01-14 ] |
|
Streaming replication is supported from 10.4 onwards (there could be an issue around this as well), but here we are talking about 10.3. |
| Comment by Seppo Jaakola [ 2021-01-14 ] |
|
I will submit a PR to fix 10.3 by simply removing the (!WSREP_ON) condition, so that galera replication is also covered by the fix in I will also submit a separate PR for 10.4, which removes the same WSREP_ON condition and also contains a fix for streaming replication, which used corrupt data in THD::LEX:query_table_list in the wsrep system table access phase, resulting in table prelocking when it shouldn't happen. The 10.4 PR will take a little longer due to our internal review. |
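The effect of the (!WSREP_ON) condition discussed in the comments above can be pictured with a toy model. This is a rough sketch in Python, not the server's C++ code: the names slave_fk_event_map and WSREP_ON come from the comments, while the functions applier_sets_fk_map() and tables_to_open(), the guard_present parameter, and the fk_children mapping are hypothetical and exist only for illustration.

```python
def applier_sets_fk_map(wsrep_on, guard_present=True):
    """Toy stand-in for the applier deciding whether slave_fk_event_map is set.

    With the guard present, the flag is set only when WSREP is off (async
    replication). Without the guard, it is set for galera appliers too.
    """
    if guard_present:
        return not wsrep_on
    return True


def tables_to_open(statement_tables, fk_children, slave_fk_event_map):
    """Toy prelocking step: extend the table list with FK child tables."""
    tables = list(statement_tables)
    if slave_fk_event_map:
        for table in statement_tables:
            for child in fk_children.get(table, []):
                if child not in tables:
                    tables.append(child)
    return tables


# Hypothetical FK relationship based on the table names in this report.
fk_children = {"testMain": ["testRef"]}

async_tables = tables_to_open(
    ["testMain"], fk_children, applier_sets_fk_map(wsrep_on=False))
galera_tables = tables_to_open(
    ["testMain"], fk_children, applier_sets_fk_map(wsrep_on=True))
galera_without_guard = tables_to_open(
    ["testMain"], fk_children,
    applier_sets_fk_map(wsrep_on=True, guard_present=False))

print(async_tables, galera_tables, galera_without_guard)
```

In this model the galera applier opens only testMain while the guard is present, so the cascaded delete reaches testRef without that table having been added to the list, which is the situation the debugging comments associate with the assert.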
| Comment by Seppo Jaakola [ 2021-01-18 ] |
|
Submitted two PRs, for 10.3 and 10.4. |