[MDEV-21153] Galera: Replica nodes crash with signal 6 after indexed virtual columns and FK cascading deletes: WSREP has not yet prepared node for application use Created: 2019-11-26  Updated: 2021-01-26  Resolved: 2021-01-20

Status: Closed
Project: MariaDB Server
Component/s: Data Manipulation - Delete, Dynamic Columns, Galera
Affects Version/s: 10.3.21, 10.4.11
Fix Version/s: 10.3.28, 10.4.18, 10.5.9, 10.6.0

Type: Bug Priority: Major
Reporter: Stepan Patryshev (Inactive) Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 0
Labels: galera
Environment:

OS: CentOS Linux release 7.6.1810 (Core).


Attachments: File 191125_10.4.11_mysqld.2.err     Zip Archive 191125_Replica_Crash_10.3.21.zip     Zip Archive 191125_Replica_Crash_10.4.11.zip     File jan.test     File jan2.test    
Issue Links:
Relates
relates to MDEV-19601 MDEV-13708 is still unfixed Stalled
relates to MDEV-23033 All slaves crash once in ~24 hours an... Closed

 Description   

This was discovered while trying to reproduce MDEV-19601: two of the three Galera nodes (the replica nodes) crash.

MariaDB Version 10.4.11-MariaDB-debug: Repository: MariaDB/server; branch 10.4; Revision ae72205e31e7665238c1757ede352d9ca53d5327.

Galera lib 26.4.3(r4548): Repository: MariaDB/galera; branch mariadb-4.x; Revision a5431753a3f6bdd348adcbca00e3450ba0ef9045.

Client output:

Server version: 10.4.11-MariaDB-debug-log Source distribution
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> use test;
Database changed
MariaDB [test]> CREATE TABLE testMain (primid INT UNSIGNED NOT NULL AUTO_INCREMENT, PRIMARY KEY (primid)) ENGINE = InnoDB;
Query OK, 0 rows affected (0.047 sec)
 
MariaDB [test]> CREATE TABLE testRef (id INT UNSIGNED NOT NULL AUTO_INCREMENT, pid INT UNSIGNED, bitmap TINYINT UNSIGNED NOT NULL DEFAULT 0, bitmap5 TINYINT UNSIGNED GENERATED ALWAYS AS (bitmap&(1<<5)) VIRTUAL, PRIMARY KEY (id), FOREIGN KEY (pid) REFERENCES testMain (primid) ON DELETE CASCADE ON UPDATE CASCADE);
Query OK, 0 rows affected (0.034 sec)
 
MariaDB [test]> CREATE INDEX bitmap5 ON testRef(bitmap5) USING BTREE;
Query OK, 0 rows affected (0.073 sec)
Records: 0  Duplicates: 0  Warnings: 0
 
MariaDB [test]> INSERT INTO testMain VALUES(0);
Query OK, 1 row affected (0.016 sec)
 
MariaDB [test]> SELECT * FROM testMain;
+--------+
| primid |
+--------+
|      1 |
+--------+
1 row in set (0.001 sec)
 
MariaDB [test]>  INSERT INTO testRef(pid) VALUES(1);
Query OK, 1 row affected (0.007 sec)
 
MariaDB [test]> DELETE FROM testMain WHERE primid=1;
Query OK, 1 row affected (0.008 sec)
 
MariaDB [test]> SELECT * FROM testMain;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use
MariaDB [test]> SELECT * from testRef;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

Node 2 log:
191125 19:31:49 [ERROR] mysqld got signal 6 ;

2019-11-25 19:31:49 2 [Note] WSREP: assigned new next trx id: 15
mysqld: /home/stepan/mariadb/10.4/git/storage/innobase/handler/ha_innodb.cc:20586: dfield_t* innobase_get_computed_value(dtuple_t*, const dict_v_col_t*, const dict_index_t*, mem_heap_t**, mem_heap_t*, const dict_field_t*, THD*, TABLE*, byte*, const dict_table_t*, upd_t*, dict_foreign_t*): Assertion `mysql_table' failed.
191125 19:31:49 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
 
Server version: 10.4.11-MariaDB-debug-log
key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=1
max_threads=153
thread_count=8
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 63618 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fecc0000af0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fecd0588950 thread_stack 0x49000
/home/stepan/mariadb/10.4/git/sql/mysqld(my_print_stacktrace+0x40)[0x55fcd330c976]
mysys/stacktrace.c:269(my_print_stacktrace)[0x55fcd2a7b215]
sigaction.c:0(__restore_rt)[0x7fece3f5a5d0]
:0(__GI_raise)[0x7fece2248207]
:0(__GI_abort)[0x7fece22498f8]
:0(__assert_fail_base)[0x7fece2241026]
:0(__GI___assert_fail)[0x7fece22410d2]
handler/ha_innodb.cc:20588(innobase_get_computed_value(dtuple_t*, dict_v_col_t const*, dict_index_t const*, mem_block_info_t**, mem_block_info_t*, dict_field_t const*, THD*, TABLE*, unsigned char*, dict_table_t const*, upd_t*, dict_foreign_t*))[0x55fcd2ce9794]
row/row0upd.cc:2178(row_upd_store_v_row(upd_node_t*, upd_t const*, THD*, TABLE*))[0x55fcd2ece6e9]
row/row0upd.cc:2245(row_upd_store_row(upd_node_t*, THD*, TABLE*))[0x55fcd2ece9b5]
row/row0upd.cc:2989(row_upd_del_mark_clust_rec(upd_node_t*, dict_index_t*, unsigned long*, que_thr_t*, unsigned long, bool, mtr_t*))[0x55fcd2ed19b5]
row/row0upd.cc:3174(row_upd_clust_step(upd_node_t*, que_thr_t*))[0x55fcd2ed2496]
row/row0upd.cc:3298(row_upd(upd_node_t*, que_thr_t*))[0x55fcd2ed2a50]
row/row0upd.cc:3442(row_upd_step(que_thr_t*))[0x55fcd2ed3017]
row/row0mysql.cc:2265(row_update_cascade_for_mysql(que_thr_t*, upd_node_t*, dict_table_t*))[0x55fcd2e70478]
row/row0ins.cc:1442(row_ins_foreign_check_on_constraint(que_thr_t*, dict_foreign_t*, btr_pcur_t*, dtuple_t*, mtr_t*))[0x55fcd2e4261e]
row/row0ins.cc:1850(row_ins_check_foreign_constraint(unsigned long, dict_foreign_t*, dict_table_t*, dtuple_t*, que_thr_t*))[0x55fcd2e436cb]
row/row0upd.cc:297(row_upd_check_references_constraints(upd_node_t*, btr_pcur_t*, dict_table_t*, dict_index_t*, unsigned long*, que_thr_t*, mtr_t*))[0x55fcd2ec94cc]
row/row0upd.cc:3000(row_upd_del_mark_clust_rec(upd_node_t*, dict_index_t*, unsigned long*, que_thr_t*, unsigned long, bool, mtr_t*))[0x55fcd2ed1a72]
row/row0upd.cc:3174(row_upd_clust_step(upd_node_t*, que_thr_t*))[0x55fcd2ed2496]
row/row0upd.cc:3298(row_upd(upd_node_t*, que_thr_t*))[0x55fcd2ed2a50]
row/row0upd.cc:3442(row_upd_step(que_thr_t*))[0x55fcd2ed3017]
row/row0mysql.cc:1891(row_update_for_mysql(row_prebuilt_t*))[0x55fcd2e6f535]
handler/ha_innodb.cc:8970(ha_innobase::delete_row(unsigned char const*))[0x55fcd2cd4556]
sql/handler.cc:6782(handler::ha_delete_row(unsigned char const*))[0x55fcd2a91ea9]
sql/log_event.cc:14434(Delete_rows_log_event::do_exec_row(rpl_group_info*))[0x55fcd2beb291]
sql/log_event.cc:11606(Rows_log_event::do_apply_event(rpl_group_info*))[0x55fcd2be4450]
sql/log_event.h:1482(Log_event::apply_event(rpl_group_info*))[0x55fcd261ecbf]
sql/wsrep_applier.cc:200(wsrep_apply_events(THD*, Relay_log_info*, void const*, unsigned long))[0x55fcd299b301]
sql/wsrep_high_priority_service.cc:496(Wsrep_applier_service::apply_write_set(wsrep::ws_meta const&, wsrep::const_buffer const&, wsrep::mutable_buffer&))[0x55fcd297b03d]
src/server_state.cpp:328(apply_write_set(wsrep::server_state&, wsrep::high_priority_service&, wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::const_buffer const&))[0x55fcd33afc95]
src/server_state.cpp:1111(wsrep::server_state::on_apply(wsrep::high_priority_service&, wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::const_buffer const&))[0x55fcd33b3497]
wsrep/high_priority_service.hpp:48(wsrep::high_priority_service::apply(wsrep::ws_handle const&, wsrep::ws_meta const&, wsrep::const_buffer const&))[0x55fcd33ca613]
src/wsrep_provider_v26.cpp:492((anonymous namespace)::apply_cb(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsrep_trx_meta const*, bool*))[0x55fcd33c794b]
src/trx_handle.cpp:414(galera::TrxHandleSlave::apply(void*, wsrep_cb_status (*)(void*, wsrep_ws_handle const*, unsigned int, wsrep_buf const*, wsrep_trx_meta const*, bool*), wsrep_trx_meta const&, bool&))[0x7fecdbe9b0e1]
src/replicator_smm.cpp:514(galera::ReplicatorSMM::apply_trx(void*, galera::TrxHandleSlave&))[0x7fecdbecdf72]
src/replicator_smm.cpp:2157(galera::ReplicatorSMM::process_trx(void*, boost::shared_ptr<galera::TrxHandleSlave> const&))[0x7fecdbed3b70]
src/gcs_action_source.cpp:63(galera::GcsActionSource::process_writeset(void*, gcs_action const&, bool&))[0x7fecdbeb4cbb]
src/gcs_action_source.cpp:109(galera::GcsActionSource::dispatch(void*, gcs_action const&, bool&))[0x7fecdbeb4e95]
src/gcs_action_source.cpp:188(galera::GcsActionSource::process(void*, bool&))[0x7fecdbeb526d]
src/replicator_smm.cpp:391(galera::ReplicatorSMM::async_recv(void*))[0x7fecdbed4174]
/home/stepan/mariadb/10.4/git/sql/mysqld(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0x30)[0x55fcd33c8836]
src/wsrep_provider_v26.cpp:727(wsrep::wsrep_provider_v26::run_applier(wsrep::high_priority_service*))[0x55fcd299c005]
sql/wsrep_thd.cc:62(wsrep_replication_process(THD*, void*))[0x55fcd298c6ee]
sql/wsrep_mysqld.cc:2688(start_wsrep_THD(void*))[0x55fcd329ace5]
pthread_create.c:0(start_thread)[0x7fece3f52dd5]
/lib64/libc.so.6(clone+0x6d)[0x7fece230fead]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7fecd3b7e423): DELETE FROM testMain WHERE primid=1
Connection ID (thread ID): 2
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file...
Working directory at /home/stepan/mariadb/10.4/git/mysql-test/var/mysqld.2/data
Resource Limits:
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        unlimited            unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             4096                 23005                processes 
Max open files            1024                 4096                 files     
Max locked memory         65536                65536                bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       23005                23005                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us        
Core pattern: |/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e %P %I %h

Other logs.

The crash also reproduces on 10.3, but with an additional error in the client:

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

MariaDB Version 10.3.21-MariaDB-debug: Repository: MariaDB/server; branch 10.3; Revision a14544260c33dcdb057d2f62c4aab33cb09ebcb1.

Galera lib 25.3.28(r3879): Repository: MariaDB/galera; branch mariadb-3.x; Revision fa9f6d0127a060cf12031858bedbd766fc6cdb61.

Client output:

Server version: 10.3.21-MariaDB-debug-log Source distribution
 
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
 
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
 
MariaDB [(none)]> use test;
Database changed
MariaDB [test]> CREATE TABLE testMain (primid INT UNSIGNED NOT NULL AUTO_INCREMENT, PRIMARY KEY (primid)) ENGINE = InnoDB;
Query OK, 0 rows affected (0.031 sec)
 
MariaDB [test]> CREATE TABLE testRef (id INT UNSIGNED NOT NULL AUTO_INCREMENT, pid INT UNSIGNED, bitmap TINYINT UNSIGNED NOT NULL DEFAULT 0, bitmap5 TINYINT UNSIGNED GENERATED ALWAYS AS (bitmap&(1<<5)) VIRTUAL, PRIMARY KEY (id), FOREIGN KEY (pid) REFERENCES testMain (primid) ON DELETE CASCADE ON UPDATE CASCADE);
Query OK, 0 rows affected (0.038 sec)
 
MariaDB [test]> CREATE INDEX bitmap5 ON testRef(bitmap5) USING BTREE;
Query OK, 0 rows affected (0.033 sec)
Records: 0  Duplicates: 0  Warnings: 0
 
MariaDB [test]> INSERT INTO testMain VALUES(0);
Query OK, 1 row affected (0.009 sec)
 
MariaDB [test]>  SELECT * FROM testMain;
+--------+
| primid |
+--------+
|      1 |
+--------+
1 row in set (0.001 sec)
 
MariaDB [test]>  INSERT INTO testRef(pid) VALUES(1);
Query OK, 1 row affected (0.017 sec)
 
MariaDB [test]> DELETE FROM testMain WHERE primid=2;
Query OK, 0 rows affected (0.001 sec)
 
MariaDB [test]>  SELECT * FROM testMain;
+--------+
| primid |
+--------+
|      1 |
+--------+
1 row in set (0.001 sec)
 
MariaDB [test]> DELETE FROM testMain WHERE primid=1;
Query OK, 1 row affected (0.016 sec)
 
MariaDB [test]>  SELECT * FROM testMain;
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
MariaDB [test]> SELECT * from testRef;
ERROR 1047 (08S01): WSREP has not yet prepared node for application use

Other logs.



 Comments   
Comment by Jan Lindström (Inactive) [ 2019-11-27 ]

branch: 10.3
commit: bf58ec77a1adaa653a0b044b950cf420f8c19de9

I can't repeat the crash without Galera, but in my opinion this does not work correctly on plain InnoDB either: since the foreign key is defined as ON DELETE CASCADE, the row with pid=1 should be deleted when we execute DELETE FROM testMain WHERE primid=1, deleting its only parent. This inconsistency could be the reason for the crash seen on Galera, or there could be more than one bug.

CREATE TABLE testMain (primid INT UNSIGNED NOT NULL AUTO_INCREMENT, PRIMARY KEY (primid)) ENGINE = InnoDB;
CREATE TABLE testRef (id INT UNSIGNED NOT NULL AUTO_INCREMENT, pid INT UNSIGNED, bitmap TINYINT UNSIGNED NOT NULL DEFAULT 0, bitmap5 TINYINT UNSIGNED GENERATED ALWAYS AS (bitmap&(1<<5)) VIRTUAL, PRIMARY KEY (id), FOREIGN KEY (pid) REFERENCES testMain (primid) ON DELETE CASCADE ON UPDATE CASCADE);
CREATE INDEX bitmap5 ON testRef(bitmap5) USING BTREE;
INSERT INTO testMain VALUES(0);
SELECT * FROM testMain;
primid
1
SELECT * from testRef;
id	pid	bitmap	bitmap5
INSERT INTO testRef(pid) VALUES(1);
SELECT * FROM testMain;
primid
1
SELECT * from testRef;
id	pid	bitmap	bitmap5
1	1	0	0
DELETE FROM testMain WHERE primid=2;
SELECT * FROM testMain;
primid
1
SELECT * from testRef;
id	pid	bitmap	bitmap5
1	1	0	0
DELETE FROM testMain WHERE primid=1;
SELECT * FROM testMain;
primid
SELECT * from testRef;
id	pid	bitmap	bitmap5
1	1	0	0
DROP TABLE testRef, testMain;

Comment by Marko Mäkelä [ 2020-04-03 ]

jplindst, I just accidentally found out you had assigned this bug to me. Can you please provide a mtr test case that repeats the problem without Galera?

Comment by Andrei Elkin [ 2021-01-01 ]

There was a chance that the MDEV-23033 fixes would cover this case, but it turns out they don't.

That bug is specific to legacy replication and was fixed by bringing the slave applier into
compliance with the prelocking protocol. The fixes build on the support for
--mysqld=--slave_run_triggers_for_rbr=yes, which in turn follows the master/general server
logic of FK-constraint table discovery/prelocking.
That is, the slave_run_triggers_for_rbr pattern is the base for the MDEV-23033 case.

However WSREP replication does not support the base:

./mtr galera_sr.galera_sr_multirow_rollback  --mysqld=--slave_run_triggers_for_rbr=yes

ends up with the stack below, which most probably indicates that prelocking does not work correctly;
the stack is from the vanilla origin/10.4:

Thread 1 (Thread 0x7facb4158700 (LWP 2529)):
#0  __pthread_kill (threadid=<optimized out>, signo=11) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
#1  0x000055ec819c5b13 in my_write_core (sig=11) at /home/sujatha/bug_repo/test-10.4/mysys/stacktrace.c:386
#2  0x000055ec81007bcd in handle_fatal_signal (sig=11) at /home/sujatha/bug_repo/test-10.4/sql/signal_handler.cc:343
#3  <signal handler called>
#4  0x000055ec80c068da in TABLE_LIST::placeholder (this=0x815) at /home/sujatha/bug_repo/test-10.4/sql/table.h:2623
#5  0x000055ec80bfb9eb in mark_real_tables_as_free_for_reuse (table_list=0x7fac9403bd60) at /home/sujatha/bug_repo/test-10.4/sql/sql_base.cc:5428
#6  0x000055ec80bfc1c5 in lock_tables (thd=0x7fac94000ce8, tables=0x7facb4155d50, count=1, flags=2059) at /home/sujatha/bug_repo/test-10.4/sql/sql_base.cc:5593
#7  0x000055ec80bfb576 in open_and_lock_tables (thd=0x7fac94000ce8, options=..., tables=0x7facb4155d50, derived=false, flags=2059, prelocking_strategy=0x7facb4155cc8) at /home/sujatha/bug_repo/test-10.4/sql/sql_base.cc:5290
#8  0x000055ec80befcc9 in open_and_lock_tables (thd=0x7fac94000ce8, tables=0x7facb4155d50, derived=false, flags=2059, prelocking_strategy=0x7facb4155cc8) at /home/sujatha/bug_repo/test-10.4/sql/sql_base.h:271
#9  0x000055ec80bfad82 in open_n_lock_single_table (thd=0x7fac94000ce8, table_l=0x7facb4155d50, lock_type=TL_WRITE, flags=2059, prelocking_strategy=0x7facb4155cc8) at /home/sujatha/bug_repo/test-10.4/sql/sql_base.cc:5124
#10 0x000055ec80da2ab0 in open_n_lock_single_table (thd=0x7fac94000ce8, table_l=0x7facb4155d50, lock_type=TL_WRITE, flags=2059) at /home/sujatha/bug_repo/test-10.4/sql/sql_base.h:483
#11 0x000055ec80f26ece in Wsrep_schema_impl::open_table (thd=0x7fac94000ce8, schema_name=0x7facb4156470, table_name=0x7facb4156480, lock_type=TL_WRITE, table=0x7facb41564e8) at /home/sujatha/bug_repo/test-10.4/sql/wsrep_schema.cc:234
#12 0x000055ec80f27097 in Wsrep_schema_impl::open_for_write (thd=0x7fac94000ce8, table_name=0x55ec832af5a0 "wsrep_streaming_log", table=0x7facb41564e8) at /home/sujatha/bug_repo/test-10.4/sql/wsrep_schema.cc:255
#13 0x000055ec80f294e7 in Wsrep_schema::append_fragment (this=0x55ec83883450, thd=0x7fac94000ce8, server_id=..., transaction_id=..., seqno=..., flags=1, data=...) at /home/sujatha/bug_repo/test-10.4/sql/wsrep_schema.cc:906
#14 0x000055ec80f03396 in Wsrep_high_priority_service::append_fragment_and_commit (this=0x7facb4157dc0, ws_handle=..., ws_meta=..., data=..., xid=...) at /home/sujatha/bug_repo/test-10.4/sql/wsrep_high_priority_service.cc:245
#15 0x000055ec81a46267 in apply_fragment (server_state=..., high_priority_service=..., streaming_applier=0x7fac9401f330, ws_handle=..., ws_meta=..., data=...) at /home/sujatha/bug_repo/test-10.4/wsrep-lib/src/server_state.cpp:134
#16 0x000055ec81a470bd in apply_write_set (server_state=..., high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /home/sujatha/bug_repo/test-10.4/wsrep-lib/src/server_state.cpp:355
#17 0x000055ec81a4ab62 in wsrep::server_state::on_apply (this=0x55ec833d46b0, high_priority_service=..., ws_handle=..., ws_meta=..., data=...) at /home/sujatha/bug_repo/test-10.4/wsrep-lib/src/server_state.cpp:1137
#18 0x000055ec81a642f5 in wsrep::high_priority_service::apply (this=0x7facb4157dc0, ws_handle=..., ws_meta=..., data=...) at /home/sujatha/bug_repo/test-10.4/wsrep-lib/include/wsrep/high_priority_service.hpp:47
#19 0x000055ec81a61562 in (anonymous namespace)::apply_cb (ctx=0x7facb4157dc0, wsh=0x7facb4156df0, flags=64, buf=0x7facb4156e00, meta=0x7facb41570f0, exit_loop=0x7facb41570af) at /home/sujatha/bug_repo/test-10.4/wsrep-lib/src/wsrep_provider_v26.cpp:502
#20 0x00007facb76106ae in galera::TrxHandleSlave::apply (this=this@entry=0x7fac940258d0, recv_ctx=recv_ctx@entry=0x7facb4157dc0, apply_cb=0x55ec81a61333 <(anonymous namespace)::apply_cb(void*, wsrep_ws_handle_t const*, uint32_t, wsrep_buf_t const*, wsrep_trx_meta_t const*, wsrep_bool_t*)>, meta=..., exit_loop=exit_loop@entry=@0x7facb41570af: false) at galera/src/trx_handle.cpp:391
#21 0x00007facb7662388 in galera::ReplicatorSMM::apply_trx (this=this@entry=0x55ec83418d70, recv_ctx=recv_ctx@entry=0x7facb4157dc0, ts=...) at galera/src/replicator_smm.cpp:504
#22 0x00007facb7669f25 in galera::ReplicatorSMM::process_trx (this=0x55ec83418d70, recv_ctx=0x7facb4157dc0, ts_ptr=...) at galera/src/replicator_smm.cpp:2127
#23 0x00007facb7638dc8 in galera::GcsActionSource::process_writeset (this=0x55ec8344abe0, recv_ctx=0x7facb4157dc0, act=..., exit_loop=@0x7facb41579af: false) at galera/src/gcs_action_source.cpp:62
#24 0x00007facb7639c22 in galera::GcsActionSource::dispatch (this=this@entry=0x55ec8344abe0, recv_ctx=recv_ctx@entry=0x7facb4157dc0, act=..., exit_loop=@0x7facb41579af: false) at galera/src/gcs_action_source.cpp:110
#25 0x00007facb7639ef2 in galera::GcsActionSource::process (this=0x55ec8344abe0, recv_ctx=0x7facb4157dc0, exit_loop=@0x7facb41579af: false) at galera/src/gcs_action_source.cpp:183
#26 0x00007facb7664b20 in galera::ReplicatorSMM::async_recv (this=0x55ec83418d70, recv_ctx=0x7facb4157dc0) at galera/src/replicator_smm.cpp:390
#27 0x00007facb768291b in galera_recv (gh=<optimized out>, recv_ctx=<optimized out>) at galera/src/wsrep_provider.cpp:236
#28 0x000055ec81a62568 in wsrep::wsrep_provider_v26::run_applier (this=0x55ec832b4bd0, applier_ctx=0x7facb4157dc0) at /home/sujatha/bug_repo/test-10.4/wsrep-lib/src/wsrep_provider_v26.cpp:740
#29 0x000055ec80f23183 in wsrep_replication_process (thd=0x7fac94000ce8, arg=0x55ec83453b60) at /home/sujatha/bug_repo/test-10.4/sql/wsrep_thd.cc:58
#30 0x000055ec80f14c1a in start_wsrep_THD (arg=0x55ec83453b60) at /home/sujatha/bug_repo/test-10.4/sql/wsrep_mysqld.cc:2850
#31 0x000055ec81883b9e in pfs_spawn_thread (arg=0x55ec8344c9c8) at /home/sujatha/bug_repo/test-10.4/storage/perfschema/pfs.cc:1869
#32 0x00007facbd2a06db in start_thread (arg=0x7facb4158700) at pthread_create.c:463
#33 0x00007facbc43b71f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
 - saving '/home/sujatha/bug_repo/test-10.4/bld/mysql-test/var/log/galera_sr.galera_sr_multirow_rollback-binlogon,innodb/' to '/home/sujatha/bug_repo/test-10.4/bld/mysql-test/var/log/galera_sr.galera_sr_multirow_rollback-binlogon,innodb/'

Once that stack is sorted out, the FK-related prelocking from the MDEV-23033 fixes can be enabled when WSREP_ON.

Comment by Jan Lindström (Inactive) [ 2021-01-13 ]

seppo Can you have a look? In my understanding the first problem is that when executing the second DELETE on the master, the foreign key is not added to the write set (see jan.test). I then tried the test case from MDEV-23033 and it also crashes (see jan2.test). However, if I use --mysqld=--slave_run_triggers_for_rbr=yes, the same test case does not crash. Since these computed values exist in 10.3 (not sure if they are the reason), I think 10.2 should also be checked.

Comment by Seppo Jaakola [ 2021-01-13 ]

Reproduced this replica node crash with the 10.3 version. As Andrei pointed out, the issue seems to be the same as in MDEV-23033; however, my build has the fix for MDEV-23033, which suggests that either the fix is not effective with Galera replication, or there is another problem.

The debugger shows that, when the replica node is applying the cascaded delete in the testRef table, row_upd_del_mark_clust_rec() skips passing the MySQL table reference to the row_upd_store_row() call:

	if (!row_upd_store_row(node, trx->mysql_thd,
			  thr->prebuilt  && thr->prebuilt->table == node->table
			  ? thr->prebuilt->m_mysql_table : NULL)) {
		err = DB_COMPUTE_VALUE_FAILED;
		return err;
	}
(gdb) p thr->prebuilt->table->name
$30 = {m_name = 0x55ad69199f90 "test/testMain", static part_suffix = "#P#"}
(gdb) p node->table->name
$31 = {m_name = 0x7f4684027830 "test/testRef", static part_suffix = "#P#"}

This leads to an assertion failure three stack levels lower, in innobase_get_computed_value():

    row_upd_store_row(, ... TABLE* mysql_table)
        row_upd_store_v_row(, ... TABLE* mysql_table)
            innobase_get_computed_value(... TABLE* mysql_table, ...)
                ut_ad(mysql_table);

Comment by Seppo Jaakola [ 2021-01-14 ]

Further debugging shows that slave SQL applying differs slightly between async and Galera replication with regard to table prelocking.
Commit 608b0ee52ef3e854ce14a407e64e936adbbeba23 has this patch in log_event.cc:

 
    Rows_log_event::do_apply_event()
...
    else if (!WSREP_ON)
      {
        tables->slave_fk_event_map= new_trg_event_map;
        lex->query_tables_last= &tables->next_global;
      }

slave_fk_event_map is set only for async replication, so a Galera replication slave skips extending the table list in open_tables(), which eventually leads to the assert.

Comment by Jan Lindström (Inactive) [ 2021-01-14 ]

sujatha.sivakumar Why is slave_fk_event_map set only for async replication?

Comment by Sujatha Sivakumar (Inactive) [ 2021-01-14 ]

Hello jplindst

At present 'slave_fk_event_map' is set only for async replication because there is an issue in the Galera code.
To reproduce the Galera issue, please try as shown below.

./mtr galera_sr.galera_sr_multirow_rollback  --mysqld=--slave_run_triggers_for_rbr=yes

The stack trace is provided by Elkin as part of the below-mentioned comment:
https://jira.mariadb.org/browse/MDEV-21153?focusedCommentId=176249&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-176249

Once the Galera issue is fixed, this 'if (!WSREP_ON)' condition can be removed, and the MDEV-23033 fix should
work for both async and Galera replication.

Comment by Jan Lindström (Inactive) [ 2021-01-14 ]

Streaming replication is supported from 10.4 onwards (there could be an issue around this as well), but here we are talking about 10.3.

Comment by Seppo Jaakola [ 2021-01-14 ]

I will submit a PR to fix 10.3 by simply removing the (!WSREP_ON) condition, so that Galera replication is also covered by the fix in MDEV-23033.

I will also submit a separate PR for 10.4, which removes the same WSREP_ON condition and additionally contains a fix for streaming replication, which used corrupt data in THD::LEX::query_table_list in the wsrep system table access phase, resulting in table prelocking when it shouldn't happen. The 10.4 PR will take a little longer due to our internal review.

Comment by Seppo Jaakola [ 2021-01-18 ]

Submitted two PRs, for 10.3 and 10.4.

Generated at Thu Feb 08 09:04:59 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.