[MDEV-25551] Assertion `mode_ == m_local || transaction_.is_streaming()` failed in BF abort Created: 2021-04-28  Updated: 2023-10-16  Resolved: 2021-05-26

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.4.17, 10.4.19
Fix Version/s: 10.6.2, 10.4.20, 10.5.11

Type: Bug Priority: Major
Reporter: Gabor Orosz Assignee: Jan Lindström (Inactive)
Resolution: Fixed Votes: 3
Labels: crash, galera, innodb, replication
Environment:

SUSE Linux Enterprise 15 Service Pack 2


Issue Links:
Relates
relates to MDEV-21181 Automatic invisible primary key Open
relates to MDEV-24001 Implement hidden PK for RBR of no-uni... Open
relates to MDEV-26298 Galera cluster slave abort with Asser... Closed

 Description   

We are experiencing frequent crashes on instances that are trying to replicate a particular sequence of statements:

2021-04-26T15:11:01.993175+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] InnoDB: *** Priority TRANSACTION:
2021-04-26T15:11:01.993335+00:00 db2.mydomain.example mysqld[21482]: TRANSACTION 6470784, ACTIVE 0 sec starting index read
2021-04-26T15:11:01.993459+00:00 db2.mydomain.example mysqld[21482]: mysql tables in use 1, locked 1
2021-04-26T15:11:01.993579+00:00 db2.mydomain.example mysqld[21482]: 1 lock struct(s), heap size 1128, 0 row lock(s)
2021-04-26T15:11:01.993702+00:00 db2.mydomain.example mysqld[21482]: MySQL thread id 12, OS thread handle 140241487787776, query id 7 Update_rows_log_event::find_row(2294786)
2021-04-26T15:11:01.993819+00:00 db2.mydomain.example mysqld[21482]: BINLOG '
2021-04-26T15:11:01.993941+00:00 db2.mydomain.example mysqld[21482]: S8WCYBMBAAAAPwAAAClBBAAAADUAAAAAAAEAB25ldXRyb24AD2FsZW1iaWNfdmVyc2lvbgABDwJg
2021-04-26T15:11:01.994057+00:00 db2.mydomain.example mysqld[21482]: AAC5oiS8
2021-04-26T15:11:01.994171+00:00 db2.mydomain.example mysqld[21482]: S8WCYBgBAAAAPAAAAGVBBAAAADUAAAAAAAEAAf///gtjNmMxMTI5OTJjOf4LNWZmY2VlYmZhZGHQ
2021-04-26T15:11:01.994285+00:00 db2.mydomain.example mysqld[21482]: CaZR
2021-04-26T15:11:01.994404+00:00 db2.mydomain.example mysqld[21482]: '
2021-04-26T15:11:01.994518+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] InnoDB: *** Priority TRANSACTION:
2021-04-26T15:11:01.994631+00:00 db2.mydomain.example mysqld[21482]: TRANSACTION 6470782, ACTIVE 0 sec
2021-04-26T15:11:01.994761+00:00 db2.mydomain.example mysqld[21482]: 2 lock struct(s), heap size 1128, 1 row lock(s), undo log entries 1
2021-04-26T15:11:01.994891+00:00 db2.mydomain.example mysqld[21482]: MySQL thread id 2, OS thread handle 140241761765120, query id 5 committing
2021-04-26T15:11:01.995017+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] InnoDB: *** WAITING FOR THIS LOCK TO BE GRANTED:
2021-04-26T15:11:01.995132+00:00 db2.mydomain.example mysqld[21482]: RECORD LOCKS space id 13788 page no 3 n bits 72 index GEN_CLUST_INDEX of table `neutron`.`alembic_version` trx id 6470782 lock_mode X locks rec but not gap
2021-04-26T15:11:01.995259+00:00 db2.mydomain.example mysqld[21482]: Record lock, heap no 2
2021-04-26T15:11:01.995386+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] InnoDB:  SQL1: BINLOG '
2021-04-26T15:11:01.995504+00:00 db2.mydomain.example mysqld[21482]: S8WCYBMBAAAAPwAAAClBBAAAADUAAAAAAAEAB25ldXRyb24AD2FsZW1iaWNfdmVyc2lvbgABDwJg
2021-04-26T15:11:01.995623+00:00 db2.mydomain.example mysqld[21482]: AAC5oiS8
2021-04-26T15:11:01.995744+00:00 db2.mydomain.example mysqld[21482]: S8WCYBgBAAAAPAAAAGVBBAAAADUAAAAAAAEAAf///gtjNmMxMTI5OTJjOf4LNWZmY2VlYmZhZGHQ
2021-04-26T15:11:01.995857+00:00 db2.mydomain.example mysqld[21482]: CaZR
2021-04-26T15:11:01.995977+00:00 db2.mydomain.example mysqld[21482]: 'K305202`^S^A
2021-04-26T15:11:01.996115+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] InnoDB:  SQL2: NULL
2021-04-26T15:11:01.996231+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] WSREP: cluster conflict due to high priority abort for threads:
2021-04-26T15:11:01.996351+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] WSREP: Winning thread:
2021-04-26T15:11:01.996471+00:00 db2.mydomain.example mysqld[21482]:    THD: 12, mode: high priority, state: exec, conflict: executing, seqno: 2294786
2021-04-26T15:11:01.996584+00:00 db2.mydomain.example mysqld[21482]:    SQL: BINLOG '
2021-04-26T15:11:01.996700+00:00 db2.mydomain.example mysqld[21482]: S8WCYBMBAAAAPwAAAClBBAAAADUAAAAAAAEAB25ldXRyb24AD2FsZW1iaWNfdmVyc2lvbgABDwJg
2021-04-26T15:11:01.996813+00:00 db2.mydomain.example mysqld[21482]: AAC5oiS8
2021-04-26T15:11:01.996932+00:00 db2.mydomain.example mysqld[21482]: S8WCYBgBAAAAPAAAAGVBBAAAADUAAAAAAAEAAf///gtjNmMxMTI5OTJjOf4LNWZmY2VlYmZhZGHQ
2021-04-26T15:11:01.997046+00:00 db2.mydomain.example mysqld[21482]: CaZR
2021-04-26T15:11:01.997166+00:00 db2.mydomain.example mysqld[21482]: 'K305202`^S^A
2021-04-26T15:11:01.997284+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] WSREP: Victim thread:
2021-04-26T15:11:01.997399+00:00 db2.mydomain.example mysqld[21482]:    THD: 2, mode: high priority, state: exec, conflict: committing, seqno: 2294788
2021-04-26T15:11:01.997514+00:00 db2.mydomain.example mysqld[21482]:    SQL: NULL
2021-04-26T15:11:01.997632+00:00 db2.mydomain.example mysqld[21482]: 2021-04-26 15:11:01 12 [Note] WSREP: context: /home/abuild/rpmbuild/BUILD/mariadb-10.4.17/storage/innobase/handler/ha_innodb.cc:18632
2021-04-26T15:11:01.997749+00:00 db2.mydomain.example mysqld[21482]: mysqld: /home/abuild/rpmbuild/BUILD/mariadb-10.4.17/wsrep-lib/include/wsrep/client_state.hpp:603: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
2021-04-26T15:11:01.997864+00:00 db2.mydomain.example mysqld[21482]: 210426 15:11:01 [ERROR] mysqld got signal 6 ;
2021-04-26T15:11:01.997982+00:00 db2.mydomain.example mysqld[21482]: This could be because you hit a bug. It is also possible that this binary
2021-04-26T15:11:01.998098+00:00 db2.mydomain.example mysqld[21482]: or one of the libraries it was linked against is corrupt, improperly built,
2021-04-26T15:11:01.998212+00:00 db2.mydomain.example mysqld[21482]: or misconfigured. This error can also be caused by malfunctioning hardware.
2021-04-26T15:11:01.998327+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:01.998441+00:00 db2.mydomain.example mysqld[21482]: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
2021-04-26T15:11:01.998555+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:01.998669+00:00 db2.mydomain.example mysqld[21482]: We will try our best to scrape up some info that will hopefully help
2021-04-26T15:11:01.998788+00:00 db2.mydomain.example mysqld[21482]: diagnose the problem, but since we have already crashed,
2021-04-26T15:11:01.998903+00:00 db2.mydomain.example mysqld[21482]: something is definitely wrong and this may fail.
2021-04-26T15:11:01.999016+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:01.999130+00:00 db2.mydomain.example mysqld[21482]: Server version: 10.4.17-MariaDB
2021-04-26T15:11:01.999243+00:00 db2.mydomain.example mysqld[21482]: key_buffer_size=67108864
2021-04-26T15:11:01.999357+00:00 db2.mydomain.example mysqld[21482]: read_buffer_size=131072
2021-04-26T15:11:01.999470+00:00 db2.mydomain.example mysqld[21482]: max_used_connections=0
2021-04-26T15:11:01.999590+00:00 db2.mydomain.example mysqld[21482]: max_threads=10002
2021-04-26T15:11:01.999705+00:00 db2.mydomain.example mysqld[21482]: thread_count=11
2021-04-26T15:11:01.999819+00:00 db2.mydomain.example mysqld[21482]: It is possible that mysqld could use up to
2021-04-26T15:11:01.999932+00:00 db2.mydomain.example mysqld[21482]: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22077178 K  bytes of memory
2021-04-26T15:11:02.000064+00:00 db2.mydomain.example mysqld[21482]: Hope that's ok; if not, decrease some variables in the equation.
2021-04-26T15:11:02.000181+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:02.000294+00:00 db2.mydomain.example mysqld[21482]: Thread pointer: 0x7f89e4000bf8
2021-04-26T15:11:02.000409+00:00 db2.mydomain.example mysqld[21482]: Attempting backtrace. You can use the following information to find out
2021-04-26T15:11:02.000548+00:00 db2.mydomain.example mysqld[21482]: where mysqld died. If you see no messages after this, something went
2021-04-26T15:11:02.000664+00:00 db2.mydomain.example mysqld[21482]: terribly wrong...
2021-04-26T15:11:02.000779+00:00 db2.mydomain.example mysqld[21482]: stack_bottom = 0x7f8c840fed08 thread_stack 0x49000
2021-04-26T15:11:02.014312+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(my_print_stacktrace+0x3c)[0x55fcbd6027cc]
2021-04-26T15:11:02.014890+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(handle_fatal_signal+0x535)[0x55fcbd119f05]
2021-04-26T15:11:02.028285+00:00 db2.mydomain.example mysqld[21482]: /lib64/libpthread.so.0(+0x132d0)[0x7f8c99aab2d0]
2021-04-26T15:11:02.041285+00:00 db2.mydomain.example mysqld[21482]: /lib64/libc.so.6(gsignal+0x110)[0x7f8c981dc520]
2021-04-26T15:11:02.041518+00:00 db2.mydomain.example mysqld[21482]: /lib64/libc.so.6(abort+0x151)[0x7f8c981ddb01]
2021-04-26T15:11:02.041640+00:00 db2.mydomain.example mysqld[21482]: /lib64/libc.so.6(+0x31b1a)[0x7f8c981d4b1a]
2021-04-26T15:11:02.041765+00:00 db2.mydomain.example mysqld[21482]: /lib64/libc.so.6(+0x31b92)[0x7f8c981d4b92]
2021-04-26T15:11:02.055417+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_Z14wsrep_bf_abortPK3THDPS_+0x4a7)[0x55fcbd090a27]
2021-04-26T15:11:02.055927+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(wsrep_thd_bf_abort+0x2c)[0x55fcbd097efc]
2021-04-26T15:11:02.056560+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xa1c4d8)[0x55fcbd2ad4d8]
2021-04-26T15:11:02.057158+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xa5bf28)[0x55fcbd2ecf28]
2021-04-26T15:11:02.057757+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xa5fe22)[0x55fcbd2f0e22]
2021-04-26T15:11:02.058374+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xa659f1)[0x55fcbd2f69f1]
2021-04-26T15:11:02.059071+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xaf071f)[0x55fcbd38171f]
2021-04-26T15:11:02.059626+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xaf5e8d)[0x55fcbd386e8d]
2021-04-26T15:11:02.060132+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xa25033)[0x55fcbd2b6033]
2021-04-26T15:11:02.060786+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xa253cb)[0x55fcbd2b63cb]
2021-04-26T15:11:02.061444+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_ZN7handler11ha_rnd_nextEPh+0x13f)[0x55fcbd11e9bf]
2021-04-26T15:11:02.062070+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_ZN14Rows_log_event8find_rowEP14rpl_group_info+0x4be)[0x55fcbd21152e]
2021-04-26T15:11:02.062778+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0xd7)[0x55fcbd211ab7]
2021-04-26T15:11:02.063406+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x23f)[0x55fcbd204f8f]
2021-04-26T15:11:02.063950+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_Z18wsrep_apply_eventsP3THDP14Relay_log_infoPKvm+0x1d4)[0x55fcbd08f544]
2021-04-26T15:11:02.064562+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_ZN21Wsrep_applier_service15apply_write_setERKN5wsrep7ws_metaERKNS0_12const_bufferERNS0_14mutable_bufferE+0x9a)[0x55fcbd07846a]
2021-04-26T15:11:02.065161+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xdde375)[0x55fcbd66f375]
2021-04-26T15:11:02.065811+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_ZN5wsrep12server_state8on_applyERNS_21high_priority_serviceERKNS_9ws_handleERKNS_7ws_metaERKNS_12const_bufferE+0x195)[0x55fcbd66fde5]
2021-04-26T15:11:02.066320+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0xdf4749)[0x55fcbd685749]
2021-04-26T15:11:02.078445+00:00 db2.mydomain.example mysqld[21482]: /usr/lib64/galera-4/libgalera_smm.so(+0x19f93e)[0x7f8c9693893e]
2021-04-26T15:11:02.078649+00:00 db2.mydomain.example mysqld[21482]: /usr/lib64/galera-4/libgalera_smm.so(+0x1f0f48)[0x7f8c96989f48]
2021-04-26T15:11:02.078796+00:00 db2.mydomain.example mysqld[21482]: /usr/lib64/galera-4/libgalera_smm.so(+0x2048f2)[0x7f8c9699d8f2]
2021-04-26T15:11:02.078909+00:00 db2.mydomain.example mysqld[21482]: /usr/lib64/galera-4/libgalera_smm.so(+0x207347)[0x7f8c969a0347]
2021-04-26T15:11:02.079024+00:00 db2.mydomain.example mysqld[21482]: /usr/lib64/galera-4/libgalera_smm.so(+0x1f3a23)[0x7f8c9698ca23]
2021-04-26T15:11:02.079150+00:00 db2.mydomain.example mysqld[21482]: /usr/lib64/galera-4/libgalera_smm.so(+0x2119bb)[0x7f8c969aa9bb]
2021-04-26T15:11:02.091078+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0xe)[0x55fcbd688b2e]
2021-04-26T15:11:02.091596+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(+0x8004db)[0x55fcbd0914db]
2021-04-26T15:11:02.092123+00:00 db2.mydomain.example mysqld[21482]: /usr/sbin/mysqld(_Z15start_wsrep_THDPv+0x439)[0x55fcbd0821a9]
2021-04-26T15:11:02.103934+00:00 db2.mydomain.example mysqld[21482]: /lib64/libpthread.so.0(+0x84f9)[0x7f8c99aa04f9]
2021-04-26T15:11:02.115539+00:00 db2.mydomain.example mysqld[21482]: /lib64/libc.so.6(clone+0x3f)[0x7f8c9829efbf]
2021-04-26T15:11:02.115757+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:02.115906+00:00 db2.mydomain.example mysqld[21482]: Trying to get some variables.
2021-04-26T15:11:02.116042+00:00 db2.mydomain.example mysqld[21482]: Some pointers may be invalid and cause the dump to abort.
2021-04-26T15:11:02.116164+00:00 db2.mydomain.example mysqld[21482]: Query (0x7f8c8945f273): BINLOG '
2021-04-26T15:11:02.116289+00:00 db2.mydomain.example mysqld[21482]: S8WCYBMBAAAAPwAAAClBBAAAADUAAAAAAAEAB25ldXRyb24AD2FsZW1iaWNfdmVyc2lvbgABDwJg
2021-04-26T15:11:02.116440+00:00 db2.mydomain.example mysqld[21482]: AAC5oiS8
2021-04-26T15:11:02.116549+00:00 db2.mydomain.example mysqld[21482]: S8WCYBgBAAAAPAAAAGVBBAAAADUAAAAAAAEAAf///gtjNmMxMTI5OTJjOf4LNWZmY2VlYmZhZGHQ
2021-04-26T15:11:02.116666+00:00 db2.mydomain.example mysqld[21482]: CaZR
2021-04-26T15:11:02.116773+00:00 db2.mydomain.example mysqld[21482]: '
2021-04-26T15:11:02.116906+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:02.117014+00:00 db2.mydomain.example mysqld[21482]: Connection ID (thread ID): 12
2021-04-26T15:11:02.117121+00:00 db2.mydomain.example mysqld[21482]: Status: NOT_KILLED
2021-04-26T15:11:02.117235+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:02.117352+00:00 db2.mydomain.example mysqld[21482]: Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=o>
2021-04-26T15:11:02.117466+00:00 db2.mydomain.example mysqld[21482]:
2021-04-26T15:11:02.117586+00:00 db2.mydomain.example mysqld[21482]: The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
2021-04-26T15:11:02.117706+00:00 db2.mydomain.example mysqld[21482]: information that should help you find out what is causing the crash.
2021-04-26T15:11:02.117818+00:00 db2.mydomain.example mysqld[21482]: Writing a core file...
2021-04-26T15:11:02.117926+00:00 db2.mydomain.example mysqld[21482]: Working directory at /var/lib/mysql
2021-04-26T15:11:02.118041+00:00 db2.mydomain.example mysqld[21482]: Resource Limits:
2021-04-26T15:11:02.118153+00:00 db2.mydomain.example mysqld[21482]: Limit                     Soft Limit           Hard Limit           Units
2021-04-26T15:11:02.118262+00:00 db2.mydomain.example mysqld[21482]: Max cpu time              unlimited            unlimited            seconds
2021-04-26T15:11:02.118370+00:00 db2.mydomain.example mysqld[21482]: Max file size             unlimited            unlimited            bytes
2021-04-26T15:11:02.118491+00:00 db2.mydomain.example mysqld[21482]: Max data size             unlimited            unlimited            bytes
2021-04-26T15:11:02.118619+00:00 db2.mydomain.example mysqld[21482]: Max stack size            8388608              unlimited            bytes
2021-04-26T15:11:02.118754+00:00 db2.mydomain.example mysqld[21482]: Max core file size        unlimited            unlimited            bytes
2021-04-26T15:11:02.118872+00:00 db2.mydomain.example mysqld[21482]: Max resident set          unlimited            unlimited            bytes
2021-04-26T15:11:02.118987+00:00 db2.mydomain.example mysqld[21482]: Max processes             unlimited            unlimited            processes
2021-04-26T15:11:02.119100+00:00 db2.mydomain.example mysqld[21482]: Max open files            1048576              1048576              files
2021-04-26T15:11:02.119234+00:00 db2.mydomain.example mysqld[21482]: Max locked memory         65536                65536                bytes
2021-04-26T15:11:02.119357+00:00 db2.mydomain.example mysqld[21482]: Max address space         unlimited            unlimited            bytes
2021-04-26T15:11:02.119479+00:00 db2.mydomain.example mysqld[21482]: Max file locks            unlimited            unlimited            locks
2021-04-26T15:11:02.119600+00:00 db2.mydomain.example mysqld[21482]: Max pending signals       125259               125259               signals
2021-04-26T15:11:02.119794+00:00 db2.mydomain.example mysqld[21482]: Max msgqueue size         819200               819200               bytes
2021-04-26T15:11:02.119917+00:00 db2.mydomain.example mysqld[21482]: Max nice priority         0                    0
2021-04-26T15:11:02.120061+00:00 db2.mydomain.example mysqld[21482]: Max realtime priority     0                    0
2021-04-26T15:11:02.120329+00:00 db2.mydomain.example mysqld[21482]: Max realtime timeout      unlimited            unlimited            us
2021-04-26T15:11:02.120448+00:00 db2.mydomain.example mysqld[21482]: Core pattern: |/usr/sbin/process-dump coredump %h core.%h.%t.%e.%p.gz

We are using 10.4.17 with MDEV-23851 MDEV-24229 BF-BF conflict issues, and MDEV-23851 BF-BF Conflict issue because of UK GAP locks patches applied.

We are aware of MDEV-24923 and created a build from 10.4 (https://github.com/MariaDB/server/commit/ee455e6f2e6c9cda921c0801210786123d9f8b95) but this initial crash is still reproducible. However, the subsequent crashes during the recovery phase, after restarting the failed instance seems to be fixed.
I managed to isolate and trigger the fault with the following simple SQL statements:

CREATE TABLE t1 (f1 VARCHAR(32) NOT NULL) ENGINE=InnoDB;
INSERT INTO t1 (f1) VALUES ('0e66c5227a8a');
INSERT INTO t1 (f1) VALUES ('c6c112992c9');
 
START TRANSACTION;
UPDATE t1 SET f1='5ffceebfada' WHERE t1.f1 = 'c6c112992c9';
COMMIT;
 
START TRANSACTION;
DELETE FROM t1 WHERE t1.f1 = '5ffceebfada';
COMMIT;
 
DROP TABLE t1;

Sometimes it needs couple of thousand executions to observe the fault.

My theory about the fault is that in some cases it is possible that an applier thread start processing the transaction with the DELETE statement before another applier managed to execute the UPDATE one.
When this latter thread reaches the actual UPDATE statement, it tries to aquire a lock that is being held by the transaction executed in the former thread which is at COMMIT stage and that causes a lock conflict between these two priority transactions. The lock holder transaction got aborted (I guess due to the higher sequence number) but instead of doing an expected rollback, wait and replay attempt it initiates a controlled crash because the assertion criteria hasn't met.

Right now, I'm trying to get familiar with mtr to implement this aforementioned scenario in a test case to have a quick and stable reproducer.



 Comments   
Comment by Gabor Orosz [ 2021-04-30 ]

Hi,

After doing some close inspection on the core dump and experimenting with mtr, I managed to come up with a reproducer test case:

--source include/galera_cluster.inc
--source include/have_innodb.inc
--source include/have_debug_sync.inc
--source include/galera_have_debug_sync.inc
 
# Setup
CREATE TABLE t1 (f1 VARCHAR(32) NOT NULL) ENGINE=InnoDB;
INSERT INTO t1 (f1) VALUES ('0e66c5227a8a');
INSERT INTO t1 (f1) VALUES ('c6c112992c9');
 
--connection node_2
--let $wait_condition = SELECT COUNT(*)=2 FROM t1;
--source include/wait_condition.inc
 
# Ensure that we have enough applier threads to process transactions in parallel
SET GLOBAL wsrep_slave_threads = 2;
 
# Set up a synchronization point to catch the first transaction
--let $galera_sync_point = apply_monitor_slave_enter_sync
--source include/galera_set_sync_point.inc
 
--connection node_1
# Invoke the first transaction
START TRANSACTION;
UPDATE t1 SET f1='5ffceebfada' WHERE t1.f1 = 'c6c112992c9';
COMMIT;
 
--connection node_2
# Wait for the first transaction to appear
SET SESSION wsrep_sync_wait = 0;
--let $galera_sync_point = apply_monitor_slave_enter_sync
--source include/galera_wait_sync_point.inc
 
# Clear the previous syncronizaton point and setup a different one for the
# second transaction
--source include/galera_clear_sync_point.inc
SET GLOBAL DEBUG_DBUG = "d,sync.wsrep_apply_cb";
 
--connection node_1
# Invoke the second transaction
START TRANSACTION;
UPDATE t1 SET f1='4ffceebfcdc' WHERE t1.f1 = '0e66c5227a8a';
COMMIT;
 
--connection node_2
# Wait for the second transaction to appear
SET SESSION DEBUG_SYNC = "now WAIT_FOR sync.wsrep_apply_cb_reached";
 
# Remove syncronization point
SET GLOBAL DEBUG_DBUG = "";
 
# Setup an additional syncronization point for the second transaction
# in commit stage
--let $galera_sync_point = commit_monitor_slave_enter_sync
--source include/galera_set_sync_point.inc
 
# Let the second transaction to proceed
SET DEBUG_SYNC = "now SIGNAL signal.wsrep_apply_cb";
 
# Wait for the second transaction to obtain lock
--let $galera_sync_point = apply_monitor_slave_enter_sync commit_monitor_slave_enter_sync
--source include/galera_wait_sync_point.inc
 
# Now, the second transaction is ahead of the first one
 
# Clear the syncronization point
--source include/galera_clear_sync_point.inc
 
# Let the first transaction to proceed
--let $galera_sync_point = apply_monitor_slave_enter_sync
--source include/galera_signal_sync_point.inc
 
# Crash occurs at this point
# int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
 
# Teardown [incomplete...]
--connection node_1
SET GLOBAL wsrep_slave_threads = DEFAULT;
 
DROP TABLE t1;

I'll try to fiddle with the code, but I would really appreciate if someone could take a look on this one.

Best regards,
GOro

Comment by Gabor Orosz [ 2021-04-30 ]

Hi,

Executed the above mentioned test case on the 10.4 branch head (0cc811c633d1fe5290b10fa49fad0a4b889383fa) and it crashes at a different point:

mysqld: /var/lib/src/mariadb/sql/service_wsrep.cc:271: my_bool wsrep_thd_is_aborting(const THD*): Assertion `(&(&thd->LOCK_thd_data)->m_mutex)->count > 0 && pthread_equal(pthread_self(), (&(&thd->LOCK_thd_data)->m_mutex)->thread)' failed.

Best regards,
GOro

Comment by Gabor Orosz [ 2021-05-03 ]

@jplindst, sorry for the mention. I saw that you did quite a lot of changes on this area recently that also shifted the actual assertion failure to a different place. Could you please take a look on this one and comment?
Thanks for your help in advance!

Best regards,
GOro

Comment by Gabor Orosz [ 2021-05-05 ]

Hi,

Just to summarize my findings, I think there are basically two scenarios that can lead to this kind of crashes:

  • The transaction with a lower sequence number holding the lock, while the other transaction with a higher sequence number tries to acquire lock: This is a transient BF-BF lock conflict as the transactions are in the right order and the lock holder will eventually release it after a while. Thus, the transaction that is requesting the lock just have to wait for a bit. This issue is present in 10.4.17 and a fix for that will be released in 10.4.19.
  • The transaction with a higher sequence number holding the lock, while the other transaction with a lower sequence number tries to acquire lock: That is what my test case tries to extort. For me, this seems to be an out of order committing case, and to my understanding that should not happen as Galera uses strict total order committing by default. I tried to set it explicitly, but it does not help. The only workaround that has been proven to be effective is to reduce the number of applier thread (wsrep_slave_threads) count to 1. However, I haven't done any measurement as of yet, so the performance impact isn't known.
    Considering the fact that Galera gives option to enable OOOC, my expectation then is that MariaDB should be capable to tackle with such situation, otherwise this needs to be listed as a limitation.

Best regards,
GOro

Comment by Seppo Jaakola [ 2021-05-07 ]

Hi Gabor TheGOro , thanks for the mtr test, nice work there, much appreciated.

I can reproduce the issue with this mtr test. The cause for the issue may be related to the fact that the test table does not have primary key. When I a change the test so that the test table has separate primary key, or if I declare the f1 as primary key, then the crash does not happen anymore, e.g. by this change in the test:

CREATE TABLE t1 (f1 VARCHAR(32) PRIMARY KEY NOT NULL) ENGINE=InnoDB;

From the error log sample of this issue description, it looks probable the the affected table (neutron.alembic_version) also misses primary key. Is this the correct, and is this the case with all previous crashes as well?

The test executes with wsrep_certify_nopk = ON, but it does not help here, because the certification key hashes calculated over the two replicated update rows do not conflict.

-seppo

Comment by Gabor Orosz [ 2021-05-07 ]

Hi Seppo,

Thanks for your response! Yes, the absence of the PK on the test table was intentional as Alembic (migration tool for the SQLAlchemy ORM library) does not use it either. I did some research on the topic, and it turned out that this was a known issue/deficiency in earlier versions of Alembic and it got fixed:

I'll do some testing with the modified table declaration to see that this affects only just the second (out of order committing) case or both of them, to decide whether we can stay on 10.4.17 (with the aforementioned patches) for a while or we have to ask our vendor for MDEV-24923 backport.

Best regards,
GOro

Comment by Gabor Orosz [ 2021-05-12 ]

Hi Seppo,

We have concluded our testing and verification using the updated Alembic package and also with our separate reproducer suite that got aligned. The results are good and we haven't observed any problems.
I guess the test case that I wrote makes little sense, as this scenario couldn't be considered as a supported one. Thus, I'm not trying to prep it up and push it as a contribution. Anyways, thanks for your help and guidance!
Feel free to close this ticket.

Best regards,
GOro

Comment by Agbo Steven [ 2021-05-21 ]

Hi,

FYI, we got this error on 2 differents node from 2 differents cluster:

Error :

mysqld: /home/buildbot/buildbot/build/mariadb-10.4.19/wsrep-lib/include/wsrep/client_state.hpp:668: int wsrep::client_state::bf_abort(wsrep::seqno): Assertion `mode_ == m_local || transaction_.is_streaming()' failed.
210519 10:12:07 [ERROR] mysqld got signal 6 ;

OS:

Debian buster 10.8

Version mariadb/galera-4 :

mariadb-server => 1:10.4.19+maria~buster / galera-4 => 26.4.8-buster

I don't know if I should create a new issue but this error looks very similar.

Someone else experiencing this error ?

Comment by Seppo Jaakola [ 2021-05-21 ]

The original problem was diagnosed to be a result of using a table with no primary key. As this is not recommended use case, the priority of this issue was lowered. However, I have a fix for supporting also tables with no primary key (in this scenario), and it will go to code review soon.
Please, take a look if your use case also is using table(s) without primary key.

Comment by Marko Mäkelä [ 2021-05-25 ]

MDEV-21181 and later MDEV-24001 have been filed for automatically creating a hidden primary key in the SQL layer if the user provided none. That would prevent InnoDB from creating a hidden primary key on its own. The InnoDB GEN_CLUST_INDEX(DB_ROW_ID) is not visible in the SQL layer except as a 48-bit key. The InnoDB internal column DB_ROW_ID is only materialized if no PRIMARY KEY and no UNIQUE NOT NULL index is available.

As I noted in MDEV-24001, row-based replication of an UPDATE of all n rows of a table that lacks a primary key should explode to n table scans, visiting n² rows in total.

In my opinion, ultimately this problem needs to be solved in the SQL layer.

Comment by Andrey Khizhnyakov [ 2023-10-16 ]

Colleagues, good afternoon!
This error is present in the configuration 10.5.19-MariaDB galera-4-26.4.14-1.el8.x86_64.
wsrep_slave_threads = 10.
I experienced the same errors in version 10.4.22. Unfortunately, the option with wsrep_slave_threads = 1 is not suitable, because the cluster performance is reduced to 0.
Setting cert.optimistic_pa=no solves the problem, but does not solve the performance issue.
All tables have primary keys.

Generated at Thu Feb 08 09:38:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.