[MDEV-32257] Assertion `thd->is_killed()' failed in XID_cache_element::acquired_to_recovered (opt) or acquire_xid (dbg) on XA ROLLBACK or XA COMMIT Created: 2023-09-26  Updated: 2023-10-16  Resolved: 2023-09-29

Status: Closed
Project: MariaDB Server
Component/s: Replication, Storage Engine - InnoDB
Affects Version/s: N/A
Fix Version/s: 10.6.16

Type: Bug Priority: Critical
Reporter: Roel Van de Paar Assignee: Andrei Elkin
Resolution: Fixed Votes: 0
Labels: XA

Issue Links:
Blocks
blocks MDEV-31949 slow parallel replication of user xa In Review
is blocked by MDEV-32272 lock_release_on_prepare_try() does no... Closed
Problem/Incident
is caused by MDEV-31949 slow parallel replication of user xa In Review
Relates
relates to MDEV-30941 ASAN: heap-use-after-free in std::__a... Confirmed

 Description   

When the following SQL is executed in a binlogging server:

SET pseudo_slave_mode=1;
XA START 'a';
XA END 'a';
XA PREPARE 'a';
XA ROLLBACK 'a';

The server will log into the binary log only the XA ROLLBACK part:

| master-bin.000001 |   4 | Format_desc       |         1 |         256 | Server ver: 10.6.16-MariaDB-debug-log, Binlog ver: 4 |
| master-bin.000001 | 256 | Start_encryption  |         1 |         296 |                                                      |
| master-bin.000001 | 296 | Gtid_list         |         1 |         325 | []                                                   |
| master-bin.000001 | 325 | Binlog_checkpoint |         1 |         369 | master-bin.000001                                    |
| master-bin.000001 | 369 | Gtid              |         1 |         412 | GTID 0-1-1                                           |
| master-bin.000001 | 412 | Query             |         1 |         502 | XA ROLLBACK X'61',X'',1   

Which should not been done. This type of read-only XA should not produce any events, as well as its even prepared XID should not be recovered after reconnect or the server crash-recovery.

Previous info: this also leads to a crash on the slave in a standard master/slave setup:

10.6.14-9-enterprise-MDEV-30165-30619-31949 57b39cb8e1e4b374e0e138b286a0f9cb0669e401 (Optimized)

2023-09-26 13:17:23 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './relaylog.000001' position: 4; GTID position ''
2023-09-26 13:18:01 69 [ERROR] XA COMMIT of GTID 0-1-%could not complete after having been logged into binary log

10.6.14-9-enterprise-MDEV-30165-30619-31949 57b39cb8e1e4b374e0e138b286a0f9cb0669e401 (Optimized)

Core was generated by `/test/PATCH_EMD260923-mariadb-10.6.14-9-linux-x86_64-opt/bin/mariadbd --no-defa'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  std::__atomic_base<int>::fetch_or (__m=std::memory_order_relaxed,
    __i=536870912, this=0x0) at /usr/include/c++/11/bits/atomic_base.h:646
646         fetch_or(__int_type __i,
[Current thread is 1 (Thread 0x154f3d4b1640 (LWP 1079862))]
(gdb) bt
#0  std::__atomic_base<int>::fetch_or (__m=std::memory_order_relaxed, __i=536870912, this=0x0) at /usr/include/c++/11/bits/atomic_base.h:646
#1  XID_cache_element::acquired_to_recovered (this=0x0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/xa.cc:119
#2  trans_xa_rollback (thd=thd@entry=0x154e08000c68) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/xa.cc:1081
#3  0x00005577e38265e4 in mysql_execute_command (thd=0x154e08000c68, is_called_from_prepared_stmt=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/sql_parse.cc:5897
#4  0x00005577e3829156 in mysql_parse (thd=0x154e08000c68, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x154f3d4b08f0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/sql_parse.cc:8063
#5  0x00005577e3bacac3 in Query_log_event::do_apply_event (this=0x154f0005e648, rgi=0x154f0005d1d0, query_arg=0x154f0005e7b3 "XA ROLLBACK X'61',X'',1", q_len_arg=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/sql_class.h:242
#6  0x00005577e375e086 in Log_event::apply_event (rgi=0x154f0005d1d0, this=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/log_event.h:1499
#7  apply_event_and_update_pos_apply (ev=ev@entry=0x154f0005e648, thd=thd@entry=0x154e08000c68, rgi=rgi@entry=0x154f0005d1d0, reason=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/slave.cc:3892
#8  0x00005577e3764962 in apply_event_and_update_pos_for_parallel (ev=ev@entry=0x154f0005e648, thd=thd@entry=0x154e08000c68, rgi=rgi@entry=0x154f0005d1d0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/slave.cc:4078
#9  0x00005577e39b4148 in rpt_handle_event (qev=0x154f0005cd78, rpt=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/rpl_parallel.cc:134
#10 0x00005577e39b874c in handle_rpl_parallel_thread (arg=arg@entry=0x154f000233b8) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/rpl_parallel.cc:1476
#11 0x00005577e3cf5e91 in pfs_spawn_thread (arg=0x154f000349d8) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/storage/perfschema/pfs.cc:2201
#12 0x0000154f63094b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#13 0x0000154f63126a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Debug:

10.6.14-9-enterprise-MDEV-30165-30619-31949 57b39cb8e1e4b374e0e138b286a0f9cb0669e401 (Debug)

2023-09-26 13:11:31 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './relaylog.000001' position: 4; GTID position ''
mariadbd: /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc:2043: bool acquire_xid(THD*): Assertion `thd->is_killed()' failed.

10.6.14-9-enterprise-MDEV-30165-30619-31949 57b39cb8e1e4b374e0e138b286a0f9cb0669e401 (Debug)

Core was generated by `/test/PATCH_EMD260923-mariadb-10.6.14-9-linux-x86_64-dbg/bin/mariadbd --no-defa'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22889389626944)
    at ./nptl/pthread_kill.c:44
[Current thread is 1 (Thread 0x14d159f01640 (LWP 3999773))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22889389626944) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=22889389626944) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=22889389626944, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000014d17f442476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000014d17f4287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000014d17f42871b in __assert_fail_base (fmt=0x14d17f5dd150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55d688113b8d "thd->is_killed()", file=0x55d688114498 "/test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc", line=2043, function=<optimized out>) at ./assert/assert.c:92
#6  0x000014d17f439e96 in __GI___assert_fail (assertion=0x55d688113b8d "thd->is_killed()", file=0x55d688114498 "/test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc", line=2043, function=0x55d688113b9e "bool acquire_xid(THD*)") at ./assert/assert.c:101
#7  0x000055d6877845ac in acquire_xid (thd=thd@entry=0x14cff8000f98) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc:2043
#8  0x000055d687798a03 in binlog_rollback_by_xid (hton=0x55d6893c0988, xid=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc:2130
#9  0x000055d68764a3e9 in ha_commit_or_rollback_by_xid (xid=<optimized out>, commit=commit@entry=false) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/handler.cc:2355
#10 0x000055d6875b28a5 in wsrep_ha_commit_or_rollback_by_xid (thd=thd@entry=0x14cff8000f98, commit=commit@entry=false) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/xa.cc:1445
#11 0x000055d6875b5e54 in trans_xa_rollback (thd=thd@entry=0x14cff8000f98) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/xa.cc:1063
#12 0x000055d68737c134 in mysql_execute_command (thd=thd@entry=0x14cff8000f98, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/sql_parse.cc:5897
#13 0x000055d68737e1f3 in mysql_parse (thd=0x14cff8000f98, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x14d159f007f0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/sql_parse.cc:8063
#14 0x000055d6877b50b6 in Query_log_event::do_apply_event (this=0x14d10006bc98, rgi=0x14d100066ab0, query_arg=0x14d10006be03 "XA ROLLBACK X'61',X'',1", q_len_arg=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/sql_class.h:242
#15 0x000055d6877b5a0c in Query_log_event::do_apply_event (this=<optimized out>, rgi=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log_event_server.cc:1605
#16 0x000055d6872a0714 in Log_event::apply_event (rgi=0x14d100066ab0, this=0x14d10006bc98) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log_event.h:1499
#17 apply_event_and_update_pos_apply (ev=ev@entry=0x14d10006bc98, thd=thd@entry=0x14cff8000f98, rgi=rgi@entry=0x14d100066ab0, reason=reason@entry=0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/slave.cc:3892
#18 0x000055d6872aa7d9 in apply_event_and_update_pos_for_parallel (ev=ev@entry=0x14d10006bc98, thd=thd@entry=0x14cff8000f98, rgi=rgi@entry=0x14d100066ab0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/slave.cc:4078
#19 0x000055d687555383 in rpt_handle_event (qev=qev@entry=0x14d100066658, rpt=rpt@entry=0x14d100027e50) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/rpl_parallel.cc:134
#20 0x000055d687559b08 in handle_rpl_parallel_thread (arg=arg@entry=0x14d100027e50) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/rpl_parallel.cc:1476
#21 0x000055d687919e72 in pfs_spawn_thread (arg=0x14d100039578) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/storage/perfschema/pfs.cc:2201
#22 0x000014d17f494b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#23 0x000014d17f526a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81



 Comments   
Comment by Roel Van de Paar [ 2023-09-26 ]

This testcase:

XA START 'a';
SET pseudo_slave_mode=1;
XA END 'a';
XA PREPARE 'a';
XA COMMIT 'a';

Leads to additional slightly different stacks on both opt and dbg builds:

bb-10.6-MDEV-31949 e9e38f741bb69db1a872854592e2f2c4c23255ce (Optimized)

2023-09-26 13:31:55 5 [Note] Slave I/O thread: Start asynchronous replication to master 'repl_user@127.0.0.1:12223' in log '' at position 4
2023-09-26 13:31:55 5 [Note] Slave I/O thread: connected to master 'repl_user@127.0.0.1:12223',replication starts at GTID position ''
2023-09-26 13:31:55 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './relaylog.000001' position: 4; GTID position ''
2023-09-26 13:32:30 69 [ERROR] XA COMMIT of GTID 0-1-%could not complete after having been logged into binary log

bb-10.6-MDEV-31949 e9e38f741bb69db1a872854592e2f2c4c23255ce (Optimized)

Core was generated by `/test/PATCH_EMD260923-mariadb-10.6.14-9-linux-x86_64-opt/bin/mariadbd --no-defa'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  std::__atomic_base<int>::fetch_or (__m=std::memory_order_relaxed, 
    __i=536870912, this=0x0) at /usr/include/c++/11/bits/atomic_base.h:646
646	      fetch_or(__int_type __i,
[Current thread is 1 (Thread 0x1491985d5640 (LWP 4149002))]
(gdb) bt
#0  std::__atomic_base<int>::fetch_or (__m=std::memory_order_relaxed, __i=536870912, this=0x0) at /usr/include/c++/11/bits/atomic_base.h:646
#1  XID_cache_element::acquired_to_recovered (this=0x0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/xa.cc:119
#2  trans_xa_commit (thd=thd@entry=0x149024000c68) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/xa.cc:862
#3  0x0000564b748a2622 in mysql_execute_command (thd=0x149024000c68, is_called_from_prepared_stmt=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/sql_parse.cc:5880
#4  0x0000564b748a5156 in mysql_parse (thd=0x149024000c68, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x1491985d48f0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/sql_parse.cc:8063
#5  0x0000564b74c28ac3 in Query_log_event::do_apply_event (this=0x14913805e648, rgi=0x14913805d1d0, query_arg=0x14913805e7b3 "XA COMMIT X'61',X'',1", q_len_arg=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/sql_class.h:242
#6  0x0000564b747da086 in Log_event::apply_event (rgi=0x14913805d1d0, this=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/log_event.h:1499
#7  apply_event_and_update_pos_apply (ev=ev@entry=0x14913805e648, thd=thd@entry=0x149024000c68, rgi=rgi@entry=0x14913805d1d0, reason=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/slave.cc:3892
#8  0x0000564b747e0962 in apply_event_and_update_pos_for_parallel (ev=ev@entry=0x14913805e648, thd=thd@entry=0x149024000c68, rgi=rgi@entry=0x14913805d1d0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/slave.cc:4078
#9  0x0000564b74a30148 in rpt_handle_event (qev=0x14913805cd78, rpt=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/rpl_parallel.cc:134
#10 0x0000564b74a3474c in handle_rpl_parallel_thread (arg=arg@entry=0x1491380233b8) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/sql/rpl_parallel.cc:1476
#11 0x0000564b74d71e91 in pfs_spawn_thread (arg=0x1491380349d8) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_opt/storage/perfschema/pfs.cc:2201
#12 0x00001491bf494b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#13 0x00001491bf526a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

bb-10.6-MDEV-31949 e9e38f741bb69db1a872854592e2f2c4c23255ce (Debug)

2023-09-26 13:31:49 5 [Note] Slave I/O thread: Start asynchronous replication to master 'repl_user@127.0.0.1:12001' in log '' at position 4
2023-09-26 13:31:49 5 [Note] Slave I/O thread: connected to master 'repl_user@127.0.0.1:12001',replication starts at GTID position ''
2023-09-26 13:31:49 6 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log './relaylog.000001' position: 4; GTID position ''
mariadbd: /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc:2043: bool acquire_xid(THD*): Assertion `thd->is_killed()' failed.

bb-10.6-MDEV-31949 e9e38f741bb69db1a872854592e2f2c4c23255ce (Debug)

Core was generated by `/test/PATCH_EMD260923-mariadb-10.6.14-9-linux-x86_64-dbg/bin/mariadbd --no-defa'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22745705637440)
    at ./nptl/pthread_kill.c:44
[Current thread is 1 (Thread 0x14afe5b46640 (LWP 4108447))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22745705637440) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=22745705637440) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=22745705637440, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000014b00ac42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000014b00ac287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000014b00ac2871b in __assert_fail_base (fmt=0x14b00addd150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x55f13c2a2b8d "thd->is_killed()", file=0x55f13c2a3498 "/test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc", line=2043, function=<optimized out>) at ./assert/assert.c:92
#6  0x000014b00ac39e96 in __GI___assert_fail (assertion=0x55f13c2a2b8d "thd->is_killed()", file=0x55f13c2a3498 "/test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc", line=2043, function=0x55f13c2a2b9e "bool acquire_xid(THD*)") at ./assert/assert.c:101
#7  0x000055f13b9135ac in acquire_xid (thd=thd@entry=0x14ae74000f98) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc:2043
#8  0x000055f13b928daf in binlog_commit_by_xid (hton=0x55f13f1bc988, xid=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log.cc:2090
#9  0x000055f13b7d9391 in ha_commit_or_rollback_by_xid (xid=<optimized out>, commit=commit@entry=true) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/handler.cc:2353
#10 0x000055f13b7418a5 in wsrep_ha_commit_or_rollback_by_xid (thd=thd@entry=0x14ae74000f98, commit=commit@entry=true) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/xa.cc:1445
#11 0x000055f13b7441f3 in trans_xa_commit (thd=thd@entry=0x14ae74000f98) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/xa.cc:840
#12 0x000055f13b50b0ae in mysql_execute_command (thd=thd@entry=0x14ae74000f98, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/sql_parse.cc:5880
#13 0x000055f13b50d1f3 in mysql_parse (thd=0x14ae74000f98, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x14afe5b457f0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/sql_parse.cc:8063
#14 0x000055f13b9440b6 in Query_log_event::do_apply_event (this=0x14af9c06bc98, rgi=0x14af9c066ab0, query_arg=0x14af9c06be03 "XA COMMIT X'61',X'',1", q_len_arg=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/sql_class.h:242
#15 0x000055f13b944a0c in Query_log_event::do_apply_event (this=<optimized out>, rgi=<optimized out>) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log_event_server.cc:1605
#16 0x000055f13b42f714 in Log_event::apply_event (rgi=0x14af9c066ab0, this=0x14af9c06bc98) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/log_event.h:1499
#17 apply_event_and_update_pos_apply (ev=ev@entry=0x14af9c06bc98, thd=thd@entry=0x14ae74000f98, rgi=rgi@entry=0x14af9c066ab0, reason=reason@entry=0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/slave.cc:3892
#18 0x000055f13b4397d9 in apply_event_and_update_pos_for_parallel (ev=ev@entry=0x14af9c06bc98, thd=thd@entry=0x14ae74000f98, rgi=rgi@entry=0x14af9c066ab0) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/slave.cc:4078
#19 0x000055f13b6e4383 in rpt_handle_event (qev=qev@entry=0x14af9c066658, rpt=rpt@entry=0x14af9c027e50) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/rpl_parallel.cc:134
#20 0x000055f13b6e8b08 in handle_rpl_parallel_thread (arg=arg@entry=0x14af9c027e50) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/sql/rpl_parallel.cc:1476
#21 0x000055f13baa8e72 in pfs_spawn_thread (arg=0x14af9c039578) at /test/git-bisect/10.6.14-9-enterprise-MDEV-30165-30619-31949_dbg/storage/perfschema/pfs.cc:2201
#22 0x000014b00ac94b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#23 0x000014b00ad26a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Comment by Roel Van de Paar [ 2023-09-26 ]

UniqueID's observed thus far. Note the small difference rollback/commit as per above testcases.

SIGABRT|acquire_xid|binlog_rollback_by_xid|ha_commit_or_rollback_by_xid|wsrep_ha_commit_or_rollback_by_xid   # rollback
SIGABRT|acquire_xid|binlog_commit_by_xid|ha_commit_or_rollback_by_xid|wsrep_ha_commit_or_rollback_by_xid   # commit
SIGSEGV|std::__atomic_base<int>::fetch_or|XID_cache_element::acquired_to_recovered|trans_xa_rollback|mysql_execute_command   # rollback
SIGSEGV|std::__atomic_base<int>::fetch_or|XID_cache_element::acquired_to_recovered|trans_xa_commit|mysql_execute_command   # commit

Comment by Roel Van de Paar [ 2023-09-27 ]

The original SIGABRT/SIGSEGV's discovered in this bug were in 10.6.14-9-enterprise-MDEV-30165-30619-31949 at revision 57b39cb8e1e4b374e0e138b286a0f9cb0669e401

The full patch code has now been moved to branch bb-10.6-MDEV-31949. In revision e9e38f741bb69db1a872854592e2f2c4c23255ce of that branch, the first XA ROLLBACK testcase does not assert nor crash anymore. The second testcase (XA COMMIT) does assert/crash however, as before.

As such, I have updated the titles for only the second testcase outcomes to match the new branch, and left the ES titles for the first testcase as-is.

Comment by Roel Van de Paar [ 2023-09-28 ]

SIGSEGV|std::__atomic_base<int>::fetch_or|XID_cache_element::acquired_to_recovered|trans_xa_commit|mysql_execute_command

Was observed in the latest (today's) version of the patch branch bb-10.6-MDEV-31949 at rev 4422251f2713eda078153f3b5eed2fe7143a7ac9 (Debug) so the issue still exists.

Comment by Roel Van de Paar [ 2023-09-28 ]

Created MDEV-32272 Failing assertion: table->n_rec_locks == 0 in dict_sys_t::remove on DROP TABLE.
While this related bug is not caused by MDEV-31949, it causes interference with testing of MDEV-31949 and re-testing of MDEV-32257.

Comment by Roel Van de Paar [ 2023-09-28 ]

An additional stack seen with this similar testcase:

--source include/have_binlog_format_row.inc
--source include/master-slave.inc
--connection slave
STOP SLAVE;
SET GLOBAL slave_parallel_threads=10;
START SLAVE;
SELECT SLEEP(2);
--connection master
SET @@session.pseudo_slave_mode=TRUE;
XA START 'a';
XA END 'a';
XA PREPARE 'a';
XA COMMIT 'a';
--source include/rpl_end.inc

bb-10.6-MDEV-31949 4422251f2713eda078153f3b5eed2fe7143a7ac9 (Debug)

Core was generated by `/test/PATCH2_MD280923-mariadb-10.6.16-linux-x86_64-dbg/bin/mariadbd --no-defaul'.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22879743034944)
    at ./nptl/pthread_kill.c:44
[Current thread is 1 (Thread 0x14cf1af4c640 (LWP 1703770))]
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=22879743034944) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=22879743034944) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=22879743034944, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x000014cf50a42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x000014cf50a287f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000014cf50a2871b in __assert_fail_base (fmt=0x14cf50bdd150 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5629101f1e1c "thd->is_killed()", file=0x5629101f2058 "/test/bb-10.6-MDEV-31949_dbg/sql/log.cc", line=2034, function=<optimized out>) at ./assert/assert.c:92
#6  0x000014cf50a39e96 in __GI___assert_fail (assertion=0x5629101f1e1c "thd->is_killed()", file=0x5629101f2058 "/test/bb-10.6-MDEV-31949_dbg/sql/log.cc", line=2034, function=0x5629101f1e2d "bool acquire_xid(THD*)") at ./assert/assert.c:101
#7  0x000056290f87c493 in acquire_xid (thd=0x14cdd0000f98) at /test/bb-10.6-MDEV-31949_dbg/sql/log.cc:2034
#8  binlog_commit_by_xid (hton=<optimized out>, xid=<optimized out>) at /test/bb-10.6-MDEV-31949_dbg/sql/log.cc:2081
#9  0x000056290f72dbef in ha_commit_or_rollback_by_xid (xid=<optimized out>, commit=commit@entry=true) at /test/bb-10.6-MDEV-31949_dbg/sql/handler.cc:2342
#10 0x000056290f69bf77 in trans_xa_commit (thd=thd@entry=0x14cdd0000f98) at /test/bb-10.6-MDEV-31949_dbg/sql/xa.cc:820
#11 0x000056290f4783f7 in mysql_execute_command (thd=thd@entry=0x14cdd0000f98, is_called_from_prepared_stmt=is_called_from_prepared_stmt@entry=false) at /test/bb-10.6-MDEV-31949_dbg/sql/sql_parse.cc:5886
#12 0x000056290f47a492 in mysql_parse (thd=0x14cdd0000f98, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x14cf1af4b7a0) at /test/bb-10.6-MDEV-31949_dbg/sql/sql_parse.cc:8050
#13 0x000056290f89770a in Query_log_event::do_apply_event (this=0x14cee806ebd8, rgi=0x14cee806c800, query_arg=0x14cee806ed43 "XA COMMIT X'61',X'',1", q_len_arg=<optimized out>) at /test/bb-10.6-MDEV-31949_dbg/sql/sql_class.h:241
#14 0x000056290f898060 in Query_log_event::do_apply_event (this=<optimized out>, rgi=<optimized out>) at /test/bb-10.6-MDEV-31949_dbg/sql/log_event_server.cc:1605
#15 0x000056290f39d4f4 in Log_event::apply_event (rgi=0x14cee806c800, this=0x14cee806ebd8) at /test/bb-10.6-MDEV-31949_dbg/sql/log_event.h:1499
#16 apply_event_and_update_pos_apply (ev=ev@entry=0x14cee806ebd8, thd=thd@entry=0x14cdd0000f98, rgi=rgi@entry=0x14cee806c800, reason=reason@entry=0) at /test/bb-10.6-MDEV-31949_dbg/sql/slave.cc:3894
#17 0x000056290f3a75df in apply_event_and_update_pos_for_parallel (ev=ev@entry=0x14cee806ebd8, thd=thd@entry=0x14cdd0000f98, rgi=rgi@entry=0x14cee806c800) at /test/bb-10.6-MDEV-31949_dbg/sql/slave.cc:4090
#18 0x000056290f63d753 in rpt_handle_event (qev=qev@entry=0x14cee806c3a8, rpt=rpt@entry=0x14cee8026df8) at /test/bb-10.6-MDEV-31949_dbg/sql/rpl_parallel.cc:66
#19 0x000056290f64223e in handle_rpl_parallel_thread (arg=arg@entry=0x14cee8026df8) at /test/bb-10.6-MDEV-31949_dbg/sql/rpl_parallel.cc:1514
#20 0x000056290fa0ada4 in pfs_spawn_thread (arg=0x14cee80387e8) at /test/bb-10.6-MDEV-31949_dbg/storage/perfschema/pfs.cc:2201
#21 0x000014cf50a94b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#22 0x000014cf50b26a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

SIGABRT|acquire_xid|binlog_commit_by_xid|ha_commit_or_rollback_by_xid|trans_xa_commit

It looks like slave_parallel_threads is required.

Comment by Roel Van de Paar [ 2023-09-28 ]

This testcase:

--source include/have_binlog_format_row.inc
--source include/master-slave.inc
SET @@session.pseudo_slave_mode=TRUE;
XA START 'a';
XA END 'a';
XA PREPARE 'a';
XA COMMIT 'a';

Produces:

bb-10.6-MDEV-31949 4422251f2713eda078153f3b5eed2fe7143a7ac9 (Debug)

2023-09-28 12:32:44 7 [Warning] Slave: XAER_NOTA: Unknown XID Error_code: 1397
^ Found warnings in /test/PATCH2_MD280923-mariadb-10.6.16-linux-x86_64-dbg/mysql-test/var/log/mysqld.2.err

Comment by Roel Van de Paar [ 2023-09-28 ]

This MTR testcase:

--source include/have_binlog_format_row.inc
--source include/master-slave.inc
--connection slave
STOP SLAVE;
SET GLOBAL slave_parallel_threads=10;
START SLAVE;
SELECT SLEEP(2);
--connection master
SET @@session.pseudo_slave_mode=TRUE;
XA START 'a';
XA END 'a';
XA PREPARE 'a';
XA ROLLBACK 'a';
--source include/rpl_end.inc

Crashes the original branch used when this ticket was created (10.6.14-9-enterprise-MDEV-3016530619-31949 57b39cb8e1e4b374e0e138b286a0f9cb0669e401) but no longer crashes the latest branch (bb-10.6MDEV-31949 4422251f2713eda078153f3b5eed2fe7143a7ac9) so some improvement was made. However, it still produces an error as follows:

10.6.16 4422251f2713eda078153f3b5eed2fe7143a7ac9 (Debug)

Last_Error	Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA ROLLBACK X'61',X'',1'
...
Wrong value for Last_SQL_Errno. Expected '0', got '1397'

Please check.

Comment by Roel Van de Paar [ 2023-09-28 ]

This testcase, which is the one from the first comment in MTR format, still crashes the current bb-10.6-MDEV-31949:

--source include/have_binlog_format_row.inc
--source include/master-slave.inc
--connection slave
STOP SLAVE;
SET GLOBAL slave_parallel_threads=10;
START SLAVE;
SELECT SLEEP(2);
--connection master
XA START 'a';
SET @@session.pseudo_slave_mode=TRUE;
XA END 'a';
XA PREPARE 'a';
XA COMMIT 'a';
--source include/rpl_end.inc

However, it now crashes on this stack for debug:

bb-10.6-MDEV-31949 4422251f2713eda078153f3b5eed2fe7143a7ac9 (Debug)

SIGABRT|acquire_xid|binlog_commit_by_xid|ha_commit_or_rollback_by_xid|trans_xa_commit

For optimized we still see:

10.6.16 4422251f2713eda078153f3b5eed2fe7143a7ac9 (Optimized)

SIGSEGV|std::__atomic_base<int>::fetch_or|XID_cache_element::acquired_to_recovered|trans_xa_commit|mysql_execute_command

Comment by Roel Van de Paar [ 2023-09-28 ]

Current action items for this bug summary:

  1. Fix the XA COMMIT crash, MTR testcase in previous comment
  2. Fix the secondary XA COMMIT crash, MTR testcase in this comment
  3. Review the XAER_NOTA error in this comment
  4. Review the secondary XAER_NOTA error in this comment
  5. Please fix MDEV-32272 and MDEV-30941 to enable better/full re-testing of this ticket as well as MDEV-31949

Once these items are fixed I can re-test this bug. Testing on MDEV-31949 continues.

Comment by Andrei Elkin [ 2023-09-28 ]

The fixes are in bb-10.6-andrei.

Comment by Brandon Nesterenko [ 2023-09-28 ]

The patch cb8f96142 looks good to me.

Comment by Roel Van de Paar [ 2023-10-02 ]

I see various commits relating to this ticket in bb-10.6-MDEV-31949. Is this ticket (all crashes and potentially the Unknown XID Error_code) deemed fully resolved in that branch? Thank you.

Comment by Andrei Elkin [ 2023-10-02 ]

Roel, first I agree MDEV-31949 exposed this bug in its own and valuable way. I did not mean to dismiss your original collected evidence of crashes, just they were expected (implied to myself at least) in MDEV-31949.
The bug is fixed in 10.6. The changes in MDEV-31949 can (and should) be viewed as pre-merge of the 10.6 commit.

Thanks for the testing!

Comment by Roel Van de Paar [ 2023-10-02 ]

Elkin Understood, thank you.

Not sure I understood the last line (The changes in...) - I would like to ensure that the bb-10.6-MDEV-31949 branch contains all fixes for this ticket MDEV-32257. Can you confirm the same yes/no? Thank you

Comment by Andrei Elkin [ 2023-10-04 ]

Roel, to your question, the server segfault is tacked by the 31949. The impossible logging of the "orphan" XA-rollback - is done in the 10.6.
In hindsight perhaps it'd be more proper to have two tickets reported, with the (bug) 31949 requiring the (bug) 10.6 (though the 31949 can be also reached by other path:s like the slave misconfiguration).

Generated at Thu Feb 08 10:30:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.