MariaDB Server
MDEV-29832

rpl.rpl_semi_sync_after_sync_row frequently fails

Details

    Description

      Recently on 10.11 branches, rpl.rpl_semi_sync_after_sync_row has been failing in various environments with this output:

      https://buildbot.mariadb.org/#/builders/384/builds/1622

      10.11 11cfaf394bdf9804f74ffd1289ec758e1107ba32

      rpl.rpl_semi_sync_after_sync_row 'innodb,row' w7 [ fail ]
              Test ended at 2022-10-20 02:56:50
       
      CURRENT_TEST: rpl.rpl_semi_sync_after_sync_row
      mysqltest: In included file "/home/buildbot/aarch64-fedora-36/build/mysql-test/suite/rpl/include/rpl_semi_sync.inc": 
      included from /home/buildbot/aarch64-fedora-36/build/mysql-test/suite/rpl/t/rpl_semi_sync.test at line 2:
      included from /home/buildbot/aarch64-fedora-36/build/mysql-test/suite/rpl/t/rpl_semi_sync_after_sync_row.test at line 3:
      At line 90: query 'create table t1 (a int) engine=$engine_type' failed: <Unknown> (2013): Lost connection to server during query
       
      The result from queries just before the failure was:
      < snip >
      rpl_semi_sync_master_enabled	OFF
      [ enable semi-sync on master ]
      set global rpl_semi_sync_master_enabled = 1;
      show variables like 'rpl_semi_sync_master_enabled';
      Variable_name	Value
      rpl_semi_sync_master_enabled	ON
      [ status of semi-sync on master should be ON even without any semi-sync slaves ]
      show status like 'Rpl_semi_sync_master_clients';
      Variable_name	Value
      Rpl_semi_sync_master_clients	0
      show status like 'Rpl_semi_sync_master_status';
      Variable_name	Value
      Rpl_semi_sync_master_status	ON
      show status like 'Rpl_semi_sync_master_yes_tx';
      Variable_name	Value
      Rpl_semi_sync_master_yes_tx	0
      #
      # BUG#45672 Semisync repl: ActiveTranx:insert_tranx_node: transaction node allocation failed
      # BUG#45673 Semisynch reports correct operation even if no slave is connected
      #
       
      More results from queries before failure can be found in /home/buildbot/aarch64-fedora-36/build/mysql-test/var/7/log/rpl_semi_sync_after_sync_row.log
       
       
      Server [mysqld.1 - pid: 61248, winpid: 61248, exit: 256] failed during test run
      Server log from this test:
      ----------SERVER LOG START-----------
      2022-10-20  2:55:53 387 [Note] Deleted Master_info file '/dev/shm/var_auto_CXzB/7/mysqld.1/data/master.info'.
      2022-10-20  2:55:53 387 [Note] Deleted Master_info file '/dev/shm/var_auto_CXzB/7/mysqld.1/data/relay-log.info'.
      2022-10-20  2:55:53 389 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('')
      2022-10-20  2:55:53 390 [Note] Semi-sync replication initialized for transactions.
      2022-10-20  2:55:53 390 [Note] Semi-sync replication enabled on the master.
      2022-10-20  2:55:53 0 [Note] Starting ack receiver thread
      2022-10-20 02:56:48 0xffff1f7ef000  InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-36/build/storage/innobase/include/fut0lst.h line 122
      InnoDB: Failing assertion: addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mariadbd startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      221020  2:56:48 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
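
      For context, the failing assertion is in flst_read_addr(), which decodes a file-based list node address (a page number plus a byte offset within that page) stored on an undo page. Below is a simplified, self-contained C++ sketch of that check; the constants match the InnoDB source, but the byte-reading helpers are inlined here for brevity.

      #include <cstdint>
      #include <cassert>

      // Simplified sketch of the check in include/fut0lst.h line 122.
      struct fil_addr_t { uint32_t page; uint16_t boffset; };

      static constexpr uint32_t FIL_NULL = 0xFFFFFFFFU; // address that points nowhere
      static constexpr uint16_t FIL_PAGE_DATA = 38;     // first byte after the page header

      // Decode a 6-byte on-page list node address: a big-endian 4-byte page
      // number followed by a big-endian 2-byte offset.
      inline fil_addr_t flst_read_addr(const unsigned char *faddr)
      {
        fil_addr_t addr;
        addr.page = uint32_t{faddr[0]} << 24 | uint32_t{faddr[1]} << 16 |
                    uint32_t{faddr[2]} << 8 | faddr[3];
        addr.boffset = uint16_t((faddr[4] << 8) | faddr[5]);
        // A valid address is either the null address or points past the page
        // header; anything else means the page contents are corrupted. This
        // is the assertion that fires in the reported crashes.
        assert(addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA);
        return addr;
      }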
      

      The traceback differs slightly among the environments, but for now I will collect them all here: the failing assertion is the same, and all of the failures are recent, suggesting that they are related.

      11cfaf394bdf9804f74ffd1289ec758e1107ba32 aarch64-fedora-36

      Thread pointer: 0xaaab1bb7e8f8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0xffff1f7ee608 thread_stack 0x49000
      mysys/stacktrace.c:213(my_print_stacktrace)[0xaaaae05da380]
      sql/signal_handler.cc:236(handle_fatal_signal)[0xaaaae0195dfc]
      addr2line: 'linux-vdso.so.1': No such file
      linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb2a407a0]
      ??:0(__pthread_kill_implementation)[0xffffb1e62658]
      :0(__GI_raise)[0xffffb1e1ab00]
      :0(__GI_abort)[0xffffb1e070f8]
      include/ut0ut.h:329(ib::logger& ib::logger::operator<< <int>(int const&))[0xaaaadfe93974]
      include/ut0new.h:766(ut_allocator<unsigned char, true>::deallocate_trace(ut_new_pfx_t const*))[0xaaaadfe92d24]
      include/dyn0buf.h:177(mtr_buf_t::~mtr_buf_t())[0xaaaae04ba8d0]
      trx/trx0purge.cc:1363(trx_purge(unsigned long, bool))[0xaaaae04bc730]
      srv/srv0srv.cc:1610(purge_coordinator_state::do_purge())[0xaaaae04b2ddc]
      srv/srv0srv.cc:1766(purge_coordinator_callback(void*))[0xaaaae04b2948]
      tpool/task_group.cc:71(tpool::task_group::execute(tpool::task*))[0xaaaae05888f4]
      tpool/tpool_generic.cc:578(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0xaaaae0587290]
      ??:0(std::error_code::default_error_condition() const)[0xffffb21530a0]
      ??:0(start_thread)[0xffffb1e609a8]
      ??:0(thread_start)[0xffffb1ecbd1c]
      

      5bd86986a8ec2d9222e621d36e10b63f4c026976 ppc64le-debian-sid

      Thread pointer: 0x13c92b5d8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7fff0f7ee308 thread_stack 0x49000
      mysys/stacktrace.c:212(my_print_stacktrace)[0x134260fa4]
      sql/signal_handler.cc:233(handle_fatal_signal)[0x133c20b38]
      addr2line: 'linux-vdso64.so.1': No such file
      linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x7fffaea90514]
      ??:0(pthread_key_delete)[0x7fffada8defc]
      ??:0(gsignal)[0x7fffada2cd3c]
      ??:0(abort)[0x7fffada0d060]
      ut/ut0dbg.cc:60(ut_dbg_assertion_failed(char const*, char const*, unsigned int))[0x13379b12c]
      include/fut0lst.h:122(flst_read_addr(unsigned char const*))[0x133799c8c]
      include/fut0lst.h:122(flst_read_addr(unsigned char const*))[0x1337e2198]
      trx/trx0purge.cc:1361(trx_purge(unsigned long, bool))[0x1340d322c]
      srv/srv0srv.cc:1611(purge_coordinator_state::do_purge())[0x1340c7c34]
      srv/srv0srv.cc:1765(purge_coordinator_callback(void*))[0x1340c7578]
      tpool/task_group.cc:70(tpool::task_group::execute(tpool::task*))[0x1341e76d8]
      tpool/task.cc:32(tpool::task::execute())[0x1341e7908]
      tpool/tpool_generic.cc:580(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0x1341e51f0]
      bits/invoke.h:74(void std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void (tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&))[0x1341e6384]
      ??:0(std::error_code::default_error_condition() const)[0x7fffadf06920]
      ??:0(pthread_condattr_setpshared)[0x7fffada8b3c8]
      ??:0(clone)[0x7fffadb3acc0]
      

      3a0c3b65de426a9b8cd8bd03406d52e10977f17a aarch64-debian-10

      Thread pointer: 0xaaab15b256c8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0xffff177ee708 thread_stack 0x49000
      mysys/stacktrace.c:212(my_print_stacktrace)[0xaaaadd338cf0]
      sql/signal_handler.cc:236(handle_fatal_signal)[0xaaaadcedea8c]
      addr2line: 'linux-vdso.so.1': No such file
      linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffac5707a0]
      ??:0(raise)[0xffffabbd2714]
      ??:0(abort)[0xffffabbc08e8]
      ut/ut0ut.cc:63(ut_print_timestamp(_IO_FILE*))[0xaaaadcbe4840]
      include/fut0lst.h:122(flst_read_addr(unsigned char const*))[0xaaaadd1fc98c]
      trx/trx0purge.cc:1361(trx_purge(unsigned long, bool))[0xaaaadd1fe9e8]
      srv/srv0srv.cc:1610(purge_coordinator_callback(void*))[0xaaaadd1f2900]
      tpool/task_group.cc:71(tpool::task_group::execute(tpool::task*))[0xaaaadd2e6508]
      tpool/tpool_generic.cc:578(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0xaaaadd2e45b8]
      ??:0(std::error_code::default_error_condition() const)[0xffffabeca1f4]
      ??:0(start_thread)[0xffffabfc77e4]
      ??:0(__clone)[0xffffabc6f59c]
      

          Activity

            alice, builder 231 is also known as ppc64le-rhel-8. I think that this is a Linux io_uring kernel bug that affects at least POWER. This is being tracked by MDEV-30728.

            Implementations of the POWER and ARM ISAs typically use weak memory ordering, unlike (say) IA-32, AMD64, SPARC, and RISC-V RVTSO, which follow Total Store Order (TSO). This bug could be a Linux kernel equivalent of MDEV-28430.
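
            To illustrate the difference, here is a generic C++ litmus test (not from the server source). Under TSO the assertion below cannot fail, because stores become visible in program order even with relaxed atomics; on ARMv8 or POWER, the two relaxed stores may be observed out of order, so it can.

            #include <atomic>
            #include <cassert>
            #include <thread>

            // Classic message-passing litmus test for weak memory ordering.
            std::atomic<int>  data{0};
            std::atomic<bool> flag{false};

            void writer()
            {
              data.store(42, std::memory_order_relaxed);
              flag.store(true, std::memory_order_relaxed); // needs release on weak ISAs
            }

            void reader()
            {
              while (!flag.load(std::memory_order_relaxed)) {}    // needs acquire
              assert(data.load(std::memory_order_relaxed) == 42); // may fail on ARM/POWER
            }

            int main()
            {
              std::thread t1(writer), t2(reader);
              t1.join();
              t2.join();
            }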

            marko Marko Mäkelä added a comment

            Builder 384, mentioned in the Description, is aarch64-fedora-36, which I think should be recent enough to support io_uring. For now, I could believe that the kernel bug behind MDEV-28430 can also affect ARMv8.

            marko Marko Mäkelä added a comment

            MDEV-33325 would replace the assertions with error handling.
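
            As a hypothetical sketch of that approach (the names and shape here are illustrative, not the actual MDEV-33325 patch), the memory trap becomes a recoverable error that the caller can act on:

            #include <cstdint>
            #include <cstdio>

            struct fil_addr_t { uint32_t page; uint16_t boffset; };
            static constexpr uint32_t FIL_NULL = 0xFFFFFFFFU;
            static constexpr uint16_t FIL_PAGE_DATA = 38;

            // Hypothetical replacement for the asserting reader: report
            // corruption to the caller instead of killing the server.
            static bool flst_read_addr_checked(const unsigned char *faddr,
                                               fil_addr_t *addr)
            {
              addr->page = uint32_t{faddr[0]} << 24 | uint32_t{faddr[1]} << 16 |
                           uint32_t{faddr[2]} << 8 | faddr[3];
              addr->boffset = uint16_t((faddr[4] << 8) | faddr[5]);
              if (addr->page != FIL_NULL && addr->boffset < FIL_PAGE_DATA)
              {
                fprintf(stderr, "InnoDB: corrupted list node address\n");
                return false; // the caller marks the page corrupted and bails out
              }
              return true;
            }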

            marko Marko Mäkelä added a comment

            The corruption could possibly be explained by a regression that was introduced in MDEV-27058 and fixed in MDEV-34453.

            I searched the cross reference, and these failures seem to be rather rare. The latest recorded crash of this kind was in December 2023, even though the assertion behind the crash was only replaced with more robust error handling in April 2024 by MDEV-34453. I did not find any failure between then and now that would indicate corruption in InnoDB.

            A possible explanation is a design choice that I made back in MDEV-13542: if any corruption is found during the purge of committed transaction history, the current operation is silently skipped. In this case, if the undo logs are found to be corrupted, we let them remain that way; that is, we do not delete any delete-marked committed records, and we may also never free the related undo log pages.
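
            A minimal sketch of that behaviour (all names here are hypothetical; the real logic lives in trx0purge.cc and related files):

            enum class purge_result { PURGED, SKIPPED };

            // Stub standing in for the real consistency checks on the undo
            // log lists; hypothetical name.
            static bool undo_history_is_consistent() { return true; }

            static purge_result purge_one_step()
            {
              if (!undo_history_is_consistent())
                // Silently skip: delete-marked committed records stay in
                // place, and undo pages reachable only through the broken
                // pointers are never freed. No error is surfaced.
                return purge_result::SKIPPED;
              // ... remove delete-marked records, free processed undo pages ...
              return purge_result::PURGED;
            }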

            I see that the test has been failing in many other ways, and some of those failures could be explained by InnoDB corruption. MDEV-34453 was closed in September 2024, but the fix was not merged to the main branch until October 3, 2024.

            The following failure is newer than that, and it has some other cause. Failures of this kind have occurred on various platforms:

            main 35cebfdc513f92b143b1a7229c480f4f684f1698

            rpl.rpl_semi_sync_after_sync_row 'row'   w1 [ fail ]
                    Test ended at 2024-10-15 03:23:18
             
            CURRENT_TEST: rpl.rpl_semi_sync_after_sync_row
            --- /data/buildbot/workers/prod/amd64-freebsd-14/build/mysql-test/suite/rpl/r/rpl_semi_sync_after_sync_row.result	2024-10-15 03:08:52.000000000 +0200
            +++ /data/buildbot/workers/prod/amd64-freebsd-14/build/mysql-test/suite/rpl/r/rpl_semi_sync_after_sync_row.reject	2024-10-15 03:23:15.254532000 +0200
            @@ -175,10 +175,10 @@
             [ master status should be ON ]
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	0
            +Rpl_semi_sync_master_no_tx	1
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             [ semi-sync replication of these transactions will fail ]
             insert into t1 values (500);
             [ master status should be OFF ]
            @@ -187,10 +187,10 @@
             Rpl_semi_sync_master_status	OFF
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	1
            +Rpl_semi_sync_master_no_tx	2
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             delete from t1 where a=10;
             delete from t1 where a=9;
             delete from t1 where a=8;
            @@ -208,10 +208,10 @@
             Rpl_semi_sync_master_status	OFF
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	12
            +Rpl_semi_sync_master_no_tx	13
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             #
             # Test semi-sync status on master will be ON again when slave catches up
             #
            @@ -238,10 +238,10 @@
             [ master status should be ON again after slave catches up ]
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	12
            +Rpl_semi_sync_master_no_tx	13
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             show status like 'Rpl_semi_sync_master_clients';
             Variable_name	Value
             Rpl_semi_sync_master_clients	1
            @@ -258,10 +258,10 @@
             [ Semi-sync master status variables before FLUSH STATUS ]
             SHOW STATUS LIKE 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	12
            +Rpl_semi_sync_master_no_tx	13
             SHOW STATUS LIKE 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	15
            +Rpl_semi_sync_master_yes_tx	14
             FLUSH NO_WRITE_TO_BINLOG GLOBAL STATUS;
             [ Semi-sync master status variables after FLUSH STATUS ]
             SHOW STATUS LIKE 'Rpl_semi_sync_master_no_tx';
            

            marko Marko Mäkelä added a comment

            It looks like the originally reported crashes were fixed by MDEV-33325, and I think that the root cause of the crashes (corruption) should have been fixed by MDEV-34453. The remaining causes of failures ought to reside outside InnoDB.

            marko Marko Mäkelä added a comment

            People

              Assignee: bnestere Brandon Nesterenko
              Reporter: angelique.sklavounos Angelique Sklavounos (Inactive)
              Votes: 0
              Watchers: 6
