MariaDB Server
MDEV-29832

rpl.rpl_semi_sync_after_sync_row frequently fails

Details

    Description

      Recently on 10.11 branches, rpl.rpl_semi_sync_after_sync_row has been failing in various environments with this output:

      https://buildbot.mariadb.org/#/builders/384/builds/1622

      10.11 11cfaf394bdf9804f74ffd1289ec758e1107ba32

      rpl.rpl_semi_sync_after_sync_row 'innodb,row' w7 [ fail ]
              Test ended at 2022-10-20 02:56:50
       
      CURRENT_TEST: rpl.rpl_semi_sync_after_sync_row
      mysqltest: In included file "/home/buildbot/aarch64-fedora-36/build/mysql-test/suite/rpl/include/rpl_semi_sync.inc": 
      included from /home/buildbot/aarch64-fedora-36/build/mysql-test/suite/rpl/t/rpl_semi_sync.test at line 2:
      included from /home/buildbot/aarch64-fedora-36/build/mysql-test/suite/rpl/t/rpl_semi_sync_after_sync_row.test at line 3:
      At line 90: query 'create table t1 (a int) engine=$engine_type' failed: <Unknown> (2013): Lost connection to server during query
       
      The result from queries just before the failure was:
      < snip >
      rpl_semi_sync_master_enabled	OFF
      [ enable semi-sync on master ]
      set global rpl_semi_sync_master_enabled = 1;
      show variables like 'rpl_semi_sync_master_enabled';
      Variable_name	Value
      rpl_semi_sync_master_enabled	ON
      [ status of semi-sync on master should be ON even without any semi-sync slaves ]
      show status like 'Rpl_semi_sync_master_clients';
      Variable_name	Value
      Rpl_semi_sync_master_clients	0
      show status like 'Rpl_semi_sync_master_status';
      Variable_name	Value
      Rpl_semi_sync_master_status	ON
      show status like 'Rpl_semi_sync_master_yes_tx';
      Variable_name	Value
      Rpl_semi_sync_master_yes_tx	0
      #
      # BUG#45672 Semisync repl: ActiveTranx:insert_tranx_node: transaction node allocation failed
      # BUG#45673 Semisynch reports correct operation even if no slave is connected
      #
       
      More results from queries before failure can be found in /home/buildbot/aarch64-fedora-36/build/mysql-test/var/7/log/rpl_semi_sync_after_sync_row.log
       
       
      Server [mysqld.1 - pid: 61248, winpid: 61248, exit: 256] failed during test run
      Server log from this test:
      ----------SERVER LOG START-----------
      2022-10-20  2:55:53 387 [Note] Deleted Master_info file '/dev/shm/var_auto_CXzB/7/mysqld.1/data/master.info'.
      2022-10-20  2:55:53 387 [Note] Deleted Master_info file '/dev/shm/var_auto_CXzB/7/mysqld.1/data/relay-log.info'.
      2022-10-20  2:55:53 389 [Note] Start binlog_dump to slave_server(2), pos(, 4), using_gtid(1), gtid('')
      2022-10-20  2:55:53 390 [Note] Semi-sync replication initialized for transactions.
      2022-10-20  2:55:53 390 [Note] Semi-sync replication enabled on the master.
      2022-10-20  2:55:53 0 [Note] Starting ack receiver thread
      2022-10-20 02:56:48 0xffff1f7ef000  InnoDB: Assertion failure in file /home/buildbot/aarch64-fedora-36/build/storage/innobase/include/fut0lst.h line 122
      InnoDB: Failing assertion: addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mariadbd startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      221020  2:56:48 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
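
      For context, the failing assertion is in flst_read_addr(), which decodes a file-based list node address (a page number plus a byte offset within that page) stored on an undo page. Below is a simplified, self-contained C++ sketch of that check; the constants match the InnoDB source, but the byte-reading helpers are inlined here for brevity.

      #include <cstdint>
      #include <cassert>

      // Simplified sketch of the check in include/fut0lst.h line 122.
      struct fil_addr_t { uint32_t page; uint16_t boffset; };

      static constexpr uint32_t FIL_NULL = 0xFFFFFFFFU; // address that points nowhere
      static constexpr uint16_t FIL_PAGE_DATA = 38;     // first byte after the page header

      // Decode a 6-byte on-page list node address: a big-endian 4-byte page
      // number followed by a big-endian 2-byte offset.
      inline fil_addr_t flst_read_addr(const unsigned char *faddr)
      {
        fil_addr_t addr;
        addr.page = uint32_t{faddr[0]} << 24 | uint32_t{faddr[1]} << 16 |
                    uint32_t{faddr[2]} << 8 | faddr[3];
        addr.boffset = uint16_t((faddr[4] << 8) | faddr[5]);
        // A valid address is either the null address or points past the page
        // header; anything else means the page contents are corrupted. This
        // is the assertion that fires in the reported crashes.
        assert(addr.page == FIL_NULL || addr.boffset >= FIL_PAGE_DATA);
        return addr;
      }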
      

      The traceback differs slightly among the environments, but for now I will collect them all here: the failing assertion is the same, and all of the failures are recent, suggesting that they are related.

      11cfaf394bdf9804f74ffd1289ec758e1107ba32 aarch64-fedora-36

      Thread pointer: 0xaaab1bb7e8f8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0xffff1f7ee608 thread_stack 0x49000
      mysys/stacktrace.c:213(my_print_stacktrace)[0xaaaae05da380]
      sql/signal_handler.cc:236(handle_fatal_signal)[0xaaaae0195dfc]
      addr2line: 'linux-vdso.so.1': No such file
      linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffb2a407a0]
      ??:0(__pthread_kill_implementation)[0xffffb1e62658]
      :0(__GI_raise)[0xffffb1e1ab00]
      :0(__GI_abort)[0xffffb1e070f8]
      include/ut0ut.h:329(ib::logger& ib::logger::operator<< <int>(int const&))[0xaaaadfe93974]
      include/ut0new.h:766(ut_allocator<unsigned char, true>::deallocate_trace(ut_new_pfx_t const*))[0xaaaadfe92d24]
      include/dyn0buf.h:177(mtr_buf_t::~mtr_buf_t())[0xaaaae04ba8d0]
      trx/trx0purge.cc:1363(trx_purge(unsigned long, bool))[0xaaaae04bc730]
      srv/srv0srv.cc:1610(purge_coordinator_state::do_purge())[0xaaaae04b2ddc]
      srv/srv0srv.cc:1766(purge_coordinator_callback(void*))[0xaaaae04b2948]
      tpool/task_group.cc:71(tpool::task_group::execute(tpool::task*))[0xaaaae05888f4]
      tpool/tpool_generic.cc:578(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0xaaaae0587290]
      ??:0(std::error_code::default_error_condition() const)[0xffffb21530a0]
      ??:0(start_thread)[0xffffb1e609a8]
      ??:0(thread_start)[0xffffb1ecbd1c]
      

      5bd86986a8ec2d9222e621d36e10b63f4c026976 ppc64le-debian-sid

      Thread pointer: 0x13c92b5d8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7fff0f7ee308 thread_stack 0x49000
      mysys/stacktrace.c:212(my_print_stacktrace)[0x134260fa4]
      sql/signal_handler.cc:233(handle_fatal_signal)[0x133c20b38]
      addr2line: 'linux-vdso64.so.1': No such file
      linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x7fffaea90514]
      ??:0(pthread_key_delete)[0x7fffada8defc]
      ??:0(gsignal)[0x7fffada2cd3c]
      ??:0(abort)[0x7fffada0d060]
      ut/ut0dbg.cc:60(ut_dbg_assertion_failed(char const*, char const*, unsigned int))[0x13379b12c]
      include/fut0lst.h:122(flst_read_addr(unsigned char const*))[0x133799c8c]
      include/fut0lst.h:122(flst_read_addr(unsigned char const*))[0x1337e2198]
      trx/trx0purge.cc:1361(trx_purge(unsigned long, bool))[0x1340d322c]
      srv/srv0srv.cc:1611(purge_coordinator_state::do_purge())[0x1340c7c34]
      srv/srv0srv.cc:1765(purge_coordinator_callback(void*))[0x1340c7578]
      tpool/task_group.cc:70(tpool::task_group::execute(tpool::task*))[0x1341e76d8]
      tpool/task.cc:32(tpool::task::execute())[0x1341e7908]
      tpool/tpool_generic.cc:580(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0x1341e51f0]
      bits/invoke.h:74(void std::__invoke_impl<void, void (tpool::thread_pool_generic::*)(tpool::worker_data*), tpool::thread_pool_generic*, tpool::worker_data*>(std::__invoke_memfun_deref, void (tpool::thread_pool_generic::*&&)(tpool::worker_data*), tpool::thread_pool_generic*&&, tpool::worker_data*&&))[0x1341e6384]
      ??:0(std::error_code::default_error_condition() const)[0x7fffadf06920]
      ??:0(pthread_condattr_setpshared)[0x7fffada8b3c8]
      ??:0(clone)[0x7fffadb3acc0]
      

      3a0c3b65de426a9b8cd8bd03406d52e10977f17a aarch64-debian-10

      Thread pointer: 0xaaab15b256c8
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0xffff177ee708 thread_stack 0x49000
      mysys/stacktrace.c:212(my_print_stacktrace)[0xaaaadd338cf0]
      sql/signal_handler.cc:236(handle_fatal_signal)[0xaaaadcedea8c]
      addr2line: 'linux-vdso.so.1': No such file
      linux-vdso.so.1(__kernel_rt_sigreturn+0x0)[0xffffac5707a0]
      ??:0(raise)[0xffffabbd2714]
      ??:0(abort)[0xffffabbc08e8]
      ut/ut0ut.cc:63(ut_print_timestamp(_IO_FILE*))[0xaaaadcbe4840]
      include/fut0lst.h:122(flst_read_addr(unsigned char const*))[0xaaaadd1fc98c]
      trx/trx0purge.cc:1361(trx_purge(unsigned long, bool))[0xaaaadd1fe9e8]
      srv/srv0srv.cc:1610(purge_coordinator_callback(void*))[0xaaaadd1f2900]
      tpool/task_group.cc:71(tpool::task_group::execute(tpool::task*))[0xaaaadd2e6508]
      tpool/tpool_generic.cc:578(tpool::thread_pool_generic::worker_main(tpool::worker_data*))[0xaaaadd2e45b8]
      ??:0(std::error_code::default_error_condition() const)[0xffffabeca1f4]
      ??:0(start_thread)[0xffffabfc77e4]
      ??:0(__clone)[0xffffabc6f59c]
      

          Activity

            alice, builder 231 is also known as ppc64le-rhel-8. I think that this is a Linux io_uring kernel bug that affects at least POWER. This is being tracked by MDEV-30728.

            Implementations of the POWER and ARM ISAs typically use weak memory ordering, unlike (say) IA-32, AMD64, SPARC, and RISC-V RVTSO, which follow Total Store Order (TSO). This bug could be a Linux kernel equivalent of MDEV-28430.
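
            To illustrate the difference, here is a generic C++ litmus test (not from the server source). Under TSO the assertion below cannot fail, because stores become visible in program order even with relaxed atomics; on ARMv8 or POWER, the two relaxed stores may be observed out of order, so it can.

            #include <atomic>
            #include <cassert>
            #include <thread>

            // Classic message-passing litmus test for weak memory ordering.
            std::atomic<int>  data{0};
            std::atomic<bool> flag{false};

            void writer()
            {
              data.store(42, std::memory_order_relaxed);
              flag.store(true, std::memory_order_relaxed); // needs release on weak ISAs
            }

            void reader()
            {
              while (!flag.load(std::memory_order_relaxed)) {}    // needs acquire
              assert(data.load(std::memory_order_relaxed) == 42); // may fail on ARM/POWER
            }

            int main()
            {
              std::thread t1(writer), t2(reader);
              t1.join();
              t2.join();
            }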

            marko Marko Mäkelä added a comment

            Builder 384, mentioned in the Description, is aarch64-fedora-36, which I think should be recent enough to support io_uring. For now, I could believe that the kernel bug behind MDEV-28430 can also affect ARMv8.

            marko Marko Mäkelä added a comment

            MDEV-33325 would replace the assertions with error handling.
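
            As a hypothetical sketch of that approach (the names and shape here are illustrative, not the actual MDEV-33325 patch), the memory trap becomes a recoverable error that the caller can act on:

            #include <cstdint>
            #include <cstdio>

            struct fil_addr_t { uint32_t page; uint16_t boffset; };
            static constexpr uint32_t FIL_NULL = 0xFFFFFFFFU;
            static constexpr uint16_t FIL_PAGE_DATA = 38;

            // Hypothetical replacement for the asserting reader: report
            // corruption to the caller instead of killing the server.
            static bool flst_read_addr_checked(const unsigned char *faddr,
                                               fil_addr_t *addr)
            {
              addr->page = uint32_t{faddr[0]} << 24 | uint32_t{faddr[1]} << 16 |
                           uint32_t{faddr[2]} << 8 | faddr[3];
              addr->boffset = uint16_t((faddr[4] << 8) | faddr[5]);
              if (addr->page != FIL_NULL && addr->boffset < FIL_PAGE_DATA)
              {
                fprintf(stderr, "InnoDB: corrupted list node address\n");
                return false; // the caller marks the page corrupted and bails out
              }
              return true;
            }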

            marko Marko Mäkelä added a comment

            The corruption could possibly be explained by a regression that was introduced in MDEV-27058 and fixed in MDEV-34453.

            I searched the cross reference, and these failures seem to be rather rare. The latest recorded crash of this kind was in December 2023, even though the assertion behind the crash was only replaced with more robust error handling in April 2024 by MDEV-34453. I did not find any failure between then and now that would indicate corruption in InnoDB.

            A possible explanation is a design choice that I made back in MDEV-13542: if any corruption is found during the purge of committed transaction history, the current operation is silently skipped. In this case, if the undo logs are found to be corrupted, we let them remain that way; that is, we do not delete any delete-marked committed records, and we may also never free the related undo log pages.
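
            A minimal sketch of that behaviour (all names here are hypothetical; the real logic lives in trx0purge.cc and related files):

            enum class purge_result { PURGED, SKIPPED };

            // Stub standing in for the real consistency checks on the undo
            // log lists; hypothetical name.
            static bool undo_history_is_consistent() { return true; }

            static purge_result purge_one_step()
            {
              if (!undo_history_is_consistent())
                // Silently skip: delete-marked committed records stay in
                // place, and undo pages reachable only through the broken
                // pointers are never freed. No error is surfaced.
                return purge_result::SKIPPED;
              // ... remove delete-marked records, free processed undo pages ...
              return purge_result::PURGED;
            }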

            I see that the test has been failing in many other ways, and some of those failures could be explained by InnoDB corruption. MDEV-34453 was closed in September 2024, but the fix was not merged to the main branch until October 3, 2024.

            The following failure is newer than that, and it has some other cause. Failures of this kind have occurred on various platforms:

            main 35cebfdc513f92b143b1a7229c480f4f684f1698

            rpl.rpl_semi_sync_after_sync_row 'row'   w1 [ fail ]
                    Test ended at 2024-10-15 03:23:18
             
            CURRENT_TEST: rpl.rpl_semi_sync_after_sync_row
            --- /data/buildbot/workers/prod/amd64-freebsd-14/build/mysql-test/suite/rpl/r/rpl_semi_sync_after_sync_row.result	2024-10-15 03:08:52.000000000 +0200
            +++ /data/buildbot/workers/prod/amd64-freebsd-14/build/mysql-test/suite/rpl/r/rpl_semi_sync_after_sync_row.reject	2024-10-15 03:23:15.254532000 +0200
            @@ -175,10 +175,10 @@
             [ master status should be ON ]
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	0
            +Rpl_semi_sync_master_no_tx	1
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             [ semi-sync replication of these transactions will fail ]
             insert into t1 values (500);
             [ master status should be OFF ]
            @@ -187,10 +187,10 @@
             Rpl_semi_sync_master_status	OFF
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	1
            +Rpl_semi_sync_master_no_tx	2
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             delete from t1 where a=10;
             delete from t1 where a=9;
             delete from t1 where a=8;
            @@ -208,10 +208,10 @@
             Rpl_semi_sync_master_status	OFF
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	12
            +Rpl_semi_sync_master_no_tx	13
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             #
             # Test semi-sync status on master will be ON again when slave catches up
             #
            @@ -238,10 +238,10 @@
             [ master status should be ON again after slave catches up ]
             show status like 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	12
            +Rpl_semi_sync_master_no_tx	13
             show status like 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	14
            +Rpl_semi_sync_master_yes_tx	13
             show status like 'Rpl_semi_sync_master_clients';
             Variable_name	Value
             Rpl_semi_sync_master_clients	1
            @@ -258,10 +258,10 @@
             [ Semi-sync master status variables before FLUSH STATUS ]
             SHOW STATUS LIKE 'Rpl_semi_sync_master_no_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_no_tx	12
            +Rpl_semi_sync_master_no_tx	13
             SHOW STATUS LIKE 'Rpl_semi_sync_master_yes_tx';
             Variable_name	Value
            -Rpl_semi_sync_master_yes_tx	15
            +Rpl_semi_sync_master_yes_tx	14
             FLUSH NO_WRITE_TO_BINLOG GLOBAL STATUS;
             [ Semi-sync master status variables after FLUSH STATUS ]
             SHOW STATUS LIKE 'Rpl_semi_sync_master_no_tx';
            

            marko Marko Mäkelä added a comment

            It looks like the originally reported crashes were fixed by MDEV-33325, and I think that the root cause of the crashes (corruption) should have been fixed by MDEV-34453. The remaining causes of failures ought to reside outside InnoDB.

            marko Marko Mäkelä added a comment

            People

              Assignee: bnestere Brandon Nesterenko
              Reporter: angelique.sklavounos Angelique Sklavounos (Inactive)
              Votes: 0
              Watchers: 6
