Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-28053

Sysbench data load crashes Galera secondary node in async master slave setup

Details

    Description

      Sysbench data load crashes Galera secondary node in async master (mariadb server) > slave (Galera node1) setup

      Error info

      10.8.3 1596ef738c63b7b40a76ba75264a70644fdb7e51 (Debug)

      2022-03-14 12:11:20 7 [Note] WSREP: Apply TO END: 24: CREATE TABLE sbtest14(
        id INTEGER NOT NULL AUTO_INCREMENT,
        k INTEGER DEFAULT '0' NOT NULL,
        c CHAR(120) DEFAULT '' NOT NULL,
        pad CHAR(60) DEFAULT '' NOT NULL,
        PRIMARY KEY (id)
      ) /*! ENGINE = innodb */
      2022-03-14 12:11:20 7 [Note] WSREP: Set WSREPXid for InnoDB:  b2ea5b48-a361-11ec-a253-965dbcf44208:24
      2022-03-14 12:11:20 6 [Note] WSREP: wsrep_before_prepare: 1
      2022-03-14 12:11:20 6 [Note] WSREP: wsrep_after_prepare: 1
      2022-03-14 12:11:20 6 [Note] WSREP: wsrep_before_commit: 1, 25
      2022-03-14 12:11:20 6 [Note] WSREP: wsrep_ordered_commit: 1
      2022-03-14 12:11:20 6 [ERROR] mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950")
      2022-03-14 12:11:20 6 [Note] WSREP: log_and_order has failed 6 0
      mysqld: /test/10.8_dbg/wsrep-lib/src/transaction.cpp:659: int wsrep::transaction::before_rollback(): Assertion `state() == s_executing || state() == s_preparing || state() == s_prepared || state() == s_must_abort
       || state() == s_aborting || state() == s_cert_failed || state() == s_must_replay' failed.
      220314 12:11:20 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
      

      Test case

      GALERA_BASE=/home/ramesh/rpl/mariadb-10.8.3-linux-x86_64_one
      RPL_BASE=/home/ramesh/rpl/mariadb-10.8.3-linux-x86_64_two
      DATADIR=/home/ramesh/rpl
       
      rm -Rf $DATADIR/node* $DATADIR/data
       
      $GALERA_BASE/scripts/mariadb-install-db --no-defaults --force --auth-root-authentication-method=normal  --basedir=$GALERA_BASE --datadir=$DATADIR/node1
      $GALERA_BASE/scripts/mariadb-install-db --no-defaults --force --auth-root-authentication-method=normal  --basedir=$GALERA_BASE --datadir=$DATADIR/node2
      $RPL_BASE/scripts/mariadb-install-db --no-defaults --force --auth-root-authentication-method=normal  --basedir=$RPL_BASE --datadir=$DATADIR/data
       
       
      $GALERA_BASE/bin/mysqld --defaults-file=$DATADIR/n1.cnf --wsrep-new-cluster > $DATADIR/node1/node1.err 2>&1 & 
      sleep 2
      $GALERA_BASE/bin/mysqladmin  -uroot -S$DATADIR/node1/node1_socket.sock ping
       
      $GALERA_BASE/bin/mysqld --defaults-file=$DATADIR/n2.cnf > $DATADIR/node2/node2.err 2>&1 &
       
      $RPL_BASE/bin/mysqld --defaults-file=$DATADIR/my.cnf  > $DATADIR/data/mysql.err 2>&1 & 
      sleep 2
      $RPL_BASE/bin/mysql -uroot --socket=/home/ramesh/rpl/data/socket.sock
       
      delete from mysql.user where user='';
      create user repl@'%' identified by 'repl';
      grant all on *.* to  repl@'%';
      flush privileges;
      \q
       
       
      $GALERA_BASE/bin/mysql -uroot -S$DATADIR/node1/node1_socket.sock
      CHANGE MASTER TO MASTER_HOST='127.0.0.1', MASTER_PORT=4040, MASTER_USER='repl', MASTER_PASSWORD='repl', MASTER_USE_GTID=slave_pos; START SLAVE; 
      SHOW SLAVE STATUS \G
      \q
       
      sysbench /usr/share/sysbench/oltp_insert.lua --mysql-db=test --mysql-user=root  --db-driver=mysql --mysql-socket=/home/ramesh/rpl/data/socket.sock --threads=20 --tables=20 --table-size=1000 prepare
      
      

      Attachments

        1. my.cnf
          0.4 kB
        2. n1.cnf
          1.0 kB
        3. n2.cnf
          0.9 kB

        Activity

          Node1 crashes with similar assertion in different place when we run RQG with a simple grammar file.

          perl gentest.pl --dsn=dbi:mysql:host=127.0.0.1:port=4040:socket=/home/ramesh/rpl/data/socket.sock:user=root:database=test --gendata=conf/examples/example.zz --grammar=conf/examples/example.yy –-threads=8 --duration=300 --queries=100000
          

          10.8.0 cf7cc376bacb3f49f5afa2cce144dace82602546

          Core was generated by `/home/ramesh/rpl/mariadb-10.8.0-linux-x86_64_one/bin/mysqld --defaults-file=/ho'.
          --Type <RET> for more, q to quit, c to continue without paging--
          Program terminated with signal SIGABRT, Aborted.
          #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
          56	../sysdeps/unix/sysv/linux/pthread_kill.c: No such file or directory.
          [Current thread is 1 (Thread 0x149138272700 (LWP 324518))]
          (gdb) bt
          #0  __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
          #1  0x00005566b28071cd in my_write_core (sig=sig@entry=6) at /test/mtest/10.8_dbg/mysys/stacktrace.c:424
          #2  0x00005566b1fe42d5 in handle_fatal_signal (sig=6) at /test/mtest/10.8_dbg/sql/signal_handler.cc:345
          #3  <signal handler called>
          #4  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
          #5  0x000014914bf4c859 in __GI_abort () at abort.c:79
          #6  0x000014914bf4c729 in __assert_fail_base (fmt=0x14914c0e2588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
              assertion=0x5566b2f50420 "state() == s_executing || state() == s_prepared || state() == s_committed || state() == s_aborted || state() == s_must_abort || state() == s_cert_failed || state() == s_must_replay", file=0x5566b2f4ed98 "/test/mtest/10.8_dbg/wsrep-lib/src/transaction.cpp", line=810, function=<optimized out>) at assert.c:92
          #7  0x000014914bf5df36 in __GI___assert_fail (
              assertion=assertion@entry=0x5566b2f50420 "state() == s_executing || state() == s_prepared || state() == s_committed || state() == s_aborted || state() == s_must_abort || state() == s_cert_failed || state() == s_must_replay", file=file@entry=0x5566b2f4ed98 "/test/mtest/10.8_dbg/wsrep-lib/src/transaction.cpp", line=line@entry=810, function=function@entry=0x5566b2f503f0 "int wsrep::transaction::after_statement()")
              at assert.c:101
          #8  0x00005566b296b16a in wsrep::transaction::after_statement (this=this@entry=0x1490b00078a8) at /test/mtest/10.8_dbg/wsrep-lib/src/transaction.cpp:890
          #9  0x00005566b2950fd8 in wsrep::client_state::after_statement (this=this@entry=0x1490b0007840) at /test/mtest/10.8_dbg/wsrep-lib/src/client_state.cpp:272
          #10 0x00005566b1ee5c78 in wsrep_after_statement (thd=0x1490b0000ff8) at /test/mtest/10.8_dbg/sql/sql_class.h:5330
          #11 rpt_handle_event (qev=qev@entry=0x1490c003bfb8, rpt=rpt@entry=0x1490c000eaa0) at /test/mtest/10.8_dbg/sql/rpl_parallel.cc:66
          #12 0x00005566b1eeb555 in handle_rpl_parallel_thread (arg=arg@entry=0x1490c000eaa0) at /test/mtest/10.8_dbg/sql/rpl_parallel.cc:1376
          #13 0x00005566b22e5601 in pfs_spawn_thread (arg=0x1490c000fd08) at /test/mtest/10.8_dbg/storage/perfschema/pfs.cc:2201
          #14 0x000014914c45b609 in start_thread (arg=<optimized out>) at pthread_create.c:477
          #15 0x000014914c049293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
          (gdb) 
          

          ramesh Ramesh Sivaraman added a comment - Node1 crashes with similar assertion in different place when we run RQG with a simple grammar file. perl gentest.pl --dsn=dbi:mysql:host=127.0.0.1:port=4040:socket=/home/ramesh/rpl/data/socket.sock:user=root:database=test --gendata=conf/examples/example.zz --grammar=conf/examples/example.yy –-threads=8 --duration=300 --queries=100000 10.8.0 cf7cc376bacb3f49f5afa2cce144dace82602546 Core was generated by `/home/ramesh/rpl/mariadb-10.8.0-linux-x86_64_one/bin/mysqld --defaults-file=/ho'. --Type <RET> for more, q to quit, c to continue without paging-- Program terminated with signal SIGABRT, Aborted. #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 56 ../sysdeps/unix/sysv/linux/pthread_kill.c: No such file or directory. [Current thread is 1 (Thread 0x149138272700 (LWP 324518))] (gdb) bt #0 __pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 0x00005566b28071cd in my_write_core (sig=sig@entry=6) at /test/mtest/10.8_dbg/mysys/stacktrace.c:424 #2 0x00005566b1fe42d5 in handle_fatal_signal (sig=6) at /test/mtest/10.8_dbg/sql/signal_handler.cc:345 #3 <signal handler called> #4 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 #5 0x000014914bf4c859 in __GI_abort () at abort.c:79 #6 0x000014914bf4c729 in __assert_fail_base (fmt=0x14914c0e2588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x5566b2f50420 "state() == s_executing || state() == s_prepared || state() == s_committed || state() == s_aborted || state() == s_must_abort || state() == s_cert_failed || state() == s_must_replay", file=0x5566b2f4ed98 "/test/mtest/10.8_dbg/wsrep-lib/src/transaction.cpp", line=810, function=<optimized out>) at assert.c:92 #7 0x000014914bf5df36 in __GI___assert_fail ( assertion=assertion@entry=0x5566b2f50420 "state() == s_executing || state() == s_prepared || state() == s_committed || state() == s_aborted || state() == s_must_abort || state() == s_cert_failed || state() == s_must_replay", file=file@entry=0x5566b2f4ed98 "/test/mtest/10.8_dbg/wsrep-lib/src/transaction.cpp", line=line@entry=810, function=function@entry=0x5566b2f503f0 "int wsrep::transaction::after_statement()") at assert.c:101 #8 0x00005566b296b16a in wsrep::transaction::after_statement (this=this@entry=0x1490b00078a8) at /test/mtest/10.8_dbg/wsrep-lib/src/transaction.cpp:890 #9 0x00005566b2950fd8 in wsrep::client_state::after_statement (this=this@entry=0x1490b0007840) at /test/mtest/10.8_dbg/wsrep-lib/src/client_state.cpp:272 #10 0x00005566b1ee5c78 in wsrep_after_statement (thd=0x1490b0000ff8) at /test/mtest/10.8_dbg/sql/sql_class.h:5330 #11 rpt_handle_event (qev=qev@entry=0x1490c003bfb8, rpt=rpt@entry=0x1490c000eaa0) at /test/mtest/10.8_dbg/sql/rpl_parallel.cc:66 #12 0x00005566b1eeb555 in handle_rpl_parallel_thread (arg=arg@entry=0x1490c000eaa0) at /test/mtest/10.8_dbg/sql/rpl_parallel.cc:1376 #13 0x00005566b22e5601 in pfs_spawn_thread (arg=0x1490c000fd08) at /test/mtest/10.8_dbg/storage/perfschema/pfs.cc:2201 #14 0x000014914c45b609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #15 0x000014914c049293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (gdb)

          jplindst Replication breaks in 10.4 optimized build even after fixing MDEV-27568

          10.4.25 c519aa3d7aea1afb7754ccf70aa5b276ff5fbf87

          2022-03-19 17:03:42 23 [Note] WSREP: wsrep_after_prepare: 1
          2022-03-19 17:03:42 23 [Note] WSREP: wsrep_before_commit: 1, 91
          2022-03-19 17:03:42 23 [Note] WSREP: wsrep_ordered_commit: 1
          2022-03-19 17:03:42 23 [ERROR] mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950")
          2022-03-19 17:03:42 22 [Note] WSREP: wsrep_after_statement for 22 client_state exec  client_mode local trans_state executing
          2022-03-19 17:03:42 23 [Note] WSREP: log_and_order has failed 23 0
          2022-03-19 17:03:42 23 [Note] WSREP: rollback skipped 0x148128035300 1
          2022-03-19 17:03:42 23 [Note] WSREP: SQL apply failed, res 1 conflict state: ordered_commit
          2022-03-19 17:03:42 23 [ERROR] Slave SQL: Node has dropped from cluster, Gtid 0-11-88, Internal MariaDB error code: 1047
          2022-03-19 17:03:42 23 [Note] WSREP: wsrep_after_statement for 23 client_state exec  client_mode local trans_state ordered_commit
          2022-03-19 17:03:42 23 [ERROR] Slave (additional info): An attempt was made to binlog GTID 0-11-88 which would create an out-of-order sequence number with existing GTID 0-11-90, and gtid strict mode is enabled Error_code: 1950
          2022-03-19 17:03:42 23 [Warning] Slave: An attempt was made to binlog GTID 0-11-88 which would create an out-of-order sequence number with existing GTID 0-11-90, and gtid strict mode is enabled Error_code: 1950
          2022-03-19 17:03:42 23 [Warning] Slave: Error writing file 'binlog' (errno: 1950 "Unknown error 1950") Error_code: 1026
          2022-03-19 17:03:42 23 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 68458343; GTID position '0-11-90'
          2022-03-19 17:03:42 22 [Note] WSREP: wsrep_after_statement for 22 client_state exec  client_mode local trans_state executing
          

          ramesh Ramesh Sivaraman added a comment - jplindst Replication breaks in 10.4 optimized build even after fixing MDEV-27568 10.4.25 c519aa3d7aea1afb7754ccf70aa5b276ff5fbf87 2022-03-19 17:03:42 23 [Note] WSREP: wsrep_after_prepare: 1 2022-03-19 17:03:42 23 [Note] WSREP: wsrep_before_commit: 1, 91 2022-03-19 17:03:42 23 [Note] WSREP: wsrep_ordered_commit: 1 2022-03-19 17:03:42 23 [ERROR] mysqld: Error writing file 'binlog' (errno: 1950 "Unknown error 1950") 2022-03-19 17:03:42 22 [Note] WSREP: wsrep_after_statement for 22 client_state exec client_mode local trans_state executing 2022-03-19 17:03:42 23 [Note] WSREP: log_and_order has failed 23 0 2022-03-19 17:03:42 23 [Note] WSREP: rollback skipped 0x148128035300 1 2022-03-19 17:03:42 23 [Note] WSREP: SQL apply failed, res 1 conflict state: ordered_commit 2022-03-19 17:03:42 23 [ERROR] Slave SQL: Node has dropped from cluster, Gtid 0-11-88, Internal MariaDB error code: 1047 2022-03-19 17:03:42 23 [Note] WSREP: wsrep_after_statement for 23 client_state exec client_mode local trans_state ordered_commit 2022-03-19 17:03:42 23 [ERROR] Slave (additional info): An attempt was made to binlog GTID 0-11-88 which would create an out-of-order sequence number with existing GTID 0-11-90, and gtid strict mode is enabled Error_code: 1950 2022-03-19 17:03:42 23 [Warning] Slave: An attempt was made to binlog GTID 0-11-88 which would create an out-of-order sequence number with existing GTID 0-11-90, and gtid strict mode is enabled Error_code: 1950 2022-03-19 17:03:42 23 [Warning] Slave: Error writing file 'binlog' (errno: 1950 "Unknown error 1950") Error_code: 1026 2022-03-19 17:03:42 23 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000001' position 68458343; GTID position '0-11-90' 2022-03-19 17:03:42 22 [Note] WSREP: wsrep_after_statement for 22 client_state exec client_mode local trans_state executing

          The error "Error writing file 'binlog' (errno: 1950 "Unknown error 1950")" is reported because there is an attempt to write a gtid event to the binlog and the gtid is out of order (1950 is ER_GTID_STRICT_OUT_OF_ORDER). The error is raised in this case because gtid_strict_mode is enabled.
          Currently, the combination parallel_slave_threads > 1 and Galera acting as a slave does not guarantee that the replication order of Galera matches the order given by the aync replication master.

          sciascid Daniele Sciascia added a comment - The error "Error writing file 'binlog' (errno: 1950 "Unknown error 1950")" is reported because there is an attempt to write a gtid event to the binlog and the gtid is out of order (1950 is ER_GTID_STRICT_OUT_OF_ORDER). The error is raised in this case because gtid_strict_mode is enabled. Currently, the combination parallel_slave_threads > 1 and Galera acting as a slave does not guarantee that the replication order of Galera matches the order given by the aync replication master.

          ramesh Fix in branch bb-10.4-galera

          jplindst Jan Lindström (Inactive) added a comment - ramesh Fix in branch bb-10.4-galera

          People

            sciascid Daniele Sciascia
            ramesh Ramesh Sivaraman
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.