Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-36174

Assertion `!all' failed in binlog_rollback on INSERT...SELECT

Details

    Description

      I have observed the following stack twice in 11.8.1:

      CS 11.8.1 33e0796e7a154e02a5e53c55cefc5d6feb4f5e6d (Debug)

      mariadbd: /test/11.8_dbg/sql/log.cc:2480: int binlog_rollback(THD *, bool): Assertion `!all' failed.
      

      CS 11.8.1 33e0796e7a154e02a5e53c55cefc5d6feb4f5e6d (Debug)

      Core was generated by `/test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd --no-defaults --max'.
      Program terminated with signal SIGABRT, Aborted.
      #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
       
      [Current thread is 1 (LWP 2491313)]
      (gdb) bt
      #0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:44
      #1  __pthread_kill_internal (signo=6, threadid=<optimized out>)at ./nptl/pthread_kill.c:78
      #2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6)at ./nptl/pthread_kill.c:89
      #3  0x000014ea8a64526e in __GI_raise (sig=sig@entry=6)at ../sysdeps/posix/raise.c:26
      #4  0x000014ea8a6288ff in __GI_abort () at ./stdlib/abort.c:79
      #5  0x000014ea8a62881b in __assert_fail_base (fmt=0x14ea8a7d01e8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x5564b82d0604 "!all", file=file@entry=0x5564b82caf86 "/test/11.8_dbg/sql/log.cc", line=line@entry=2480, function=function@entry=0x5564b82d05b7 "int binlog_rollback(THD *, bool)")at ./assert/assert.c:94
      #6  0x000014ea8a63b507 in __assert_fail (assertion=0x5564b82d0604 "!all", file=0x5564b82caf86 "/test/11.8_dbg/sql/log.cc", line=2480, function=0x5564b82d05b7 "int binlog_rollback(THD *, bool)")at ./assert/assert.c:103
      #7  0x00005564b77b15a9 in binlog_rollback (thd=0x14ea1c000d58, all=true)at /test/11.8_dbg/sql/log.cc:2480
      #8  0x00005564b7626712 in ha_rollback_trans (thd=0x14ea1c000d58, all=true)at /test/11.8_dbg/sql/handler.cc:2344
      #9  0x00005564b753f8ae in xa_trans_force_rollback (thd=0x14ea1c000d58)at /test/11.8_dbg/sql/xa.cc:412
      #10 0x00005564b75415b1 in trans_xa_detach (thd=0x14ea1c000d58)at /test/11.8_dbg/sql/xa.cc:898
      #11 0x00005564b71750ea in THD::cleanup (this=0x14ea1c000d58)at /test/11.8_dbg/sql/sql_class.cc:1673
      #12 0x00005564b705b3ca in unlink_thd (thd=0x14ea1c000d58)at /test/11.8_dbg/sql/mysqld.cc:2865
      #13 0x00005564b73eda85 in do_handle_one_connection (connect=0x5564bc11c058, put_in_cache=true) at /test/11.8_dbg/sql/sql_connect.cc:1426
      #14 0x00005564b73ed7be in handle_one_connection (arg=0x5564bc10c1e8)at /test/11.8_dbg/sql/sql_connect.cc:1327
      #15 0x000014ea8a69ca94 in start_thread (arg=<optimized out>)at ./nptl/pthread_create.c:447
      #16 0x000014ea8a729c3c in clone3 ()at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
      

      However, the issue has this far not reproduced once despite of a variety of attempts.

      Attachments

        1. sql1.sql
          5.27 MB
        2. sql2.sql
          2.58 MB

        Activity

          bnestere Elkin knielsen Hi! Any ideas?
          susil.behera FYI (XA)

          Roel Roel Van de Paar added a comment - bnestere Elkin knielsen Hi! Any ideas? susil.behera FYI (XA)
          Elkin Andrei Elkin added a comment -

          Roel, thanks for reporting! I tried to build an execution but could not succeed quickly. It all looks like a DML statement of an XA failed and did before binlogging of XA-Prepare an internal error, like write to a binlog cache occurs and after that the error status was not cleared.
          The latter should not have happened at all.

          If you still have a query log that might help to investigate deeper.

          Elkin Andrei Elkin added a comment - Roel , thanks for reporting! I tried to build an execution but could not succeed quickly. It all looks like a DML statement of an XA failed and did before binlogging of XA-Prepare an internal error, like write to a binlog cache occurs and after that the error status was not cleared. The latter should not have happened at all. If you still have a query log that might help to investigate deeper.
          Roel Roel Van de Paar added a comment - - edited

          Elkin Thank you for having a look!
          Analyzing the SQL logs was a good idea. Here is what I found: both assertions happened upon INSERT...SELECT statements:
          From sql1.sql:

          INSERT INTO t3 SELECT * FROM t1;#ERROR: 2013 - Lost connection to server during query
          

          From sql2.sql:

          INSERT INTO t1 SELECT A.a + B.a*10+C.a*100 FROM t0 A, t0 B, t0 C;#ERROR: 2013 - Lost connection to server during query
          

          I have attached the full traces. The result of each query's exec is stored behind the query (#...).
          Runs stop executing when ~250 consecutive queries fail, so these failing queries are about 240-245 lines from the end.
          SQL query execution was single threaded. While there may be replication related sql statements, no actual replication setup was present (i.e. no replica).

          Roel Roel Van de Paar added a comment - - edited Elkin Thank you for having a look! Analyzing the SQL logs was a good idea. Here is what I found: both assertions happened upon INSERT...SELECT statements: From sql1.sql : INSERT INTO t3 SELECT * FROM t1;#ERROR: 2013 - Lost connection to server during query From sql2.sql : INSERT INTO t1 SELECT A.a + B.a*10+C.a*100 FROM t0 A, t0 B, t0 C;#ERROR: 2013 - Lost connection to server during query I have attached the full traces. The result of each query's exec is stored behind the query ( # ...). Runs stop executing when ~250 consecutive queries fail, so these failing queries are about 240-245 lines from the end. SQL query execution was single threaded. While there may be replication related sql statements, no actual replication setup was present (i.e. no replica).
          Roel Roel Van de Paar added a comment - - edited

          The error log in connection with the sql2.sql trace looks quite uneventful, except that the assert happens post-shutdown:

          CS 11.8.1 33e0796e7a154e02a5e53c55cefc5d6feb4f5e6d (Debug) Build 15/02/2025

          2025-02-16  5:26:45 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd: ready for connections.
          Version: '11.8.1-MariaDB-debug-log'  socket: '/dev/shm/697978/1978/socket.sock'  port: 31708  MariaDB Server
          2025-02-16  5:26:46 8 [Note] Deleted Master_info file '/dev/shm/697978/1978/data/master.info'.
          2025-02-16  5:26:46 8 [Note] Deleted Master_info file '/dev/shm/697978/1978/data/relay-log.info'.
          2025-02-16  5:26:46 8 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
          2025-02-16  5:26:46 8 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MariaDB server acts as a replica and has its hostname changed. Please use '--log-basename=#' or '--relay-log=pid-relay-bin' to avoid this problem.
          2025-02-16  5:26:46 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='', master_port='3306', master_log_file='', master_log_pos='4'.
          2025-02-16  5:26:46 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos
          2025-02-16  5:26:50 8 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
          2025-02-16  5:26:50 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='', master_port='3306', master_log_file='', master_log_pos='4'.
          2025-02-16  5:26:50 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos
          2025-02-16  5:26:53 8 [Note] Master connection name: ''  Master_info_file: 'master.info'  Relay_info_file: 'relay-log.info'
          2025-02-16  5:26:53 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='3306', master_log_file='', master_log_pos='4'.
          2025-02-16  5:26:53 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos
          2025-02-16  5:27:12 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
          2025-02-16  5:27:32 0 [Warning] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd: Thread 8 (user : 'root') did not exit
           
          mariadbd: /test/11.8_dbg/sql/log.cc:2480: int binlog_rollback(THD *, bool): Assertion `!all' failed.
          

          And in the trace for sql1.sql we see 'Local temporary space limit reached' (though somewhat earlier in the log), and again the assert happens post-shutdown:

          CS 11.8.1 33e0796e7a154e02a5e53c55cefc5d6feb4f5e6d (Debug) Build 15/02/2025

          2025-02-15 18:24:01 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd: ready for connections.
          Version: '11.8.1-MariaDB-debug-log'  socket: '/dev/shm/408286/811/socket.sock'  port: 42209  MariaDB Server
          2025-02-15 18:24:14 8 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MariaDB server acts as a replica and has its hostname changed. Please use '--log-basename=#' or '--relay-log=pid-relay-bin' to avoid this problem.
          2025-02-15 18:24:17 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='', master_port='3306', master_log_file='', master_log_pos='4'.
          2025-02-15 18:24:17 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos
          2025-02-15 18:24:17 8 [ERROR] Write to binary log failed: Local temporary space limit reached. An incident event is written to binary log and slave will be stopped.
           
          2025-02-15 18:24:18 8 [ERROR] Write to binary log failed: Local temporary space limit reached. An incident event is written to binary log and slave will be stopped.
           
          2025-02-15 18:24:29 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown
          mariadbd: /test/11.8_dbg/sql/log.cc:2480: int binlog_rollback(THD *, bool): Assertion `!all' failed.
          

          Roel Roel Van de Paar added a comment - - edited The error log in connection with the sql2.sql trace looks quite uneventful, except that the assert happens post-shutdown: CS 11.8.1 33e0796e7a154e02a5e53c55cefc5d6feb4f5e6d (Debug) Build 15/02/2025 2025-02-16 5:26:45 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd: ready for connections. Version: '11.8.1-MariaDB-debug-log' socket: '/dev/shm/697978/1978/socket.sock' port: 31708 MariaDB Server 2025-02-16 5:26:46 8 [Note] Deleted Master_info file '/dev/shm/697978/1978/data/master.info'. 2025-02-16 5:26:46 8 [Note] Deleted Master_info file '/dev/shm/697978/1978/data/relay-log.info'. 2025-02-16 5:26:46 8 [Note] Master connection name: '' Master_info_file: 'master.info' Relay_info_file: 'relay-log.info' 2025-02-16 5:26:46 8 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MariaDB server acts as a replica and has its hostname changed. Please use '--log-basename=#' or '--relay-log=pid-relay-bin' to avoid this problem. 2025-02-16 5:26:46 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. 2025-02-16 5:26:46 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2025-02-16 5:26:50 8 [Note] Master connection name: '' Master_info_file: 'master.info' Relay_info_file: 'relay-log.info' 2025-02-16 5:26:50 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. 2025-02-16 5:26:50 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2025-02-16 5:26:53 8 [Note] Master connection name: '' Master_info_file: 'master.info' Relay_info_file: 'relay-log.info' 2025-02-16 5:26:53 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='3306', master_log_file='', master_log_pos='4'. 2025-02-16 5:26:53 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2025-02-16 5:27:12 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown 2025-02-16 5:27:32 0 [Warning] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd: Thread 8 (user : 'root') did not exit   mariadbd: /test/11.8_dbg/sql/log.cc:2480: int binlog_rollback(THD *, bool): Assertion `!all' failed. And in the trace for sql1.sql we see 'Local temporary space limit reached' (though somewhat earlier in the log), and again the assert happens post-shutdown: CS 11.8.1 33e0796e7a154e02a5e53c55cefc5d6feb4f5e6d (Debug) Build 15/02/2025 2025-02-15 18:24:01 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd: ready for connections. Version: '11.8.1-MariaDB-debug-log' socket: '/dev/shm/408286/811/socket.sock' port: 42209 MariaDB Server 2025-02-15 18:24:14 8 [Warning] Neither --relay-log nor --relay-log-index were used; so replication may break when this MariaDB server acts as a replica and has its hostname changed. Please use '--log-basename=#' or '--relay-log=pid-relay-bin' to avoid this problem. 2025-02-15 18:24:17 8 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. 2025-02-15 18:24:17 8 [Note] Previous Using_Gtid=Slave_Pos. New Using_Gtid=Slave_Pos 2025-02-15 18:24:17 8 [ERROR] Write to binary log failed: Local temporary space limit reached. An incident event is written to binary log and slave will be stopped.   2025-02-15 18:24:18 8 [ERROR] Write to binary log failed: Local temporary space limit reached. An incident event is written to binary log and slave will be stopped.   2025-02-15 18:24:29 0 [Note] /test/MD150225-mariadb-11.8.1-linux-x86_64-dbg/bin/mariadbd (initiated by: root[root] @ localhost []): Normal shutdown mariadbd: /test/11.8_dbg/sql/log.cc:2480: int binlog_rollback(THD *, bool): Assertion `!all' failed.

          Given the above I also checked the runtime logs and found that our regular shutdown command:

          timeout --signal=9 90s ${BASEDIR}/bin/mysqladmin -uroot -S${SOCKET} shutdown > /dev/null 2>&1
          

          In both cases failed to shutdown the server in 90 seconds.
          When that happens, we execute kill -9. i.o.w. the Assertion would have happened before that, i.e. while mariadbd was still causing mysqladmin to hang.
          Given this finding, I'll try and reduce based on a delayed shutdown.

          Roel Roel Van de Paar added a comment - Given the above I also checked the runtime logs and found that our regular shutdown command: timeout --signal=9 90s ${BASEDIR} /bin/mysqladmin -uroot -S${SOCKET} shutdown > /dev/null 2>&1 In both cases failed to shutdown the server in 90 seconds. When that happens, we execute kill -9 . i.o.w. the Assertion would have happened before that, i.e. while mariadbd was still causing mysqladmin to hang. Given this finding, I'll try and reduce based on a delayed shutdown.

          People

            Roel Roel Van de Paar
            Roel Roel Van de Paar
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.