Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25835

mariadb 10.3.29 galera cluster crashes with errors like: "[ERROR] WSREP: Trx 236236 tries to abort slave trx 236238."

Details

    • Bug
    • Status: Closed (View Workflow)
    • Critical
    • Resolution: Fixed
    • 10.2.36, 10.3.29
    • 10.2.41, 10.3.32, 10.4.22, 10.5.13, 10.6.5
    • Galera
    • None

    Description

      Starting with mariadb 10.3.29 when deploying a three node galera cluster, we are seeing crashes with errors like in [1].
      Please find attached the log files from all three nodes and their configurations. A core dump of the crash in [1] can also be provided.

      [1]

      2021-06-01  9:06:16 0 [ERROR] WSREP: Trx 236236 tries to abort slave trx 236238. This could be caused by:
      	1) unsupported configuration options combination, please check documentation.
      	2) a bug in the code.
      	3) a database corruption.
       Node consistency compromized, need to abort. Restart the node to resync with cluster.
      210601  9:06:16 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.
       
      To report this bug, see https://mariadb.com/kb/en/reporting-bugs
       
      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed, 
      something is definitely wrong and this may fail.
       
      Server version: 10.3.29-MariaDB-1:10.3.29+maria~bionic-log
      key_buffer_size=67108864
      read_buffer_size=131072
      max_used_connections=1
      max_threads=10002
      thread_count=12
      It is possible that mysqld could use up to 
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 22055535 K  bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.
       
      Thread pointer: 0x0
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x0 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5647d689bb8e]
      /usr/sbin/mysqld(handle_fatal_signal+0x515)[0x5647d63308c5]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fd6316ed980]
      /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fd631328fb7]
      /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fd63132a921]
      /usr/sbin/mysqld(+0x90a914)[0x5647d64ba914]
      /usr/sbin/mysqld(+0x90f83c)[0x5647d64bf83c]
      /usr/sbin/mysqld(handle_manager+0x1f3)[0x5647d6142a83]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fd6316e26db]
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fd63140b71f]
      The manual page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mysqld/ contains
      information that should help you find out what is causing the crash.
      Writing a core file...
      Working directory at /var/lib/mysql
      Resource Limits:
      Limit                     Soft Limit           Hard Limit           Units     
      Max cpu time              unlimited            unlimited            seconds   
      Max file size             unlimited            unlimited            bytes     
      Max data size             unlimited            unlimited            bytes     
      Max stack size            8388608              unlimited            bytes     
      Max core file size        unlimited            unlimited            bytes     
      Max resident set          unlimited            unlimited            bytes     
      Max processes             unlimited            unlimited            processes 
      Max open files            1048576              1048576              files     
      Max locked memory         16777216             16777216             bytes     
      Max address space         unlimited            unlimited            bytes     
      Max file locks            unlimited            unlimited            locks     
      Max pending signals       64055                64055                signals   
      Max msgqueue size         819200               819200               bytes     
      Max nice priority         0                    0                    
      Max realtime priority     0                    0                    
      Max realtime timeout      unlimited            unlimited            us        
      Core pattern: core
      

      Attachments

        1. os-ctrl-0-galera.cnf
          1 kB
        2. os-ctrl-0-mariadb.log
          275 kB
        3. os-ctrl-1-galera.cnf
          1 kB
        4. os-ctrl-1-mariadb.log
          373 kB
        5. os-ctrl-2-galera.cnf
          1 kB
        6. os-ctrl-2-mariadb.log
          143 kB

        Issue Links

          Activity

            jhorstmann Jan Horstmann created issue -
            julien.fritsch Julien Fritsch made changes -
            Field Original Value New Value
            wdoekes Walter Doekes added a comment -

            Okay, so we notice this too:

            2021-06-02 20:58:51 0 [Note] WSREP: Signalling provider to continue.
            2021-06-02 20:58:51 0 [Note] WSREP: SST received: 13af80aa-836a-11eb-8c23-32f57cb601d4:54087553
            2021-06-02 20:58:51 2 [Note] WSREP: Receiving IST: 1683 writesets, seqnos 54087553-54089236
            2021-06-02 20:58:51 0 [Note] WSREP: Receiving IST...  0.0% (   0/1683 events) complete.
            2021-06-02 20:58:51 0 [Note] Reading of all Master_info entries succeeded
            2021-06-02 20:58:51 0 [Note] Added new Master_info '' to hash table
            2021-06-02 20:58:51 0 [Note] /usr/sbin/mysqld: ready for connections.
            Version: '10.3.29-MariaDB-1:10.3.29+maria~bionic-log'  socket: '/run/mysqld/MY_DEPLOYMENT-mysqld.sock'  port: 3306  mariadb.org binary distribution
            2021-06-02 20:58:51 0 [ERROR] WSREP: Trx 54088471 tries to abort slave trx 54088527. This could be caused by:
                    1) unsupported configuration options combination, please check documentation.
                    2) a bug in the code.
                    3) a database corruption.
             Node consistency compromized, need to abort. Restart the node to resync with cluster.
            

            It went into a crashloop, with the same: Trx 54088471 tries to abort slave trx 54088527.

            First core dump looks like:

            #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
            #1  0x00007f30f06fa9fb in __GI_abort () at abort.c:100
            #2  0x000055d85b55d914 in wsrep_abort_slave_trx (bf_seqno=<optimized out>, victim_seqno=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:18893
            #3  0x000055d85b56283c in bg_wsrep_kill_trx (void_arg=0x7f1088023830) at ./storage/innobase/handler/ha_innodb.cc:19058
            #4  0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
            #5  0x000055d85b8ea67a in pfs_spawn_thread (arg=0x55d8c61e5228) at ./storage/perfschema/pfs.cc:1869
            #6  0x00007f30f0ab26db in start_thread (arg=0x7f1153399700) at pthread_create.c:463
            

                    case QUERY_IDLE:
                    {
                            WSREP_DEBUG("kill IDLE for " TRX_ID_FMT, victim_trx->id);      
             
                            if (wsrep_thd_exec_mode(thd) == REPL_RECV) {
                                    WSREP_DEBUG("kill BF IDLE, seqno: %lld",
                                                (long long)wsrep_thd_trx_seqno(thd));      
                                    wsrep_abort_slave_trx(arg->bf_seqno,     // <-- CRASH
                                                          wsrep_thd_trx_seqno(thd));
                                    goto ret_unlock;
                            }
                            /* This will lock thd from proceeding after net_read() */      
                            wsrep_thd_set_conflict_state(thd, ABORTING);
            

            (gdb) print *arg
            $2 = {thd_id = 67, trx_id = 7493842400, bf_seqno = 54088471, signal = 1}
             
            (gdb) print *(unsigned*)((void*)thd + 0x58e0)
            $4 = 54088527
             
            (gdb) print thd->wsrep_trx_meta
            $22 = {gtid = {uuid = {data = "\023\257\200\252\203j\021\353\214#2\365|\266", <incomplete sequence \324>}, seqno = 54088527}, depends_on = 54088417}
             
            (gdb) print thd->proc_info
            $23 = 0x7f0fb0006570 "Applied write set 54088527"
             
            (gdb) info locals
            rcode = <optimized out>
            arg = 0x7f1088023830
            thd = 0x7f0fb0000c08
            victim_trx = 0x7f242978d1f8
            awake = false
            

            (gdb) up
            #4  0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112
             
            (gdb) info locals
            next = 0x0
            cb = 0x7f1088023868
            error = 0
            abstime = {tv_sec = 94387747591648, tv_nsec = 139848169358893}
            reset_flush_time = true
            

            Went back to 10.3.25 because of MDEV-25190 (among others).


            Let me know if I can get you more debug info from the dump.

            wdoekes Walter Doekes added a comment - Okay, so we notice this too: 2021-06-02 20:58:51 0 [Note] WSREP: Signalling provider to continue. 2021-06-02 20:58:51 0 [Note] WSREP: SST received: 13af80aa-836a-11eb-8c23-32f57cb601d4:54087553 2021-06-02 20:58:51 2 [Note] WSREP: Receiving IST: 1683 writesets, seqnos 54087553-54089236 2021-06-02 20:58:51 0 [Note] WSREP: Receiving IST... 0.0% ( 0/1683 events) complete. 2021-06-02 20:58:51 0 [Note] Reading of all Master_info entries succeeded 2021-06-02 20:58:51 0 [Note] Added new Master_info '' to hash table 2021-06-02 20:58:51 0 [Note] /usr/sbin/mysqld: ready for connections. Version: '10.3.29-MariaDB-1:10.3.29+maria~bionic-log' socket: '/run/mysqld/MY_DEPLOYMENT-mysqld.sock' port: 3306 mariadb.org binary distribution 2021-06-02 20:58:51 0 [ERROR] WSREP: Trx 54088471 tries to abort slave trx 54088527. This could be caused by: 1) unsupported configuration options combination, please check documentation. 2) a bug in the code. 3) a database corruption. Node consistency compromized, need to abort. Restart the node to resync with cluster. It went into a crashloop, with the same: Trx 54088471 tries to abort slave trx 54088527. First core dump looks like: #0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51 #1 0x00007f30f06fa9fb in __GI_abort () at abort.c:100 #2 0x000055d85b55d914 in wsrep_abort_slave_trx (bf_seqno=<optimized out>, victim_seqno=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:18893 #3 0x000055d85b56283c in bg_wsrep_kill_trx (void_arg=0x7f1088023830) at ./storage/innobase/handler/ha_innodb.cc:19058 #4 0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112 #5 0x000055d85b8ea67a in pfs_spawn_thread (arg=0x55d8c61e5228) at ./storage/perfschema/pfs.cc:1869 #6 0x00007f30f0ab26db in start_thread (arg=0x7f1153399700) at pthread_create.c:463 case QUERY_IDLE: { WSREP_DEBUG("kill IDLE for " TRX_ID_FMT, victim_trx->id);   if (wsrep_thd_exec_mode(thd) == REPL_RECV) { WSREP_DEBUG("kill BF IDLE, seqno: %lld", (long long)wsrep_thd_trx_seqno(thd)); wsrep_abort_slave_trx(arg->bf_seqno, // <-- CRASH wsrep_thd_trx_seqno(thd)); goto ret_unlock; } /* This will lock thd from proceeding after net_read() */ wsrep_thd_set_conflict_state(thd, ABORTING); (gdb) print *arg $2 = {thd_id = 67, trx_id = 7493842400, bf_seqno = 54088471, signal = 1}   (gdb) print *(unsigned*)((void*)thd + 0x58e0) $4 = 54088527   (gdb) print thd->wsrep_trx_meta $22 = {gtid = {uuid = {data = "\023\257\200\252\203j\021\353\214#2\365|\266", <incomplete sequence \324>}, seqno = 54088527}, depends_on = 54088417}   (gdb) print thd->proc_info $23 = 0x7f0fb0006570 "Applied write set 54088527"   (gdb) info locals rcode = <optimized out> arg = 0x7f1088023830 thd = 0x7f0fb0000c08 victim_trx = 0x7f242978d1f8 awake = false (gdb) up #4 0x000055d85b1e5a83 in handle_manager (arg=arg@entry=0x0) at ./sql/sql_manager.cc:112   (gdb) info locals next = 0x0 cb = 0x7f1088023868 error = 0 abstime = {tv_sec = 94387747591648, tv_nsec = 139848169358893} reset_flush_time = true Went back to 10.3.25 because of MDEV-25190 (among others). Let me know if I can get you more debug info from the dump.
            jhorstmann Jan Horstmann added a comment -

            @wdoekes would you consider it to be save to go back from 10.3.29 to 10.3.28 or lower? Did you do a full SST per node and which method do you use?

            jhorstmann Jan Horstmann added a comment - @wdoekes would you consider it to be save to go back from 10.3.29 to 10.3.28 or lower? Did you do a full SST per node and which method do you use?
            wdoekes Walter Doekes added a comment -

            If you're not affected by MDEV-25190, then I guess 10.3.28 is fine. We've had a bad streak after 10.3.25, so we're reverting back to that.

            I'm not aware what ops did when reverting. SSTs generally start automatically here after certain failures, more often than appears necessary. So if there was an SST, I'm not sure if it was intentional either.

            The crash loop started on the second node, apparently. ("Rebooted, started maria on the 2nd node. Crashloop. Downgrading made it happy again.")

            wdoekes Walter Doekes added a comment - If you're not affected by MDEV-25190 , then I guess 10.3.28 is fine. We've had a bad streak after 10.3.25, so we're reverting back to that. I'm not aware what ops did when reverting. SSTs generally start automatically here after certain failures, more often than appears necessary. So if there was an SST, I'm not sure if it was intentional either. The crash loop started on the second node, apparently. ("Rebooted, started maria on the 2nd node. Crashloop. Downgrading made it happy again.")
            kjoiner Kyle Joiner (Inactive) made changes -
            Affects Version/s 10.2.36 [ 25026 ]
            julien.fritsch Julien Fritsch made changes -
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            julien.fritsch Julien Fritsch made changes -
            Assignee Jan Lindström [ jplindst ]

            Hey guys,

            are there any updates?

            We downgraded after the first three times this happened and freeze the patch cycle of a big number of database servers to prevent to hit this bug again.

            Kind regards,
            Patrick

            pschlirf Patrick Schlirf added a comment - Hey guys, are there any updates? We downgraded after the first three times this happened and freeze the patch cycle of a big number of database servers to prevent to hit this bug again. Kind regards, Patrick
            julien.fritsch Julien Fritsch made changes -
            Priority Major [ 3 ] Critical [ 2 ]

            This issue will be fixed when (if)MDEV-25114 is reviewed and applied.

            jplindst Jan Lindström (Inactive) added a comment - - edited This issue will be fixed when (if) MDEV-25114 is reviewed and applied.
            julien.fritsch Julien Fritsch made changes -
            julien.fritsch Julien Fritsch made changes -

            I need more information. Can you run with --wsrep-debug=1 and provide error logs ? Naturally, steps how to reproduce would be most useful.

            jplindst Jan Lindström (Inactive) added a comment - I need more information. Can you run with --wsrep-debug=1 and provide error logs ? Naturally, steps how to reproduce would be most useful.
            jplindst Jan Lindström (Inactive) made changes -
            Labels need_feedback
            julien.fritsch Julien Fritsch made changes -
            Assignee Jan Lindström [ jplindst ] Pramod Mahto [ pramod.mahto@mariadb.com ]
            pramod.mahto@mariadb.com Pramod Mahto made changes -
            Assignee Pramod Mahto [ pramod.mahto@mariadb.com ] Muhammad Irfan [ muhammad.irfan ]
            Richard Richard Stracke made changes -
            Assignee Muhammad Irfan [ muhammad.irfan ] Jan Lindström [ jplindst ]
            Richard Richard Stracke made changes -
            Labels need_feedback
            jplindst Jan Lindström (Inactive) made changes -
            Assignee Jan Lindström [ jplindst ] Seppo Jaakola [ seppo ]
            seppo Seppo Jaakola made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            seppo Seppo Jaakola added a comment -

            This issue seems to cause applier-applier conflicts, configuring for only one applier (wsrep_slave_threads=1) should be immediate help for this.
            I suspect this is a duplicate of MDEV-24978, as the table definition has foreign key constraint over multi-byte character set

            seppo Seppo Jaakola added a comment - This issue seems to cause applier-applier conflicts, configuring for only one applier (wsrep_slave_threads=1) should be immediate help for this. I suspect this is a duplicate of MDEV-24978 , as the table definition has foreign key constraint over multi-byte character set
            seppo Seppo Jaakola made changes -
            seppo Seppo Jaakola added a comment -

            probably a duplicate

            seppo Seppo Jaakola added a comment - probably a duplicate
            seppo Seppo Jaakola made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            jplindst Jan Lindström (Inactive) made changes -
            jplindst Jan Lindström (Inactive) made changes -
            Fix Version/s 10.2.41 [ 26032 ]
            Fix Version/s 10.3.32 [ 26029 ]
            Fix Version/s 10.4.22 [ 26031 ]
            Fix Version/s 10.5.13 [ 26026 ]
            Fix Version/s 10.6.5 [ 26034 ]
            Fix Version/s 10.2 [ 14601 ]
            Fix Version/s 10.3 [ 22126 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            danblack Daniel Black made changes -
            danblack Daniel Black added a comment -

            With 47ba5523046094db33e68c92a182491a629bbd56 reverting MDEV-24978, is this still closed?

            danblack Daniel Black added a comment - With 47ba5523046094db33e68c92a182491a629bbd56 reverting MDEV-24978 , is this still closed?

            Yes, MDEV-24978 was fixed on d5a15f04f4ab0738b0a5f993b208bcfaad522fd4

            jplindst Jan Lindström (Inactive) added a comment - Yes, MDEV-24978 was fixed on d5a15f04f4ab0738b0a5f993b208bcfaad522fd4
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 122368 ] MariaDB v4 [ 159354 ]
            hholzgra Hartmut Holzgraefe made changes -
            mariadb-jira-automation Jira Automation (IT) made changes -
            Zendesk Related Tickets 135202 147109 133910

            People

              seppo Seppo Jaakola
              jhorstmann Jan Horstmann
              Votes:
              8 Vote for this issue
              Watchers:
              18 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.