Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.4.18
    • Fix Version/s: 10.4.19
    • Component/s: Galera
    • Labels: None
    • Environment: Debian GNU/Linux 8.11 (jessie) x86_64

    Description

      Yesterday my entire cluster crashed and was eventually reinitialised from scratch. In the process we also updated MariaDB on some of the servers. This afternoon I had this crash on node 5, which then took out 4 other nodes, leaving only 1 (luckily, everything recovered).

      MariaDB 10.4.18
      galera: 4.7(ree4f10fc)

      The crash that started it seems to have been:
      (some information obfuscated)
      ```
      2021-04-09 14:37:31 1 [ERROR] InnoDB: Conflicting lock on table: ``.`table1` index: PRIMARY that has lock
      RECORD LOCKS space id 267 page no 21525 n bits 120 index PRIMARY of table ``.`table1` trx id 196886519 lock mode S locks rec but not gap
      Record lock, heap no 47 PHYSICAL RECORD: n_fields 8; compact format; info bits 0
      0: len 8; hex 000000000057d9c6; asc W ;;
      1: len 6; hex 000000000000; asc ;;
      2: len 7; hex 80000000000000; asc ;;
      3: len 4; hex 8000166d; asc m;;
      4: SQL NULL;
      5: len 11; hex 424d5f5748415453415050; asc astring;;
      6: len 13; hex 2b393730353637353535383033; asc astring;;
      7: len 1; hex 80; asc ;;

      2021-04-09 14:37:31 1 [ERROR] InnoDB: WSREP state:
      2021-04-09 14:37:31 1 [ERROR] WSREP: Thread BF trx_id: 196886521 thread: 1 seqno: 99055 client_state: exec client_mode: high priority transaction_mode: replaying applier: 1 toi: 0 local: 0 query: UPDATE table1 SET `profileId` = number, `isBot` = 0 WHERE `serviceId` = number AND `channelId` = 'astring' AND `channelUserId` = 'astring'?Jp`
      2021-04-09 14:37:31 1 [ERROR] WSREP: Thread BF trx_id: 196886519 thread: 34 seqno: 99054 client_state: exec client_mode: high priority transaction_mode: executing applier: 1 toi: 0 local: 0 query: INSERT INTO table2 (`uuid`, `type`, `conversationId`, `timestamp`, `channelUserId`, `channelAccountId`) VALUES ('guid', 'string', 11731790, 1617971851864, 5546465, 10799)?Jp`
      2021-04-09 14:37:31 0x7f53edbe1700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/mariadb-10.4.18/storage/innobase/lock/lock0lock.cc line 706
      InnoDB: We intentionally generate a memory trap.
      InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
      InnoDB: If you get repeated assertion failures or crashes, even
      InnoDB: immediately after the mysqld startup, there may be
      InnoDB: corruption in the InnoDB tablespace. Please refer to
      InnoDB: https://mariadb.com/kb/en/library/innodb-recovery-modes/
      InnoDB: about forcing recovery.
      210409 14:37:31 [ERROR] mysqld got signal 6 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.

      To report this bug, see https://mariadb.com/kb/en/reporting-bugs

      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.

      Server version: 10.4.18-MariaDB-1:10.4.18+maria~bionic-log
      key_buffer_size=134217728
      read_buffer_size=2097152
      max_used_connections=129
      max_threads=65537
      thread_count=78
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 403037846 K bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.

      Thread pointer: 0x7f533005de18
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f53edbe0d28 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5605d336452e]
      /usr/sbin/mysqld(handle_fatal_signal+0x515)[0x5605d2dda2d5]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f59e4c648a0]
      /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f59e3598f47]
      /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f59e359a8b1]
      /usr/sbin/mysqld(+0x5889fd)[0x5605d2ae09fd]
      /usr/sbin/mysqld(+0xa71d49)[0x5605d2fc9d49]
      /usr/sbin/mysqld(+0xa72ecd)[0x5605d2fcaecd]
      /usr/sbin/mysqld(+0xa782d1)[0x5605d2fd02d1]
      /usr/sbin/mysqld(+0xb10c37)[0x5605d3068c37]
      /usr/sbin/mysqld(+0xb16d74)[0x5605d306ed74]
      /usr/sbin/mysqld(+0xa353b2)[0x5605d2f8d3b2]
      /usr/sbin/mysqld(_ZN7handler10ha_rnd_posEPhS0_+0x132)[0x5605d2ddf562]
      /usr/sbin/mysqld(+0x7a47c3)[0x5605d2cfc7c3]
      /usr/sbin/mysqld(_ZN14Rows_log_event8find_rowEP14rpl_group_info+0x3a0)[0x5605d2edc4e0]
      /usr/sbin/mysqld(_ZN21Update_rows_log_event11do_exec_rowEP14rpl_group_info+0xe0)[0x5605d2edcb90]
      /usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEP14rpl_group_info+0x237)[0x5605d2ecfcb7]
      /usr/sbin/mysqld(_Z18wsrep_apply_eventsP3THDP14Relay_log_infoPKvm+0x1d8)[0x5605d2d4d488]
      /usr/sbin/mysqld(_ZN22Wsrep_replayer_service15apply_write_setERKN5wsrep7ws_metaERKNS0_12const_bufferERNS0_14mutable_bufferE+0xf4)[0x5605d2d344a4]
      /usr/sbin/mysqld(+0xe83432)[0x5605d33db432]
      /usr/sbin/mysqld(_ZN5wsrep12server_state8on_applyERNS_21high_priority_serviceERKNS_9ws_handleERKNS_7ws_metaERKNS_12const_bufferE+0x155)[0x5605d33dc245]
      /usr/sbin/mysqld(+0xe98af4)[0x5605d33f0af4]
      /usr/lib/galera/libgalera_smm.so(+0x30fb5)[0x7f59e157cfb5]
      /usr/lib/galera/libgalera_smm.so(+0x4845c)[0x7f59e159445c]
      /usr/lib/galera/libgalera_smm.so(+0x1fd75)[0x7f59e156bd75]
      /usr/sbin/mysqld(_ZN5wsrep18wsrep_provider_v266replayERKNS_9ws_handleEPNS_21high_priority_serviceE+0x29)[0x5605d33f1459]
      /usr/sbin/mysqld(_ZN20Wsrep_client_service6replayEv+0xf4)[0x5605d2d340d4]
      /usr/sbin/mysqld(_ZN5wsrep11transaction6replayERSt11unique_lockINS_5mutexEE+0x96)[0x5605d33ea996]
      /usr/sbin/mysqld(_ZN5wsrep11transaction15after_statementEv+0x1b4)[0x5605d33ec5a4]
      /usr/sbin/mysqld(_ZN5wsrep12client_state15after_statementEv+0xb9)[0x5605d33cf229]
      /usr/sbin/mysqld(+0x6757ad)[0x5605d2bcd7ad]
      /usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x22d2)[0x5605d2bd01e2]
      /usr/sbin/mysqld(_Z10do_commandP3THD+0x11c)[0x5605d2bd0aec]
      /usr/sbin/mysqld(_Z11tp_callbackP13TP_connection+0xef)[0x5605d2dab70f]
      /usr/sbin/mysqld(+0xa0b3e0)[0x5605d2f633e0]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f59e4c596db]
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f59e367ba3f]
      ```

      Other nodes then tried to recover, performed quorum checks and started state transfer routines, which caused 4 of them to crash with this stack trace:
      ```
      =================================================
      2021-04-09 14:37:44 17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
      210409 14:37:44 [ERROR] mysqld got signal 11 ;
      This could be because you hit a bug. It is also possible that this binary
      or one of the libraries it was linked against is corrupt, improperly built,
      or misconfigured. This error can also be caused by malfunctioning hardware.

      To report this bug, see https://mariadb.com/kb/en/reporting-bugs

      We will try our best to scrape up some info that will hopefully help
      diagnose the problem, but since we have already crashed,
      something is definitely wrong and this may fail.

      Server version: 10.4.18-MariaDB-1:10.4.18+maria~bionic-log
      key_buffer_size=134217728
      read_buffer_size=2097152
      max_used_connections=91
      max_threads=65537
      thread_count=83
      It is possible that mysqld could use up to
      key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 403037846 K bytes of memory
      Hope that's ok; if not, decrease some variables in the equation.

      Thread pointer: 0x7f74ec000c08
      Attempting backtrace. You can use the following information to find out
      where mysqld died. If you see no messages after this, something went
      terribly wrong...
      stack_bottom = 0x7f75741c9e08 thread_stack 0x49000
      /usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x5632ddda152e]
      /usr/sbin/mysqld(handle_fatal_signal+0x515)[0x5632dd8172d5]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x128a0)[0x7f7b5ac268a0]
      /usr/sbin/mysqld(+0x60843d)[0x5632dd59d43d]
      /usr/sbin/mysqld(_Z11lock_tablesP3THDP10TABLE_LISTjj+0x4a0)[0x5632dd5a3460]
      /usr/sbin/mysqld(_Z20open_and_lock_tablesP3THDRK14DDL_options_stP10TABLE_LISTbjP19Prelocking_strategy+0x8a)[0x5632dd5a4eea]
      /usr/sbin/mysqld(_Z24open_n_lock_single_tableP3THDP10TABLE_LIST13thr_lock_typejP19Prelocking_strategy+0x41)[0x5632dd5a4f91]
      /usr/sbin/mysqld(+0x7f78a7)[0x5632dd78c8a7]
      /usr/sbin/mysqld(+0x7f8324)[0x5632dd78d324]
      /usr/sbin/mysqld(_ZNK12Wsrep_schema12restore_viewEP3THDRKN5wsrep2idE+0x24b)[0x5632dd79039b]
      /usr/sbin/mysqld(_ZN20Wsrep_server_service8log_viewEPN5wsrep21high_priority_serviceERKNS0_4viewE+0x700)[0x5632dd999dc0]
      /usr/sbin/mysqld(_ZN5wsrep12server_state7on_viewERKNS_4viewEPNS_21high_priority_serviceE+0x3c5)[0x5632dde1eef5]
      /usr/sbin/mysqld(+0xe9bc1c)[0x5632dde30c1c]
      /usr/lib/galera/libgalera_smm.so(+0x3ee89)[0x7f7b5754ce89]
      /usr/lib/galera/libgalera_smm.so(+0x3fc98)[0x7f7b5754dc98]
      /usr/lib/galera/libgalera_smm.so(+0x4b188)[0x7f7b57559188]
      /usr/lib/galera/libgalera_smm.so(+0x4b3be)[0x7f7b575593be]
      /usr/lib/galera/libgalera_smm.so(+0x7c848)[0x7f7b5758a848]
      /usr/lib/galera/libgalera_smm.so(+0x7caa2)[0x7f7b5758aaa2]
      /usr/lib/galera/libgalera_smm.so(+0x49890)[0x7f7b57557890]
      /usr/lib/galera/libgalera_smm.so(+0x1c2ab)[0x7f7b5752a2ab]
      /usr/sbin/mysqld(_ZN5wsrep18wsrep_provider_v2611run_applierEPNS_21high_priority_serviceE+0xe)[0x5632dde2e1de]
      /usr/sbin/mysqld(+0x7f758f)[0x5632dd78c58f]
      /usr/sbin/mysqld(_Z15start_wsrep_THDPv+0x413)[0x5632dd77bf33]
      /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7f7b5ac1b6db]
      2021-04-09 14:37:46 0 [Note] WSREP: (1cc929aa-b399, 'tcp://0.0.0.0:4567') reconnecting to dd6e91d4-a06b (tcp://10.30.126.13:4567), attempt 0
      2021-04-09 14:37:46 0 [Note] WSREP: (1cc929aa-b399, 'tcp://0.0.0.0:4567') reconnecting to cc37987d-b641 (tcp://10.35.126.15:4567), attempt 0
      /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7f7b5963da3f]

      Trying to get some variables.
      Some pointers may be invalid and cause the dump to abort.
      Query (0x0): (null)
      Connection ID (thread ID): 17
      Status: NOT_KILLED
      ```

      This all looks very abnormal... I'm not sure how to analyse this.
      The only thing I could find that looks somewhat similar is MDEV-23851.

      Any help with figuring out what these stack traces mean would be appreciated.

    Activity

            marko Marko Mäkelä added a comment:

            jplindst changed that code in MDEV-24923. The function wsrep_assert_no_bf_bf_wait() that reports the fatal error would only be available in debug builds.

            I think that he should analyze this failure and comment on whether it could be a real error.
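            As an aside, "would only be available in debug builds" (after the MDEV-24923 change) refers to the common InnoDB pattern of compiling such consistency checks only when UNIV_DEBUG is defined. The following is a minimal, hypothetical C++ sketch of that gating; the function and macro names are placeholders, not MariaDB's actual code:

            ```
            #include <cstdio>

            // Hypothetical illustration of UNIV_DEBUG-style gating; not the real
            // wsrep_assert_no_bf_bf_wait() implementation.
            #ifdef UNIV_DEBUG
            static void bf_bf_consistency_check()
            {
              // Debug build: verify the BF-BF lock ordering and abort loudly if it is violated.
              std::puts("debug build: BF-BF consistency check runs here");
            }
            # define BF_BF_CHECK() bf_bf_consistency_check()
            #else
            // Release build: the check is compiled out entirely.
            # define BF_BF_CHECK() do {} while (0)
            #endif

            int main()
            {
              BF_BF_CHECK();  // present only when built with -DUNIV_DEBUG
              return 0;
            }
            ```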

            djhartman Derk-Jan Hartman added a comment:

            For various reasons we have since updated the cluster and we are now on 10.5.9. We also switched wsrep_sst_method from rsync to mariabackup.

            I think, however, that we are running into a similar problem; see also MDEV-25405. So far the problem just hasn't been as catastrophic as on our older setup, where it caused a complete cluster lockup on the nodes.
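            For reference, that SST switch is a single configuration change. The snippet below is a minimal, hypothetical my.cnf sketch (the provider path matches the stack traces above; the cluster addresses and SST credentials are placeholders, not values from this report):

            ```
            # galera.cnf -- illustrative sketch only
            [mysqld]
            wsrep_on              = ON
            wsrep_provider        = /usr/lib/galera/libgalera_smm.so
            wsrep_cluster_address = gcomm://10.0.0.1,10.0.0.2,10.0.0.3   # placeholder node IPs
            # State snapshot transfer method: rsync -> mariabackup
            wsrep_sst_method      = mariabackup
            wsrep_sst_auth        = sst_user:sst_password                # placeholder credentials
            ```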

            jplindst Jan Lindström (Inactive) added a comment:

            trx_id: 196886519 has seqno: 99054 and is holding the lock, while trx_id: 196886521 with seqno: 99055 is requesting a conflicting record lock. Because trx_id: 196886521 has the larger seqno, it should wait for its turn, i.e. do a lock wait. This has now been fixed in MDEV-24923.
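            To make that ordering rule concrete, here is a small, hypothetical C++ sketch (not the actual MariaDB code; the type and function names are invented) of the decision between queueing a lock wait and flagging a genuine BF-BF ordering violation:

            ```
            #include <cassert>
            #include <cstdint>

            // Hypothetical model of a brute-force (BF) applier transaction.
            struct bf_trx {
              uint64_t trx_id;  // InnoDB transaction id
              int64_t  seqno;   // Galera certification sequence number
            };

            // Decide what a conflicting BF record-lock request should do.
            // Returns true when the requester should simply enqueue a lock wait.
            bool bf_bf_should_wait(const bf_trx& holder, const bf_trx& requester) {
              // Later write set waits for the earlier one (certification order).
              if (requester.seqno > holder.seqno)
                return true;
              // Requester is ordered before the holder: replication ordering has been
              // violated, which is what the (debug) assertion is meant to catch.
              return false;
            }

            int main() {
              // The pair from this report: holder 196886519/99054, requester 196886521/99055.
              bf_trx holder{196886519, 99054};
              bf_trx requester{196886521, 99055};
              assert(bf_bf_should_wait(holder, requester));  // requester does a lock wait
              return 0;
            }
            ```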

            djhartman Derk-Jan Hartman added a comment:

            As far as I can tell, 10.4.19 and 10.5.10 have not yet been packaged for Ubuntu. But as soon as they are available we will update, I guess, because the HA part isn't too HA right now.
            thefsb Tom added a comment (edited):

            How long should I expect to wait until a Debian package is available?

            I had a cluster of 3 nodes running 10.4.15. One was upgraded to 10.4.18 on Thursday, and on Saturday morning I got a crash like the one Derk-Jan Hartman reported. I worry that we'll get more such crashes on that node if I don't revert it to 10.4.15.

            Had another crash this afternoon. Will look at reverting now.


            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: djhartman Derk-Jan Hartman
              Votes: 0
              Watchers: 5
