  1. MariaDB Server
  2. MDEV-27622

Galera ring buffer cache may get corrupted

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 10.5.13, 10.5.21
    • N/A
    • Galera
    • None
    • SuSE Linux

    Description

      A customer reports that the Galera ring buffer cache file appears to get corrupted in such a way that the buffer size recorded in one of the buffer headers is incorrect.

        Activity

          YK Yakov Kushnirsky added a comment -

          Another customer (this time on version 10.5.11) is getting:
          ...
          2022-02-11 17:31:48 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/3221225496 bytes) complete.
          2022-02-11 17:31:51 0 [ERROR] WSREP: std::bad_alloc
          ...

          khaiping.loh Khai Ping added a comment -

          This is seen in 10.6.5 as well.

          juan.vera Juan added a comment -

          Hi Yurchenko - My apologies for re-opening, but we just had a customer hit the same problem after an upgrade from 10.4.27 to 10.5.21 with Galera 26.4.14.

          Their error log looks like this:

          2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 6fa80f44-5c6f-11ed-b131-e306eae8a48e, offset: -1
          2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
          2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
          2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 2594073386741137408-2594073386741137408
          2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (       0/16777216 bytes) complete.
          2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found 0/1 locked buffers
          2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: free space: 117440512/134217728
          2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (16777216/16777216 bytes) complete.
          2023-08-03  9:31:32 0 [Note] WSREP: Passing config to GCS: base_dir = /Database/prod/; base_host = 10.158.157.117; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /Database/prod/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment =
          2023-08-03  9:31:32 0 [Note] WSREP: Service thread queue flushed.
          2023-08-03  9:31:32 0 [Note] WSREP: ####### Assign initial position for certification: 6fa80f44-5c6f-11ed-b131-e306eae8a48e:5445163, protocol version: -1
          2023-08-03  9:31:32 0 [ERROR] WSREP: Corrupt buffer header: addr: 0x7f7234024930, seqno: 2594073386741137408, size: 16777216, ctx: 0x5619670ff9d8, flags: 1537. store: 115, type: 104
          230803  9:31:32 [ERROR] mysqld got signal 6 ;
          This could be because you hit a bug. It is also possible that this binary
          or one of the libraries it was linked against is corrupt, improperly built,
          or misconfigured. This error can also be caused by malfunctioning hardware.
           
          To report this bug, see https://mariadb.com/kb/en/reporting-bugs
           
          We will try our best to scrape up some info that will hopefully help
          diagnose the problem, but since we have already crashed,
          something is definitely wrong and this may fail.
           
          Server version: 10.5.21-MariaDB-1:10.5.21+maria~ubu2004-log source revision: bed70468ea08c2820647f5e3ac006a9ff88144ac
          key_buffer_size=0
          read_buffer_size=131072
          max_used_connections=0
          max_threads=20011
          thread_count=0
          It is possible that mysqld could use up to
          key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 44054972 K  bytes of memory
          Hope that's ok; if not, decrease some variables in the equation.
           
          Thread pointer: 0x0
          Attempting backtrace. You can use the following information to find out
          where mysqld died. If you see no messages after this, something went
          terribly wrong...
          stack_bottom = 0x0 thread_stack 0x49000
          Printing to addr2line failed
          /usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x561964f91a52]
          /usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5619649cffc5]
          /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f7240e31420]
          /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f724093500b]
          /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7240914859]
          /usr/lib/libgalera_smm.so(+0x3ebc7)[0x7f724030ebc7]
          /usr/lib/libgalera_smm.so(+0x1cc3fe)[0x7f724049c3fe]
          /usr/lib/libgalera_smm.so(+0x1b12da)[0x7f72404812da]
          /usr/lib/libgalera_smm.so(+0x80728)[0x7f7240350728]
          /usr/lib/libgalera_smm.so(+0x502a2)[0x7f72403202a2]
          /usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v26C1ERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS_8provider8servicesE+0x1ec)[0x561965030d1c]
          /usr/sbin/mariadbd(_ZN5wsrep8provider13make_providerERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS0_8servicesE+0x54)[0x5619650180f4]
          /usr/sbin/mariadbd(_ZN5wsrep12server_state13load_providerERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKNS_8provider8servicesE+0x1f3)[0x56196501a5e3]
          /usr/sbin/mariadbd(_Z10wsrep_initv+0x193)[0x561964ca7073]
          /usr/sbin/mariadbd(_Z18wsrep_init_startupb+0x14)[0x561964ca7724]
          /usr/sbin/mariadbd(+0x6ac161)[0x5619646de161]
          /usr/sbin/mariadbd(_Z11mysqld_mainiPPc+0x403)[0x5619646e2e63]
          /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f7240916083]
          /usr/sbin/mariadbd(_start+0x2e)[0x5619646d7bbe]
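
As a reading aid for the log above, here is a minimal Python sketch that pulls the numeric fields out of the "Corrupt buffer header" line and compares them with other values printed earlier in the same log. The helper name `parse_corrupt_header` is hypothetical and not part of any Galera or MariaDB tool; the constants are copied from the log lines shown above.

```python
import re

# The error line exactly as it appears in the log above.
LINE = ("2023-08-03  9:31:32 0 [ERROR] WSREP: Corrupt buffer header: "
        "addr: 0x7f7234024930, seqno: 2594073386741137408, size: 16777216, "
        "ctx: 0x5619670ff9d8, flags: 1537. store: 115, type: 104")


def parse_corrupt_header(line):
    """Return the 'key: value' fields of a 'Corrupt buffer header' line as ints.

    Hypothetical helper: it only understands the field syntax seen in this log.
    """
    _, _, tail = line.partition("Corrupt buffer header:")
    fields = {}
    for key, value in re.findall(r"(\w+): (0x[0-9a-fA-F]+|\d+)", tail):
        fields[key] = int(value, 0)  # base 0 accepts both decimal and 0x... forms
    return fields


header = parse_corrupt_header(LINE)

# Values copied from other lines of the same log.
RECOVERED_POSITION = 5445163  # "Assign initial position for certification: ...:5445163"
GCACHE_SIZE = 134217728       # "free space: 117440512/134217728"
FREE_SPACE = 117440512

print(hex(header["seqno"]))                         # 0x2400000052000000
print(header["seqno"] > RECOVERED_POSITION)         # True
print(header["size"] == 16 * 1024 * 1024)           # True
print(GCACHE_SIZE - FREE_SPACE == header["size"])   # True
```

The seqno in the reported header is many orders of magnitude larger than the position the node recovered (5445163), while the size field (16 MiB) still fits inside the 128 MiB cache and matches the used space reported by the scan.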
          

          juan.vera Juan added a comment - - edited

          FYI for anyone coming here looking for the workaround: the correct parameter is "gcache.recover", not "gcache.recovery", as in:

          wsrep_provider_options="gcache.recover=off;"
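
If other provider options are already set, the workaround has to be merged into the existing string rather than replacing it. The following is a minimal Python sketch, assuming the value is a semicolon-separated list of key=value pairs as seen in the workaround line and in the "Passing config to GCS" log entry. The function `set_provider_option` is a hypothetical helper, not part of MariaDB or Galera.

```python
def set_provider_option(options, key, value):
    """Return a wsrep_provider_options string with `key` set to `value`.

    Hypothetical helper: existing pairs are kept in order, an existing
    entry for `key` is replaced, and a missing one is appended.
    """
    pairs = []
    replaced = False
    for item in options.split(";"):
        item = item.strip()
        if not item:
            continue  # skip empty segments such as the one after a trailing ';'
        name, _, _old = item.partition("=")
        if name.strip() == key:
            pairs.append(f"{key}={value}")
            replaced = True
        else:
            pairs.append(item)
    if not replaced:
        pairs.append(f"{key}={value}")
    return ";".join(pairs) + ";"


existing = "gcache.size=128M; gcache.recover=yes"
print(set_provider_option(existing, "gcache.recover", "off"))
# gcache.size=128M;gcache.recover=off;
```

The resulting string would then go into the configuration as the value of wsrep_provider_options.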
          

          khaiping.loh Khai Ping added a comment - - edited

          @juan, it happened to us even when we used gcache.recover. After updating the Galera plugin to 4.11, we no longer face the issue. There is a gcache fix in 4.11, which you already have based on your comment.

          https://fromdual.com/galera-cluster-release-notes#galera-plugin-26-4-11-release-notes

          Yurchenko Alexey added a comment -

          Juan reopened the ticket for the wrong reason. The ticket was about a std::bad_alloc runtime exception, i.e. a coding bug; here it is a failed heuristic in gcache scanning. This is possible because gcache is never flushed, so its integrity cannot be expected. So yes, in this case one needs to set:

          wsrep_provider_options="gcache.recover=off;"
          

          There is a heuristic to detect corrupt buffers during scan, but it cannot be 100% reliable, so sometimes such situations will arise.
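
To illustrate why such a check cannot be fully reliable, here is a toy Python sketch. The 16-byte header layout and the `looks_plausible` check are invented for illustration only; they are not Galera's on-disk format or its actual heuristic. The garbage values are the seqno and size from the error log above; the "genuine" size of 4096 is made up.

```python
import struct

# Hypothetical header: signed 64-bit seqno followed by unsigned 64-bit size.
# This is NOT Galera's real layout; it only illustrates the idea.
HEADER = struct.Struct("<qQ")


def looks_plausible(raw, offset, cache_size):
    """Toy sanity check: positive seqno, and a size that fits in the cache."""
    seqno, size = HEADER.unpack_from(raw, offset)
    return (seqno > 0
            and size >= HEADER.size
            and offset + size <= cache_size)


CACHE_SIZE = 134217728  # gcache.size = 128M, as in the log above

# A header with the recovered position as seqno and a made-up size.
genuine = HEADER.pack(5445163, 4096)
# A header carrying the seqno and size reported in the error line.
garbage = HEADER.pack(2594073386741137408, 16777216)

print(looks_plausible(genuine, 0, CACHE_SIZE))  # True
print(looks_plausible(garbage, 0, CACHE_SIZE))  # True
```

Both headers pass, because every individual field of the bad one is within range. Any check that relies on range tests rather than a checksum over flushed data can be fooled by leftover bytes that happen to look valid.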


          People

            Yurchenko Alexey
            YK Yakov Kushnirsky