[MDEV-27622] Galera ring buffer cache may get corrupted Created: 2022-01-25  Updated: 2023-12-29  Resolved: 2023-12-20

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.13, 10.5.21
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Yakov Kushnirsky Assignee: Alexey
Resolution: Fixed Votes: 1
Labels: None
Environment:

SuSE Linux


Attachments: PNG File screenshot-1.png    

 Description   

A customer reports that the Galera ring buffer cache file appears to get corrupted in such a way that the buffer size is indicated incorrectly in one of the buffer headers.



 Comments   
Comment by Yakov Kushnirsky [ 2022-02-11 ]

Another customer (this time on version 10.5.11) is getting:
...
2022-02-11 17:31:48 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/3221225496 bytes) complete.
2022-02-11 17:31:51 0 [ERROR] WSREP: std::bad_alloc
...

Comment by Khai Ping [ 2022-04-05 ]

This is seen in 10.6.5 as well.

Comment by Juan [ 2023-08-03 ]

Hi Yurchenko - My apologies for re-opening, but we just had a customer suffering from the same problem after an upgrade from 10.4.27 to 10.5.21 and galera 26.4.14.

Their error log looks like this:

2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 6fa80f44-5c6f-11ed-b131-e306eae8a48e, offset: -1
2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (        0/134217752 bytes) complete.
2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 2594073386741137408-2594073386741137408
2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...  0.0% (       0/16777216 bytes) complete.
2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found 0/1 locked buffers
2023-08-03  9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: free space: 117440512/134217728
2023-08-03  9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (16777216/16777216 bytes) complete.
2023-08-03  9:31:32 0 [Note] WSREP: Passing config to GCS: base_dir = /Database/prod/; base_host = 10.158.157.117; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /Database/prod/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment =
2023-08-03  9:31:32 0 [Note] WSREP: Service thread queue flushed.
2023-08-03  9:31:32 0 [Note] WSREP: ####### Assign initial position for certification: 6fa80f44-5c6f-11ed-b131-e306eae8a48e:5445163, protocol version: -1
2023-08-03  9:31:32 0 [ERROR] WSREP: Corrupt buffer header: addr: 0x7f7234024930, seqno: 2594073386741137408, size: 16777216, ctx: 0x5619670ff9d8, flags: 1537. store: 115, type: 104
230803  9:31:32 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.21-MariaDB-1:10.5.21+maria~ubu2004-log source revision: bed70468ea08c2820647f5e3ac006a9ff88144ac
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=20011
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 44054972 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
Printing to addr2line failed
/usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x561964f91a52]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5619649cffc5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f7240e31420]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f724093500b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7240914859]
/usr/lib/libgalera_smm.so(+0x3ebc7)[0x7f724030ebc7]
/usr/lib/libgalera_smm.so(+0x1cc3fe)[0x7f724049c3fe]
/usr/lib/libgalera_smm.so(+0x1b12da)[0x7f72404812da]
/usr/lib/libgalera_smm.so(+0x80728)[0x7f7240350728]
/usr/lib/libgalera_smm.so(+0x502a2)[0x7f72403202a2]
/usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v26C1ERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS_8provider8servicesE+0x1ec)[0x561965030d1c]
/usr/sbin/mariadbd(_ZN5wsrep8provider13make_providerERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS0_8servicesE+0x54)[0x5619650180f4]
/usr/sbin/mariadbd(_ZN5wsrep12server_state13load_providerERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKNS_8provider8servicesE+0x1f3)[0x56196501a5e3]
/usr/sbin/mariadbd(_Z10wsrep_initv+0x193)[0x561964ca7073]
/usr/sbin/mariadbd(_Z18wsrep_init_startupb+0x14)[0x561964ca7724]
/usr/sbin/mariadbd(+0x6ac161)[0x5619646de161]
/usr/sbin/mariadbd(_Z11mysqld_mainiPPc+0x403)[0x5619646e2e63]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f7240916083]
/usr/sbin/mariadbd(_start+0x2e)[0x5619646d7bbe]
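
Note for readers decoding the "Corrupt buffer header" line above: the following is a minimal Python sketch that prints the logged fields in hex and compares them with other numbers from the same log. It uses only values copied from the log. The function name `describe_header` is made up for this note, and nothing here is taken from Galera's source, so any interpretation of individual fields is an assumption.

```python
# Values copied verbatim from the error log above.
seqno = 2594073386741137408   # "seqno" in the "Corrupt buffer header" line
size = 16777216               # "size" in the same line
flags = 1537                  # "flags" in the same line
store = 115                   # "store" in the same line
type_ = 104                   # "type" in the same line
cert_position = 5445163       # seqno part of the initial certification position
gcache_total = 134217728      # total in "free space: 117440512/134217728"
gcache_free = 117440512       # free part of the same message


def describe_header():
    """Return the logged header fields in forms that are easier to eyeball."""
    return {
        "seqno_hex": hex(seqno),
        "size_hex": hex(size),
        "flags_hex": hex(flags),
        "store_chr": chr(store),
        "type_chr": chr(type_),
        # How far the header seqno is from the position the node recovered.
        "seqno_over_cert_position": seqno // cert_position,
        # Bytes the recovery scan considered occupied.
        "used_bytes": gcache_total - gcache_free,
    }


if __name__ == "__main__":
    for name, value in describe_header().items():
        print(f"{name}: {value}")
```

The seqno comes out as 0x2400000052000000, many orders of magnitude above the certification position 5445163, and the occupied space (134217728 - 117440512 = 16777216 bytes) equals the size reported for this single buffer. That looks like one garbage header having been accepted by the recovery scan, although the log alone does not prove it.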

Comment by Juan [ 2023-08-03 ]

FYI for anyone coming here looking for the workaround, the correct parameter is "gcache.recover", not "gcache.recovery", as in:

wsrep_provider_options="gcache.recover=off;"

Comment by Khai Ping [ 2023-08-04 ]

@juan, it happened to us even when we used gcache.recover. After updating the Galera plugin to 4.11, we no longer face the issue. There is a gcache fix in 4.11, which you already have based on your comment.

https://fromdual.com/galera-cluster-release-notes#galera-plugin-26-4-11-release-notes

Comment by Alexey [ 2023-12-20 ]

Juan reopened the ticket for the wrong reason. The ticket was about a std::bad_alloc runtime exception, i.e. a coding bug; here it is a failed heuristic in gcache scanning. This is possible since gcache is never flushed and its integrity is not to be expected. So yes, in this case one needs to set

wsrep_provider_options="gcache.recover=off;"

There is a heuristic to detect corrupt buffers during scan, but it cannot be 100% reliable, so sometimes such situations will arise.
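
To illustrate why such a heuristic cannot catch everything, here is a minimal Python sketch. The header layout (`HEADER`), the checks in `looks_plausible`, and all names are hypothetical and chosen only for illustration; this is not Galera's actual buffer header or its actual scan logic. The point is just that any check of the form "fields are within sane ranges" will accept some garbage that happens to fall inside those ranges.

```python
import struct

# HYPOTHETICAL on-disk header for illustration only (NOT Galera's real layout):
# little-endian int64 seqno followed by uint32 size, 12 bytes in total.
HEADER = struct.Struct("<qI")


def looks_plausible(raw: bytes, offset: int, file_len: int) -> bool:
    """Range-check a header: positive seqno, and a size that fits in the file."""
    seqno, size = HEADER.unpack_from(raw, offset)
    return seqno > 0 and size >= HEADER.size and offset + size <= file_len


FILE_LEN = 134217728  # gcache.size = 128M, as in the log above

# A header with ordinary values (seqno taken from the certification position).
good = HEADER.pack(5445163, 4096)
# An all-zero region, e.g. space that was never written.
zeroed = bytes(HEADER.size)
# The seqno and size reported in the "Corrupt buffer header" log line.
garbage = HEADER.pack(2594073386741137408, 16777216)

results = {
    "good": looks_plausible(good, 0, FILE_LEN),
    "zeroed": looks_plausible(zeroed, 0, FILE_LEN),
    "garbage": looks_plausible(garbage, 0, FILE_LEN),
}

if __name__ == "__main__":
    for name, ok in results.items():
        print(f"{name}: {'accepted' if ok else 'rejected'}")
```

Under these assumed checks the zeroed header is rejected, but the garbage header is accepted because its seqno is positive and its size fits in the file. Tightening the checks narrows the window without closing it, which is why gcache.recover=off remains the fallback.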

Generated at Thu Feb 08 09:54:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.