Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.5.13, 10.5.21
- Fix Version/s: None
- Environment: SuSE Linux
Description
Customer reports that the Galera ring buffer cache file (GCache) appears to get corrupted in a way that the buffer size is recorded incorrectly in one of the buffer headers.
Attachments
- screenshot-1.png (230 kB)
Activity
Hi Yurchenko - My apologies for re-opening, but we just had a customer suffering from the same problem after an upgrade from 10.4.27 to 10.5.21 with Galera 26.4.14.
Their error log looks like this:
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 6fa80f44-5c6f-11ed-b131-e306eae8a48e, offset: -1
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/134217752 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 2594073386741137408-2594073386741137408
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan... 0.0% ( 0/16777216 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found 0/1 locked buffers
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: free space: 117440512/134217728
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (16777216/16777216 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: Passing config to GCS: base_dir = /Database/prod/; base_host = 10.158.157.117; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /Database/prod/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment =
2023-08-03 9:31:32 0 [Note] WSREP: Service thread queue flushed.
2023-08-03 9:31:32 0 [Note] WSREP: ####### Assign initial position for certification: 6fa80f44-5c6f-11ed-b131-e306eae8a48e:5445163, protocol version: -1
2023-08-03 9:31:32 0 [ERROR] WSREP: Corrupt buffer header: addr: 0x7f7234024930, seqno: 2594073386741137408, size: 16777216, ctx: 0x5619670ff9d8, flags: 1537. store: 115, type: 104
230803 9:31:32 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.5.21-MariaDB-1:10.5.21+maria~ubu2004-log source revision: bed70468ea08c2820647f5e3ac006a9ff88144ac
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=20011
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 44054972 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
Printing to addr2line failed
/usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x561964f91a52]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5619649cffc5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f7240e31420]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f724093500b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7240914859]
/usr/lib/libgalera_smm.so(+0x3ebc7)[0x7f724030ebc7]
/usr/lib/libgalera_smm.so(+0x1cc3fe)[0x7f724049c3fe]
/usr/lib/libgalera_smm.so(+0x1b12da)[0x7f72404812da]
/usr/lib/libgalera_smm.so(+0x80728)[0x7f7240350728]
/usr/lib/libgalera_smm.so(+0x502a2)[0x7f72403202a2]
/usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v26C1ERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS_8provider8servicesE+0x1ec)[0x561965030d1c]
/usr/sbin/mariadbd(_ZN5wsrep8provider13make_providerERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS0_8servicesE+0x54)[0x5619650180f4]
/usr/sbin/mariadbd(_ZN5wsrep12server_state13load_providerERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKNS_8provider8servicesE+0x1f3)[0x56196501a5e3]
/usr/sbin/mariadbd(_Z10wsrep_initv+0x193)[0x561964ca7073]
/usr/sbin/mariadbd(_Z18wsrep_init_startupb+0x14)[0x561964ca7724]
/usr/sbin/mariadbd(+0x6ac161)[0x5619646de161]
/usr/sbin/mariadbd(_Z11mysqld_mainiPPc+0x403)[0x5619646e2e63]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f7240916083]
/usr/sbin/mariadbd(_start+0x2e)[0x5619646d7bbe]
FYI for anyone coming here looking for the workaround: the correct provider option is "gcache.recover", not "gcache.recovery", as in:
wsrep_provider_options="gcache.recover=off;"
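For context, this option is passed through wsrep_provider_options in the server configuration. A minimal sketch of where it goes (the file path and section name vary by distribution and are assumptions here):

```ini
# /etc/my.cnf.d/galera.cnf (location varies by distribution)
[mysqld]
# Skip GCache recovery on startup; the stale/corrupt galera.cache is
# discarded and the node rejoins via IST/SST from a donor instead.
wsrep_provider_options="gcache.recover=off;"
```

Removing the galera.cache file before restarting should have a similar effect, since there is then nothing to recover.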
@juan, it happened to us even when we used gcache.recover; after updating the Galera plugin to 4.11, we no longer face the issue. There is a GCache fix in 4.11, which you already have based on your comment:
https://fromdual.com/galera-cluster-release-notes#galera-plugin-26-4-11-release-notes
Juan reopened the ticket for the wrong reason. The ticket was about a std::bad_alloc runtime exception, i.e. a coding bug; here it is a failed heuristic in GCache scanning. This is possible since the GCache is never flushed, so its integrity cannot be guaranteed. So yes, in this case one needs to set
wsrep_provider_options="gcache.recover=off;"
There is a heuristic to detect corrupt buffers during the scan, but it cannot be 100% reliable, so such situations will sometimes arise.
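To illustrate why such a heuristic cannot be fully reliable: a scan can only apply sanity checks to the header fields, and random corruption can still produce values that pass every check. The sketch below uses a hypothetical header layout and invented check thresholds (this is NOT Galera's actual on-disk format or validation code); the "corrupt" values are taken from the error log above.

```python
import struct

# Hypothetical on-disk buffer header layout (NOT Galera's actual format):
# 8-byte size, 8-byte seqno, 2-byte flags, little-endian.
HEADER = struct.Struct("<qqH")
CACHE_SIZE = 128 * 1024 * 1024  # gcache.size = 128M, as in the log

def looks_valid(raw: bytes) -> bool:
    """Sanity-check a scanned buffer header (invented heuristic)."""
    size, seqno, _flags = HEADER.unpack(raw)
    if not (HEADER.size < size <= CACHE_SIZE):
        return False   # buffer must fit inside the ring buffer
    if seqno < -1:
        return False   # -1 marks a released buffer; anything lower is garbage
    return True

good = HEADER.pack(4096, 5445163, 0)
# Corruption can still yield a header that passes every check; these are
# the seqno/size/flags the server later rejected as a corrupt header:
corrupt = HEADER.pack(16777216, 2594073386741137408, 1537)

print(looks_valid(good))     # True
print(looks_valid(corrupt))  # True - the heuristic cannot tell them apart
```

The point is that any field-level check has false negatives: the "corrupt" header's size (16 MB) fits in a 128 MB cache and its seqno is a legal 63-bit integer, so only later consistency checks catch it, by which time the server aborts.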
Another customer (this time on version 10.5.11) is getting:
...
2022-02-11 17:31:48 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/3221225496 bytes) complete.
2022-02-11 17:31:51 0 [ERROR] WSREP: std::bad_alloc
...
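One plausible mechanism for a std::bad_alloc during the initial scan (an assumption on my part, not a confirmed root cause) is that a damaged header's size field is trusted when reserving space for the buffer, driving an impossibly large allocation. A minimal Python analogue, where MemoryError stands in for the std::bad_alloc a C++ allocation would throw:

```python
RING_BUFFER_SIZE = 3221225496  # ~3 GB gcache, as in the log above

def read_buffer(size_field: int) -> bytearray:
    """Reserve space for a scanned buffer based on its header's size field."""
    # The guard here raises MemoryError to stand in for the std::bad_alloc
    # an unchecked C++ allocation would throw when handed a garbage size.
    if size_field < 0 or size_field > RING_BUFFER_SIZE:
        raise MemoryError(f"corrupt size field: {size_field}")
    return bytearray(size_field)

print(len(read_buffer(4096)))   # 4096
try:
    read_buffer(2 ** 62)        # garbage size from a corrupt header
except MemoryError as exc:
    print("bad_alloc:", exc)
```

Under that reading, the 10.5.11 crash and the corrupt-header abort above are two symptoms of the same underlying issue: the scan encountering header bytes it cannot trust.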