Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 10.5.13, 10.5.21
- Fix Version/s: None
- Environment: SuSE Linux
Description
Customer reports that the Galera ring buffer cache file (GCache) appears to get corrupted in a way that the buffer size is recorded incorrectly in one of the buffer headers.
Attachments
- screenshot-1.png (230 kB)
Activity
Hi Yurchenko - My apologies for re-opening, but we just had a customer suffering from the same problem after an upgrade from 10.4.27 to 10.5.21 with Galera 26.4.14.
Their error log looks like this:
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 6fa80f44-5c6f-11ed-b131-e306eae8a48e, offset: -1
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/134217752 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer initial scan...100.0% (134217752/134217752 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found gapless sequence 2594073386741137408-2594073386741137408
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan... 0.0% ( 0/16777216 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: found 0/1 locked buffers
2023-08-03 9:31:32 0 [Note] WSREP: Recovering GCache ring buffer: free space: 117440512/134217728
2023-08-03 9:31:32 0 [Note] WSREP: GCache::RingBuffer unused buffers scan...100.0% (16777216/16777216 bytes) complete.
2023-08-03 9:31:32 0 [Note] WSREP: Passing config to GCS: base_dir = /Database/prod/; base_host = 10.158.157.117; base_port = 4567; cert.log_conflicts = no; cert.optimistic_pa = yes; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /Database/prod/; gcache.keep_pages_size = 0; gcache.keep_plaintext_size = 128M; gcache.mem_size = 0; gcache.name = galera.cache; gcache.page_size = 128M; gcache.recover = yes; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.fc_single_primary = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment =
2023-08-03 9:31:32 0 [Note] WSREP: Service thread queue flushed.
2023-08-03 9:31:32 0 [Note] WSREP: ####### Assign initial position for certification: 6fa80f44-5c6f-11ed-b131-e306eae8a48e:5445163, protocol version: -1
2023-08-03 9:31:32 0 [ERROR] WSREP: Corrupt buffer header: addr: 0x7f7234024930, seqno: 2594073386741137408, size: 16777216, ctx: 0x5619670ff9d8, flags: 1537. store: 115, type: 104
230803 9:31:32 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.5.21-MariaDB-1:10.5.21+maria~ubu2004-log source revision: bed70468ea08c2820647f5e3ac006a9ff88144ac
key_buffer_size=0
read_buffer_size=131072
max_used_connections=0
max_threads=20011
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 44054972 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
Printing to addr2line failed
/usr/sbin/mariadbd(my_print_stacktrace+0x32)[0x561964f91a52]
/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x5619649cffc5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x14420)[0x7f7240e31420]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f724093500b]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7f7240914859]
/usr/lib/libgalera_smm.so(+0x3ebc7)[0x7f724030ebc7]
/usr/lib/libgalera_smm.so(+0x1cc3fe)[0x7f724049c3fe]
/usr/lib/libgalera_smm.so(+0x1b12da)[0x7f72404812da]
/usr/lib/libgalera_smm.so(+0x80728)[0x7f7240350728]
/usr/lib/libgalera_smm.so(+0x502a2)[0x7f72403202a2]
/usr/sbin/mariadbd(_ZN5wsrep18wsrep_provider_v26C1ERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS_8provider8servicesE+0x1ec)[0x561965030d1c]
/usr/sbin/mariadbd(_ZN5wsrep8provider13make_providerERNS_12server_stateERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESA_RKNS0_8servicesE+0x54)[0x5619650180f4]
/usr/sbin/mariadbd(_ZN5wsrep12server_state13load_providerERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_RKNS_8provider8servicesE+0x1f3)[0x56196501a5e3]
/usr/sbin/mariadbd(_Z10wsrep_initv+0x193)[0x561964ca7073]
/usr/sbin/mariadbd(_Z18wsrep_init_startupb+0x14)[0x561964ca7724]
/usr/sbin/mariadbd(+0x6ac161)[0x5619646de161]
/usr/sbin/mariadbd(_Z11mysqld_mainiPPc+0x403)[0x5619646e2e63]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f7240916083]
/usr/sbin/mariadbd(_start+0x2e)[0x5619646d7bbe]
FYI for anyone coming here looking for the workaround: the correct provider option is "gcache.recover", not "gcache.recovery", as in:
wsrep_provider_options="gcache.recover=off;"
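For context, this option is passed through wsrep_provider_options in the server configuration. A minimal sketch of where it goes (the file path and section name vary by distribution and are assumptions here):

```ini
# /etc/my.cnf.d/galera.cnf (location varies by distribution)
[mysqld]
# Skip GCache recovery on startup; the stale/corrupt galera.cache is
# discarded and the node rejoins via IST/SST from a donor instead.
wsrep_provider_options="gcache.recover=off;"
```

Removing the galera.cache file before restarting should have a similar effect, since there is then nothing to recover.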
@juan, it happened to us even when we used gcache.recover; after updating the Galera plugin to 4.11, we no longer face the issue. There is a GCache fix in 4.11, which you already have based on your comment:
https://fromdual.com/galera-cluster-release-notes#galera-plugin-26-4-11-release-notes
Juan reopened the ticket for the wrong reason. The ticket was about a std::bad_alloc runtime exception, i.e. a coding bug; here it is a failed heuristic in GCache scanning. This is possible since the GCache is never flushed, so its integrity cannot be guaranteed. So yes, in this case one needs to set
wsrep_provider_options="gcache.recover=off;"
There is a heuristic to detect corrupt buffers during the scan, but it cannot be 100% reliable, so such situations will sometimes arise.
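To illustrate why such a heuristic cannot be fully reliable: a scan can only apply sanity checks to the header fields, and random corruption can still produce values that pass every check. The sketch below uses a hypothetical header layout and invented check thresholds (this is NOT Galera's actual on-disk format or validation code); the "corrupt" values are taken from the error log above.

```python
import struct

# Hypothetical on-disk buffer header layout (NOT Galera's actual format):
# 8-byte size, 8-byte seqno, 2-byte flags, little-endian.
HEADER = struct.Struct("<qqH")
CACHE_SIZE = 128 * 1024 * 1024  # gcache.size = 128M, as in the log

def looks_valid(raw: bytes) -> bool:
    """Sanity-check a scanned buffer header (invented heuristic)."""
    size, seqno, _flags = HEADER.unpack(raw)
    if not (HEADER.size < size <= CACHE_SIZE):
        return False   # buffer must fit inside the ring buffer
    if seqno < -1:
        return False   # -1 marks a released buffer; anything lower is garbage
    return True

good = HEADER.pack(4096, 5445163, 0)
# Corruption can still yield a header that passes every check; these are
# the seqno/size/flags the server later rejected as a corrupt header:
corrupt = HEADER.pack(16777216, 2594073386741137408, 1537)

print(looks_valid(good))     # True
print(looks_valid(corrupt))  # True - the heuristic cannot tell them apart
```

The point is that any field-level check has false negatives: the "corrupt" header's size (16 MB) fits in a 128 MB cache and its seqno is a legal 63-bit integer, so only later consistency checks catch it, by which time the server aborts.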
Another customer (this time on version 10.5.11) is getting:
...
2022-02-11 17:31:48 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/3221225496 bytes) complete.
2022-02-11 17:31:51 0 [ERROR] WSREP: std::bad_alloc
...
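One plausible mechanism for a std::bad_alloc during the initial scan (an assumption on my part, not a confirmed root cause) is that a damaged header's size field is trusted when reserving space for the buffer, driving an impossibly large allocation. A minimal Python analogue, where MemoryError stands in for the std::bad_alloc a C++ allocation would throw:

```python
RING_BUFFER_SIZE = 3221225496  # ~3 GB gcache, as in the log above

def read_buffer(size_field: int) -> bytearray:
    """Reserve space for a scanned buffer based on its header's size field."""
    # The guard here raises MemoryError to stand in for the std::bad_alloc
    # an unchecked C++ allocation would throw when handed a garbage size.
    if size_field < 0 or size_field > RING_BUFFER_SIZE:
        raise MemoryError(f"corrupt size field: {size_field}")
    return bytearray(size_field)

print(len(read_buffer(4096)))   # 4096
try:
    read_buffer(2 ** 62)        # garbage size from a corrupt header
except MemoryError as exc:
    print("bad_alloc:", exc)
```

Under that reading, the 10.5.11 crash and the corrupt-header abort above are two symptoms of the same underlying issue: the scan encountering header bytes it cannot trust.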