[MDEV-24615] MariaDB 10.5.8 Galera node fails to start with WSREP: std::bad_alloc Created: 2021-01-18  Updated: 2021-08-16  Resolved: 2021-08-16

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.5.8
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Otto Kekäläinen Assignee: Alexey
Resolution: Duplicate Votes: 1
Labels: None

Issue Links:
Relates
relates to MDEV-25739 Failed to create a new provider '/usr... Closed

 Description   

While running a MariaDB 10.5.8 Galera node and issuing `sudo systemctl restart mariadb` it failed to restart and run IST to join the cluster where it was a member of before the restart.

Maybe galera.gcache got corrupted on shutdown?

$ journalctl -fu mariadb
..
systemd[1]: Starting MariaDB 10.5.8 database server...
sh[52173]: WSREP: Recovered position 7f74fdea-cda3-11e9-ae2f-ab188b9a01f2:341781449
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] /usr/sbin/mariadbd (mysqld 10.5.8-MariaDB-1:10.5.8+maria~bionic) starting as process 52540 ...
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: Loading provider /usr/lib/galera/libgalera_smm.so initial position: 7f74fdea-cda3-11e9-ae2f-ab188b9a01f2:341781449
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: wsrep_load(): Galera 26.4.6(r1d8d67c8) by Codership Oy <info@codership.com> loaded successfully.
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: CRC-32C: using 64-bit x86 acceleration.
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: Found saved state: 7f74fdea-cda3-11e9-ae2f-ab188b9a01f2:-1, safe_to_bootstrap: 0
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: GCache DEBUG: opened preamble:
mariadbd[52540]: Version: 2
mariadbd[52540]: UUID: 7f74fdea-cda3-11e9-ae2f-ab188b9a01f2
mariadbd[52540]: Seqno: -1 - -1
mariadbd[52540]: Offset: -1
mariadbd[52540]: Synced: 0
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: 7f74fdea-cda3-11e9-ae2f-ab188b9a01f2, offset: -1
mariadbd[52540]: 2021-01-17 20:57:30 0 [Note] WSREP: GCache::RingBuffer initial scan...  0.0% (         0/5368709144 bytes) complete.
mariadbd[52540]: 2021-01-17 20:57:36 0 [ERROR] WSREP: std::bad_alloc
mariadbd[52540]: 2021-01-17 20:57:36 0 [ERROR] WSREP: Failed to create a new provider '/usr/lib/galera/libgalera_smm.so' with options 'gcache.size=5G; gcache.page_size=1G': Failed to initialize wsrep provider
mariadbd[52540]: 2021-01-17 20:57:36 0 [ERROR] WSREP: Failed to load provider
mariadbd[52540]: 2021-01-17 20:57:36 0 [ERROR] Aborting
mariadbd[52540]: Warning: Memory not freed: 88
systemd[1]: mariadb.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: mariadb.service: Failed with result 'exit-code'.
systemd[1]: Failed to start MariaDB 10.5.8 database server.

I also came across this: https://stackoverflow.com/questions/64834855/mariadb-cant-start-wsrep-stdbad-alloc/ - so it seems this is affecting multiple users.

Deleting the galera.gcache makes the server start, but then it runs a full SST and it is only a temporary solution, as whatever corrupted the galera.gcache file in the first place is still there.

 



 Comments   
Comment by Aurélien LEQUOY [ 2021-04-14 ]

I got the same trouble in 10.5, one node in cluster Crashed (he was a slave), reboot without Galera, and I got it when i tried to reboot in cluster (without be a slave)

Comment by Aurélien LEQUOY [ 2021-04-14 ]

version affected 10.5.9

Comment by Hayden Seitz [ 2021-05-05 ]

I have submitted a similar error: https://jira.mariadb.org/browse/MDEV-25605

Instead of "[ERROR] WSREP: std::bad_alloc", my cluster nodes are reporting "[ERROR] WSREP: deque::_M_new_elements_at_back"

Comment by Sergio Charrua [ 2021-05-28 ]

Have the same issue here, with MariaDB 10.5.9.
[mysqld]
event_scheduler=1
log-error=/var/log/mariadb/mariadb.log
log_slave_updates=1
relay-log-purge=1

[mariadb-10.5]
binlog_format=ROW
default-storage-engine=innodb
innodb_autoinc_lock_mode=2
innodb_locks_unsafe_for_binlog=1
wsrep_on=ON
query_cache_size=0
query_cache_type=0
datadir=/var/lib/mysql
innodb_log_file_size=200M
innodb_file_per_table
innodb_flush_log_at_trx_commit=2
wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
wsrep_cluster_address="gcomm://"
wsrep_cluster_name='dev_galera_cluster'
wsrep_node_address='192.168.41.82'
wsrep_node_name='devnode1'
wsrep_sst_method=rsync
wsrep_sst_auth=db_user:admin
server_id=11
wsrep-restart-slave=1

Though I had wsrep_on=ON, there is only 1 node running (all other nodes are off line). This setup used to work until yesterday 14:30 GMT

Yesterday's logs:
2021-05-27 14:35:43 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: b59c71a8-8bcd-11eb-850a-7fe030c3c134, offset: -1
2021-05-27 14:35:43 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/134217752 bytes) complete.
2021-05-27 14:35:43 0 [ERROR] WSREP: deque::_M_new_elements_at_back
2021-05-27 14:35:43 0 [ERROR] WSREP: Failed to create a new provider '/usr/lib64/galera-4/libgalera_smm.so' with options '': Failed to initialize wsrep
provider
2021-05-27 14:35:43 0 [ERROR] WSREP: Failed to load provider
2021-05-27 14:35:43 0 [ERROR] Aborting

I also noticed thousands of relay log files, since 1st May, with just about 600bytes sizes. The only way I had to solve the issue (mariadb not starting) was to set
wsrep_on=OFF
and then start the service. After connecting to CLI, I run :
reset slave;
and the relay logs are now cleared.

Today I made a new test, by setting wsrep to ON again. This is the result after a restart of mariadb service:

Today's log shows these errors:
2021-05-28 8:17:28 0 [Note] WSREP: Recovering GCache ring buffer: version: 2, UUID: b59c71a8-8bcd-11eb-850a-7fe030c3c134, offset: -1
2021-05-28 8:17:28 0 [Note] WSREP: GCache::RingBuffer initial scan... 0.0% ( 0/134217752 bytes) complete.
2021-05-28 8:17:28 0 [ERROR]* WSREP: deque::_M_new_elements_at_back*
2021-05-28 8:17:28 0 [ERROR] WSREP: Failed to create a new provider '/usr/lib64/galera-4/libgalera_smm.so' with options '': Failed to initialize wsrep provider
2021-05-28 8:17:28 0 [ERROR] WSREP: Failed to load provider
2021-05-28 8:17:28 0 [ERROR] Aborting

had to set wsrep=OFF once again to make it start.

Note: MariaDB was working fine since a few weeks, as a standalone node with the wsrep=ON option.
Back in early April, there was indeed 3 node with Galera cluster, but we had to turn off 2 of the nodes, leaving only 1.
There is no apparent reason for this node to not start...

Comment by Alexey [ 2021-07-19 ]

Gcache ring buffer is not synced to disk and is not guaranteed to be consistent on restart. Hence some bogus numbers can be read from disk and care must be taken of potential exceptions. In particular here insert may throw exception for bogus seqno value. Fixed in commit 4bb58377cf3ee02e4c69ce329c2e099b07c79368

Generated at Thu Feb 08 09:31:20 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.