Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Incomplete
Affects Version: 10.2.11
Fix Version: None
Environment: 3-node Galera cluster, CentOS Linux release 7.4, 10.2.11-MariaDB
Description
Periodically (nearly daily) our cluster hangs. It starts with 'mysqld: WSREP: BF lock wait long' messages on one node. From that point on, only read operations succeed on the three cluster nodes; all insert/update/delete operations stall (blocked?).
We cleared the fault and restarted the cluster with the following procedure:
1. Stop all MariaDB instances.
2. On one node, set safe_to_bootstrap to 1 in /var/lib/mysql/grastate.dat.
3. Run galera_new_cluster on that node.
4. On the other nodes: systemctl start mariadb
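The recovery procedure above can be sketched as follows. The systemctl and galera commands are shown as comments because they act on a live cluster; only the grastate.dat edit (step 2) is expressed as runnable code, and the helper name is ours, not part of MariaDB.

```shell
# Step 1 (all nodes):    systemctl stop mariadb
# Step 3 (one node):     galera_new_cluster
# Step 4 (other nodes):  systemctl start mariadb

# Step 2 (the one node to bootstrap from): flip the bootstrap flag.
flip_bootstrap_flag() {
  # $1: path to grastate.dat (normally /var/lib/mysql/grastate.dat)
  sed -i 's/^safe_to_bootstrap: 0$/safe_to_bootstrap: 1/' "$1"
}
```

Note that the node chosen in step 2 should be the one with the most advanced state, or writes committed after the hang may be lost.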
We found a workaround: changing the HAProxy balancing configuration from "sticky" to "least-conn". Because we have only one application server (one client), this effectively turns the multi-master database into a single-master system (we now always connect to the same Galera node).
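The workaround corresponds roughly to an HAProxy backend change like the one below. The backend and server names, the check options, and the previous balance mode are assumptions for illustration; only the switch to least-connections routing comes from the report, and the node addresses are taken from the gcomm list in the log.

```
backend galera
    # before: sticky routing, e.g. "balance source" pinning by client address
    # after: route each new connection to the node with the fewest connections
    balance leastconn
    server node1 10.98.206.2:3306 check
    server node2 10.98.206.7:3306 check
    server node3 10.98.206.9:3306 check
```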
After this we tried to build a simple repro: we insert/update/delete the same data row concurrently from all three nodes in the cluster.
This caused a cluster crash, but without the 'BF lock wait long' message:
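A minimal sketch of that repro as a shell script. The table, database, credentials, and iteration count are assumptions; the node addresses come from the gcomm list in the log below.

```shell
# Hypothetical repro: update the same row concurrently through each node.
hammer_row() {
  # $1: node address; skips silently if the mysql client is not installed
  # (e.g. when dry-running this sketch).
  command -v mysql >/dev/null || return 0
  for i in $(seq 1 500); do
    mysql -h "$1" -u app -p"$APP_PASSWORD" testdb \
      -e "UPDATE t SET val = $i WHERE id = 1;"
  done
}

for node in 10.98.206.2 10.98.206.7 10.98.206.9; do
  hammer_row "$node" &   # one concurrent writer loop per cluster node
done
wait
```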
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: recv_thread() joined.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: Closing replication queue.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: Closing slave action queue.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [Note] WSREP: Signalling provider to continue.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [Note] WSREP: SST received: a1df6c9c-db3b-11e7-9216-6f0c255291b9:3058761
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [ERROR] WSREP: Trying to launch slave threads before creating connection at 'gcomm://10.98.206.2,10.98.206.7,10.98.206.9'
Aug 6 12:54:50 drasolf mysqld: mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.11/sql/wsrep_thd.cc:447: void wsrep_create_appliers(long int): Assertion `0' failed.
Aug 6 12:54:50 drasolf mysqld: 190806 12:54:50 [ERROR] mysqld got signal 6 ;
Aug 6 12:54:50 drasolf mysqld: This could be because you hit a bug. It is also possible that this binary
Aug 6 12:54:50 drasolf mysqld: or one of the libraries it was linked against is corrupt, improperly built,
Aug 6 12:54:50 drasolf mysqld: or misconfigured. This error can also be caused by malfunctioning hardware.
Aug 6 12:54:50 drasolf mysqld: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Aug 6 12:54:50 drasolf mysqld: We will try our best to scrape up some info that will hopefully help
Aug 6 12:54:50 drasolf mysqld: diagnose the problem, but since we have already crashed,
Aug 6 12:54:50 drasolf mysqld: something is definitely wrong and this may fail.
Aug 6 12:54:50 drasolf mysqld: Server version: 10.2.11-MariaDB
Aug 6 12:54:50 drasolf mysqld: key_buffer_size=134217728
Aug 6 12:54:50 drasolf mysqld: read_buffer_size=131072
Aug 6 12:54:50 drasolf mysqld: max_used_connections=0
Aug 6 12:54:50 drasolf mysqld: max_threads=153
Aug 6 12:54:50 drasolf mysqld: thread_count=7
Aug 6 12:54:50 drasolf mysqld: It is possible that mysqld could use up to
Aug 6 12:54:50 drasolf mysqld: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467244 K bytes of memory
Aug 6 12:54:50 drasolf mysqld: Hope that's ok; if not, decrease some variables in the equation.
Aug 6 12:54:50 drasolf mysqld: Thread pointer: 0x0
Aug 6 12:54:50 drasolf mysqld: Attempting backtrace. You can use the following information to find out
Aug 6 12:54:50 drasolf mysqld: where mysqld died. If you see no messages after this, something went
Aug 6 12:54:50 drasolf mysqld: terribly wrong...
Aug 6 12:54:50 drasolf mysqld: stack_bottom = 0x0 thread_stack 0x49000
There may be an issue when two nodes attempt to change the same row at the same time, but with different values.