Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Incomplete
Affects Version: 10.2.11
Fix Version: None
Environment: 3-node Galera cluster, CentOS Linux release 7.4, 10.2.11-MariaDB
Description
Periodically (nearly daily) our cluster hangs. It starts with 'mysqld: WSREP: BF lock wait long' messages on one node. From that point on, only read operations succeed on the three cluster nodes; all insert/update/delete operations stall (blocked?).
We cleared the fault and restarted the cluster with the following procedure:
1. Stop all MariaDB instances.
2. On one node, set safe_to_bootstrap to 1 in /var/lib/mysql/grastate.dat.
3. Run galera_new_cluster on that node.
4. On the other nodes: systemctl start mariadb
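The recovery procedure above can be sketched as follows. The systemctl and galera commands are shown as comments because they act on a live cluster; only the grastate.dat edit (step 2) is expressed as runnable code, and the helper name is ours, not part of MariaDB.

```shell
# Step 1 (all nodes):    systemctl stop mariadb
# Step 3 (one node):     galera_new_cluster
# Step 4 (other nodes):  systemctl start mariadb

# Step 2 (the one node to bootstrap from): flip the bootstrap flag.
flip_bootstrap_flag() {
  # $1: path to grastate.dat (normally /var/lib/mysql/grastate.dat)
  sed -i 's/^safe_to_bootstrap: 0$/safe_to_bootstrap: 1/' "$1"
}
```

Note that the node chosen in step 2 should be the one with the most advanced state, or writes committed after the hang may be lost.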
We found a workaround: changing the HAProxy balancing configuration from "sticky" to "least-conn". Because we have only one application server (one client), this effectively turns the multi-master database into a single-master system (we now always connect to the same Galera node).
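The workaround corresponds roughly to an HAProxy backend change like the one below. The backend and server names, the check options, and the previous balance mode are assumptions for illustration; only the switch to least-connections routing comes from the report, and the node addresses are taken from the gcomm list in the log.

```
backend galera
    # before: sticky routing, e.g. "balance source" pinning by client address
    # after: route each new connection to the node with the fewest connections
    balance leastconn
    server node1 10.98.206.2:3306 check
    server node2 10.98.206.7:3306 check
    server node3 10.98.206.9:3306 check
```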
After this we tried to build a simple repro: we insert/update/delete the same data row concurrently from all three nodes in the cluster.
This caused a cluster crash, but without the 'BF lock wait long' message:
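A minimal sketch of that repro as a shell script. The table, database, credentials, and iteration count are assumptions; the node addresses come from the gcomm list in the log below.

```shell
# Hypothetical repro: update the same row concurrently through each node.
hammer_row() {
  # $1: node address; skips silently if the mysql client is not installed
  # (e.g. when dry-running this sketch).
  command -v mysql >/dev/null || return 0
  for i in $(seq 1 500); do
    mysql -h "$1" -u app -p"$APP_PASSWORD" testdb \
      -e "UPDATE t SET val = $i WHERE id = 1;"
  done
}

for node in 10.98.206.2 10.98.206.7 10.98.206.9; do
  hammer_row "$node" &   # one concurrent writer loop per cluster node
done
wait
```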
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: recv_thread() joined.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: Closing replication queue.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: Closing slave action queue.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [Note] WSREP: Signalling provider to continue.
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [Note] WSREP: SST received: a1df6c9c-db3b-11e7-9216-6f0c255291b9:3058761
Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [ERROR] WSREP: Trying to launch slave threads before creating connection at 'gcomm://10.98.206.2,10.98.206.7,10.98.206.9'
Aug 6 12:54:50 drasolf mysqld: mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.11/sql/wsrep_thd.cc:447: void wsrep_create_appliers(long int): Assertion `0' failed.
Aug 6 12:54:50 drasolf mysqld: 190806 12:54:50 [ERROR] mysqld got signal 6 ;
Aug 6 12:54:50 drasolf mysqld: This could be because you hit a bug. It is also possible that this binary
Aug 6 12:54:50 drasolf mysqld: or one of the libraries it was linked against is corrupt, improperly built,
Aug 6 12:54:50 drasolf mysqld: or misconfigured. This error can also be caused by malfunctioning hardware.
Aug 6 12:54:50 drasolf mysqld: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
Aug 6 12:54:50 drasolf mysqld: We will try our best to scrape up some info that will hopefully help
Aug 6 12:54:50 drasolf mysqld: diagnose the problem, but since we have already crashed,
Aug 6 12:54:50 drasolf mysqld: something is definitely wrong and this may fail.
Aug 6 12:54:50 drasolf mysqld: Server version: 10.2.11-MariaDB
Aug 6 12:54:50 drasolf mysqld: key_buffer_size=134217728
Aug 6 12:54:50 drasolf mysqld: read_buffer_size=131072
Aug 6 12:54:50 drasolf mysqld: max_used_connections=0
Aug 6 12:54:50 drasolf mysqld: max_threads=153
Aug 6 12:54:50 drasolf mysqld: thread_count=7
Aug 6 12:54:50 drasolf mysqld: It is possible that mysqld could use up to
Aug 6 12:54:50 drasolf mysqld: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467244 K bytes of memory
Aug 6 12:54:50 drasolf mysqld: Hope that's ok; if not, decrease some variables in the equation.
Aug 6 12:54:50 drasolf mysqld: Thread pointer: 0x0
Aug 6 12:54:50 drasolf mysqld: Attempting backtrace. You can use the following information to find out
Aug 6 12:54:50 drasolf mysqld: where mysqld died. If you see no messages after this, something went
Aug 6 12:54:50 drasolf mysqld: terribly wrong...
Aug 6 12:54:50 drasolf mysqld: stack_bottom = 0x0 thread_stack 0x49000
There may be an issue when two nodes attempt to change the same row at the same time, but with different values.