  1. MariaDB Server
  2. MDEV-20082

MariaDB/Galera Cluster hangs with "WSREP: BF lock wait long"

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 10.2.11
    • Fix Version/s: N/A
    • Component/s: Galera
    • Labels: None
    • Environment: 3-node Galera cluster with CentOS Linux release 7.4 and 10.2.11-MariaDB

    Description

      Periodically (nearly daily) our cluster hangs. It starts with 'mysqld: WSREP: BF lock wait long' messages on one node. From that point on, only read operations work on the 3 cluster nodes, and all insert/update/delete operations hang (apparently blocked).
      We cleared the fault and restarted the cluster with the following procedure:
      1. Stop all MariaDB instances
      2. On one node: set safe_to_bootstrap to 1 in /var/lib/mysql/grastate.dat
      3. On that node: galera_new_cluster
      4. On the other nodes: systemctl start mariadb
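
The procedure above could be scripted roughly as follows. This is only a sketch under assumptions: the function names are made up for illustration, the systemd unit is assumed to be called mariadb (as in step 4), the data directory is assumed to be /var/lib/mysql, and GNU sed (as shipped with CentOS 7) is assumed for the in-place edit. Everything has to run as root.

```shell
#!/bin/sh
# Hypothetical helpers mirroring steps 1-4; nothing runs until a function is called.

# Step 1 (run on every node): stop the local MariaDB instance.
stop_node() {
    systemctl stop mariadb
}

# Step 2 (run on the node chosen for bootstrap): set safe_to_bootstrap to 1.
# $1 is the path to grastate.dat; the default is an assumption.
mark_safe_to_bootstrap() {
    grastate="${1:-/var/lib/mysql/grastate.dat}"
    # Rewrite only the safe_to_bootstrap line, leave uuid and seqno untouched.
    sed -i 's/^safe_to_bootstrap:.*/safe_to_bootstrap: 1/' "$grastate"
}

# Steps 2+3 (run on the bootstrap node only): mark the state file, then
# start a new cluster from this node.
bootstrap_node() {
    mark_safe_to_bootstrap "$1" && galera_new_cluster
}

# Step 4 (run on each remaining node): start MariaDB so it rejoins.
join_node() {
    systemctl start mariadb
}

# Usage, after sourcing this file:
#   all nodes:       stop_node
#   bootstrap node:  bootstrap_node /var/lib/mysql/grastate.dat
#   other nodes:     join_node
```

The script does not choose which node to bootstrap from; if the nodes stopped at different points, the one with the most recent state is normally the right choice.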

      We found a workaround: changing the HAProxy configuration from "sticky" to "Least-Conn".
      Because we only have one application server (1 client), this amounts to a change from a multi-master database to a single-master database system
      (we now always connect to the same Galera node).
      After this we tried to build a simple repro: we concurrently insert/update/delete the same data row from all 3 nodes in the cluster.
      This caused a cluster crash, but not with the message 'BF lock wait long':

      Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: recv_thread() joined.
      Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: Closing replication queue.
      Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882457532160 [Note] WSREP: Closing slave action queue.
      Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [Note] WSREP: Signalling provider to continue.
      Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [Note] WSREP: SST received: a1df6c9c-db3b-11e7-9216-6f0c255291b9:3058761
      Aug 6 12:54:50 drasolf mysqld: 2019-08-06 12:54:50 139882813352064 [ERROR] WSREP: Trying to launch slave threads before creating connection at 'gcomm://10.98.206.2,10.98.206.7,10.98.206.9'
      Aug 6 12:54:50 drasolf mysqld: mysqld: /home/buildbot/buildbot/padding_for_CPACK_RPM_BUILD_SOURCE_DIRS_PREFIX/mariadb-10.2.11/sql/wsrep_thd.cc:447: void wsrep_create_appliers(long int): Assertion `0' failed.
      Aug 6 12:54:50 drasolf mysqld: 190806 12:54:50 [ERROR] mysqld got signal 6 ;
      Aug 6 12:54:50 drasolf mysqld: This could be because you hit a bug. It is also possible that this binary
      Aug 6 12:54:50 drasolf mysqld: or one of the libraries it was linked against is corrupt, improperly built,
      Aug 6 12:54:50 drasolf mysqld: or misconfigured. This error can also be caused by malfunctioning hardware.
      Aug 6 12:54:50 drasolf mysqld: To report this bug, see https://mariadb.com/kb/en/reporting-bugs
      Aug 6 12:54:50 drasolf mysqld: We will try our best to scrape up some info that will hopefully help
      Aug 6 12:54:50 drasolf mysqld: diagnose the problem, but since we have already crashed,
      Aug 6 12:54:50 drasolf mysqld: something is definitely wrong and this may fail.
      Aug 6 12:54:50 drasolf mysqld: Server version: 10.2.11-MariaDB
      Aug 6 12:54:50 drasolf mysqld: key_buffer_size=134217728
      Aug 6 12:54:50 drasolf mysqld: read_buffer_size=131072
      Aug 6 12:54:50 drasolf mysqld: max_used_connections=0
      Aug 6 12:54:50 drasolf mysqld: max_threads=153
      Aug 6 12:54:50 drasolf mysqld: thread_count=7
      Aug 6 12:54:50 drasolf mysqld: It is possible that mysqld could use up to
      Aug 6 12:54:50 drasolf mysqld: key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467244 K bytes of memory
      Aug 6 12:54:50 drasolf mysqld: Hope that's ok; if not, decrease some variables in the equation.
      Aug 6 12:54:50 drasolf mysqld: Thread pointer: 0x0
      Aug 6 12:54:50 drasolf mysqld: Attempting backtrace. You can use the following information to find out
      Aug 6 12:54:50 drasolf mysqld: where mysqld died. If you see no messages after this, something went
      Aug 6 12:54:50 drasolf mysqld: terribly wrong...
      Aug 6 12:54:50 drasolf mysqld: stack_bottom = 0x0 thread_stack 0x49000

      The problem may lie in how the cluster handles two nodes attempting to change the same row at the same time, but with different values.
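
For anyone trying to reproduce this, the concurrent same-row test described above might look roughly like the sketch below. It is not the script we actually ran. The table test.conflict, the function names, and the round count are hypothetical; the node addresses are the ones from the gcomm:// line in the log; credentials are assumed to come from ~/.my.cnf.

```shell
#!/bin/sh
# Assumed table (hypothetical), created once on any node:
#   CREATE TABLE test.conflict (id INT PRIMARY KEY, val VARCHAR(64)) ENGINE=InnoDB;

# Print one insert/update/delete cycle for row id $1, writing value $2.
conflict_sql() {
    printf "INSERT IGNORE INTO test.conflict (id, val) VALUES (%s, '%s');\n" "$1" "$2"
    printf "UPDATE test.conflict SET val = '%s' WHERE id = %s;\n" "$2" "$1"
    printf "DELETE FROM test.conflict WHERE id = %s;\n" "$1"
}

# Send $2 cycles (default 1000) for row 1 to the node at $1.
# Each node writes its own address as the value, so the writes differ per node.
hammer_node() {
    host="$1"
    rounds="${2:-1000}"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        conflict_sql 1 "$host"
        i=$((i + 1))
    done | mysql --host="$host" --force   # --force: keep going after SQL errors
}

# Usage, after sourcing this file: hit all three nodes at once.
#   for node in 10.98.206.2 10.98.206.7 10.98.206.9; do
#       hammer_node "$node" &
#   done
#   wait
```

The --force option is there because conflicting writes are expected to fail with errors on some nodes, and the loop should continue past them.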

          People

            Assignee:
            jplindst Jan Lindström (Inactive)
            Reporter:
            peter.koch@cbc.de peter koch
            Votes:
            0
            Watchers:
            2

