  1. MariaDB Server
  2. MDEV-25883

Galera Cluster hangs while "DELETE FROM mysql.wsrep_cluster"

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 10.4.18
    • Fix Version/s: 10.4.22, 10.5.13
    • Component/s: Galera
    • Labels: None
    • Environment: CentOS 7.8

    Description

      Hi,

      I have a setup with 3 Galera nodes on MariaDB 10.4.18.
      I have a blocking issue with MariaDB stability during network fluctuations (this usually happens when my client performs network interventions or has hardware issues on the network).

      The issue is the following:
      1. After a few ups and downs of the network on a single node, the Galera cluster goes into a complete freeze.
      2. The node that lost its connection is OK, as it does not receive traffic anymore.
      3. The other 2 nodes are accessible via a mysql connection with the root account, but when I check the process list in MariaDB I always see the following query hanging in the "updating" state: DELETE FROM mysql.wsrep_cluster.

      • I see other queries after this one, but they all have a lower time than "DELETE FROM mysql.wsrep_cluster", which makes me think that this query somehow locks the whole database and the other queries are waiting for it. Unfortunately, it never finishes.
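The comparison of process-list times described above can be sketched in Python. This is only a heuristic illustration, not a tool from this report: the `ProcessRow` type, the `split_by_suspect` helper, and the sample rows are all hypothetical, and a lower Time value does not prove that a thread is blocked by the suspect statement.

```python
from typing import NamedTuple


class ProcessRow(NamedTuple):
    """One row of SHOW PROCESSLIST, reduced to the columns used here."""
    thread_id: int
    time_s: int   # the "Time" column, in seconds
    state: str
    info: str     # statement text; empty for idle threads


def split_by_suspect(rows, suspect_sql="DELETE FROM mysql.wsrep_cluster"):
    """Return (suspect_row, statements_with_a_lower_time).

    Heuristic only: a lower Time suggests the statement started after the
    suspect, but the process list does not show which lock a thread awaits.
    """
    suspects = [r for r in rows if suspect_sql.lower() in r.info.lower()]
    if not suspects:
        return None, []
    suspect = max(suspects, key=lambda r: r.time_s)
    later = [
        r for r in rows
        if r.info
        and r.thread_id != suspect.thread_id
        and r.time_s < suspect.time_s
    ]
    return suspect, sorted(later, key=lambda r: r.time_s, reverse=True)


# Hypothetical rows for illustration; they are not taken from the report.
sample = [
    ProcessRow(11, 900, "Updating", "DELETE FROM mysql.wsrep_cluster"),
    ProcessRow(12, 850, "Commit", "INSERT INTO t VALUES (1)"),
    ProcessRow(13, 1200, "", ""),
    ProcessRow(14, 30, "Init", "UPDATE t SET a = 2"),
]
suspect, later = split_by_suspect(sample)
for row in later:
    print(row.thread_id, row.time_s, row.info)
```

The rows would normally come from the output of SHOW PROCESSLIST on one of the two reachable nodes; how they are fetched is left out of the sketch on purpose.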

      Logs analysis:
      1. We do not see any slow log for the query that hangs
      2. We do not see any error in MariaDB / Galera error log

      To reproduce this, we perform the following test:
      1. Start Galera on 3 nodes.
      2. We have a web service that connects to Galera; we start a tool that continuously sends queries to the database.
      3. Perform successive ifup / ifdown at intervals of about 60 seconds on a single node.
      4. After 10-15 tries, the cluster hangs in the situation described above.
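The interface flapping in step 3 can be sketched as follows. This is not the attached ifdownupmain.sh, whose contents are not shown here; the function name, the interface name `eth0`, and the reading of "about 60 seconds" as a pause after each step are assumptions. The `run` and `sleep` parameters exist only so the loop can be dry-run without touching a real interface.

```python
import subprocess
import time


def flap_interface(iface, cycles, interval_s=60,
                   run=subprocess.run, sleep=time.sleep):
    """Take `iface` down and back up `cycles` times.

    Assumes the `ifdown` and `ifup` commands are available (as on CentOS 7)
    and that the caller has the privileges to use them.
    """
    for _ in range(cycles):
        run(["ifdown", iface], check=True)   # cut the node off from the cluster
        sleep(interval_s)
        run(["ifup", iface], check=True)     # let it rejoin
        sleep(interval_s)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    flap_interface(
        "eth0",              # hypothetical interface name
        cycles=2,
        interval_s=0,
        run=lambda cmd, check: print(" ".join(cmd)),
        sleep=lambda seconds: None,
    )
```

For a real run on the test node, drop the `run` and `sleep` overrides and set `cycles` to 15 or more, since the hang is reported after 10-15 tries.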

      I'm attaching the script that we use for ifup / ifdown testing, and I'm also including the Galera configuration here:

      #Galera Provider Configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
       
      #Galera Cluster Configuration
      wsrep_cluster_name="<EDITED>"
      wsrep_cluster_address="gcomm://<EDITED>"
       
      #Galera Synchronization Configuration
      wsrep_sst_method=rsync
       
      #Galera Node Configuration
      wsrep_node_address="<EDITED>"
      wsrep_node_name="<EDITED>"
      wsrep_slave_threads = 8
       
      wsrep_provider_options="gcache.size = 2G; gcache.page_size = 1G; gcs.fc_limit = 256; gcs.fc_factor = 0.99; cert.log_conflicts = ON;"
       
      #read consistency
      wsrep_sync_wait=3
      wsrep_retry_autocommit = 1
       
      wsrep_debug = 1
      wsrep_log_conflicts = 1
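The wsrep_provider_options value above is a single string of semicolon-separated `name = value` pairs. When comparing the settings of the three nodes it can help to split that string into individual options; a minimal sketch, where the helper name `parse_provider_options` is hypothetical:

```python
def parse_provider_options(value):
    """Split a wsrep_provider_options string into a {name: value} dict."""
    options = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate the trailing ';'
        name, _, setting = item.partition("=")
        options[name.strip()] = setting.strip()
    return options


# The string from the configuration in this report.
opts = parse_provider_options(
    "gcache.size = 2G; gcache.page_size = 1G; gcs.fc_limit = 256; "
    "gcs.fc_factor = 0.99; cert.log_conflicts = ON;"
)
for name, setting in sorted(opts.items()):
    print(f"{name} = {setting}")
```

Values are kept as strings because the sketch makes no assumption about units or types.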
      

      Attachments

        1. threads-tool.php
          0.9 kB
        2. MariaDB validation diagram.png
          65 kB
        3. MariaDB-server.cnf
          4 kB
        4. MariaDB-Log.png
          24 kB
        5. ifdownupmain.sh
          0.6 kB
        6. galera-stability.sql
          2 kB
        7. galera-stability.php
          4 kB
        8. Galera Log - ETH-UP-Down-OK.txt
          5 kB
        9. Galera Log - ETH-UP-Down-FAIL.txt
          7 kB


          Activity

            Andras Marinescu (ionut.andras) added a comment:

            Hi Roel,

            It is fine. They are VPN-closed, temporary test IPs.

            Thanks for the remark.

            Roel Van de Paar (Roel) added a comment:

            Understood. Thanks.

            Marko Mäkelä (marko) added a comment:

            MDEV-23379 deprecated and ignored the setting innodb_thread_concurrency in MariaDB Server 10.5.5. I adjusted the "fix version" accordingly. The MariaDB Server 10.6 series was never affected by this.

            Andras Marinescu (ionut.andras) added a comment:

            Hi Marko,

            So, in conclusion, should setting innodb_thread_concurrency=0 improve things?

            Thanks in advance.

            Marko Mäkelä (marko) added a comment:

            ionut.andras, with regard to MDEV-23379, I did not test release series earlier than 10.5 under extreme load. It may be that 10.3 already significantly improved scalability by reducing contention on trx_sys.mutex. 10.5 introduced further changes to reduce contention on buf_pool.mutex and fil_system.mutex.

            People

              Assignee: Jan Lindström (jplindst) (Inactive)
              Reporter: Andras Marinescu (ionut.andras)
              Votes: 0
              Watchers: 9

