MariaDB Server / MDEV-24294

MariaDB - Cluster freezes if node hangs

Details

    Description

      I have a recurring problem. Our database cluster, which consists of three nodes, currently fails almost daily. Each time, one of the three nodes hangs and somehow takes the whole cluster down with it. But the whole point of running a cluster is to protect us against such failures.

      The symptom is that every connection attempt times out. I connect via ssh to each of the nodes and run the command "mariadb" or "mysql". So far the command has always worked on 2 of the 3 nodes, while one node (the hanging one) does not respond. If I then restart the hanging node with "reboot -f", the cluster is healthy again after a few seconds.
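
      The manual check described above (log in to each node and see whether the client gets an answer) could be approximated by a small probe. This is only a sketch under stated assumptions: a MariaDB server sends its greeting packet first once a TCP connection is accepted, so a node that accepts the connection but sends nothing within the timeout is treated as hanging. The function name, the state labels, and the 5-second default timeout are illustrative choices, not something taken from the report.

```python
import socket


def probe_node(host, port=3306, timeout=5.0):
    """Classify a node as 'ok', 'hung', or 'unreachable' (labels are illustrative).

    'ok'          : TCP connect succeeded and the server started sending its greeting
    'hung'        : TCP connect succeeded but no byte arrived within the timeout
    'unreachable' : TCP connect failed, or the peer closed/reset without sending data
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                # The server speaks first, so one small read is enough here.
                data = sock.recv(4)
            except socket.timeout:
                return "hung"
            return "ok" if data else "unreachable"
    except OSError:
        return "unreachable"


# Example use with the node addresses from the configuration in this report:
#   for host in ("10.0.0.2", "10.0.0.3", "10.0.0.4"):
#       print(host, probe_node(host))
```

      Such a probe only shows which node stopped answering; it says nothing about why the remaining nodes stall as well.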

      A reboot without "-f" does not work because the MariaDB service cannot be stopped. Even after several hours the frozen node is not removed from the cluster.
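
      Whether the frozen node is still counted as a cluster member can be read from the wsrep status variables on one of the responsive nodes. The following is a minimal sketch, assuming the status was captured as tab-separated text with `mariadb -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'"`; the helper names and the expected cluster size of 3 are assumptions made for this example.

```python
def parse_status(text):
    """Parse 'name<TAB>value' lines (batch-mode client output) into a dict."""
    status = {}
    for line in text.splitlines():
        parts = line.strip().split("\t", 1)
        if len(parts) == 2:
            status[parts[0]] = parts[1]
    return status


def node_problems(status, expected_size=3):
    """Return a list of human-readable deviations from a healthy, synced node."""
    problems = []
    # Values expected on a node that is a synced member of the primary component.
    expected = {
        "wsrep_cluster_status": "Primary",
        "wsrep_local_state_comment": "Synced",
        "wsrep_ready": "ON",
        "wsrep_connected": "ON",
    }
    for name, wanted in expected.items():
        actual = status.get(name)
        if actual != wanted:
            problems.append(f"{name} is {actual!r}, expected {wanted!r}")
    size = status.get("wsrep_cluster_size")
    if size != str(expected_size):
        problems.append(f"wsrep_cluster_size is {size!r}, expected {expected_size}")
    return problems
```

      If a responsive node still reports a cluster size of 3 while one member is frozen, that matches the observation that the hanging node is never removed. Note that a node acting as donor during a state transfer would also be flagged by this check.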

      So far the first node has hung once and the third node twice. Each time the whole cluster became unusable.

      The command "mysqlcheck -A -e" reports "OK" for all tables, so I hope that none of them is corrupted.

      Before we upgraded to version 10.5.8, we did not have this problem. I don't know if this problem is related to the new version, so I'm reporting it here.

      We have two tables with 3 to 5 million records. The other tables (about 10 more) have between 1 and 60,000 records. The database is accessed about 20-100 times per second.

      I'm desperate about this, because the database has always been very stable.

      Does anyone have an idea?

      The configuration follows:

      The innodb_buffer_pool_size is set to 22G and the max connections to 800 (up to now, a maximum of 120 were used simultaneously).

      # /etc/mysql/mariadb.conf.d/60-galera.cnf
      #
      # * Galera-related settings
      #
      # See the examples of server wsrep.cnf files in /usr/share/mysql
       
      [galera]
      # Mandatory settings
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_address="gcomm://10.0.0.3,10.0.0.4,10.0.0.2"
      binlog_format=row
      default_storage_engine=InnoDB
      innodb_autoinc_lock_mode=2
       
      # Allow server to accept connections on all interfaces.
      bind-address=0.0.0.0
       
      # Optional settings
      #wsrep_slave_threads=1
      #innodb_flush_log_at_trx_commit=0
       
      wsrep_cluster_name="mariadb-galera-cluster"
      wsrep_sst_method=rsync
       
      # Cluster node configuration
      wsrep_node_address="10.0.0.3"
      wsrep_node_name="db-1"
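
      As a sanity check on a file like the one above, the mandatory Galera settings can be verified mechanically. This is a hedged sketch, not an official tool: it assumes the settings live in a `[galera]` section, that the file has no leading indentation and no `!include` directives, and it uses Python's `configparser`, which only approximates how the server reads option files. The function name and the accepted values are choices made for this example.

```python
import configparser

# Accepted (lower-cased) values for the settings marked as mandatory above.
REQUIRED = {
    "wsrep_on": {"on", "1"},
    "binlog_format": {"row"},
    "default_storage_engine": {"innodb"},
    "innodb_autoinc_lock_mode": {"2"},
}


def check_galera_section(cnf_text, section="galera"):
    """Return a list of problems found in the given section of an option file."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    if not parser.has_section(section):
        return [f"no [{section}] section found"]

    # Treat dashes and underscores in option names alike; drop surrounding quotes.
    opts = {}
    for key, value in parser.items(section):
        opts[key.replace("-", "_")] = (value or "").strip().strip("\"'")

    problems = []
    for name, accepted in REQUIRED.items():
        actual = opts.get(name)
        if actual is None:
            problems.append(f"{name} is not set")
        elif actual.lower() not in accepted:
            problems.append(f"{name}={actual}, expected one of {sorted(accepted)}")
    if not opts.get("wsrep_provider"):
        problems.append("wsrep_provider is not set")
    if not opts.get("wsrep_cluster_address", "").startswith("gcomm://"):
        problems.append("wsrep_cluster_address does not start with gcomm://")
    return problems
```

      For the configuration shown in this report the check finds nothing to complain about, which fits the fact that the cluster ran stably before the upgrade.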
      

          Activity

            violuke Luke Cousins added a comment -

            Thanks Seppo. Does this mean that you're confident that 10.6.x is not affected, or might it also be affected?

            erwin_se Ers Sein added a comment -

            Just wanted to add a +1, as we have seen multiple production environments where the cluster hangs completely. It doesn't happen often, at most every other week or so, and not in all environments. Regardless, it is causing a lot of frustration, because we have to reboot machines and bootstrap to get the cluster going again. This has been happening since we upgraded from 10.4.17 (to 10.4.18 and 10.4.20 so far).

            seppo Seppo Jaakola added a comment -

            violuke erwin_se 10.6 has refactored high priority transaction conflict resolution and is not affected by MDEV-23328

            Hrehora Rob added a comment -

            Is it recommended to downgrade to 10.4.17?


            jplindst Jan Lindström (Inactive) added a comment -

            commit ef2dbb8dbc3ee42b59adcd2ee4b9967ff55867a1

            People

              seppo Seppo Jaakola
              BC-M Malte Bastian
              Votes: 11
              Watchers: 20
