  1. MariaDB Server
  2. MDEV-27689

Node hangs and complete galera cluster freezes

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 10.5.8, 10.5.12, 10.6.5
    • Fix Version/s: N/A
    • Component/s: Galera
    • Labels: None
    • Environment: 3 nodes, all of them: Debian 10.11, 128 GB RAM, 32 CPUs, ~500 tables, ~420 GB data, largest table with ~500,000,000 rows

    Description

      Every few days or weeks, one of our three nodes hangs and freezes the entire Galera cluster. It resembles MDEV-24294, but with some differences.

      When one of the nodes hangs like this, I can still connect via SSH, but I cannot run the mysql client to get information about the server and the wsrep variables. I run the client command and nothing happens; the MariaDB console never opens. Nothing is written to /var/log/syslog (which is usually chatty, for example when a node connects). I cannot stop the service (service mariadb stop). When I kill the server process, the cluster comes back online and works fine. When I start the service again, the node synchronizes immediately and we have no more problems for one to three weeks.
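The symptom described above (the client command never returns) could be detected automatically by running the client under a hard timeout. The following is a minimal sketch and not part of the original report: the function name server_responds and the PROBE command line are assumptions chosen for illustration, and credentials are assumed to come from the client's option file.

```python
import subprocess


def server_responds(cmd, timeout_s=10.0):
    """Return True if cmd exits with status 0 within timeout_s seconds.

    A hung server typically leaves the client blocked forever, so the
    timeout (not the exit status) is what catches the situation above.
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False  # client never returned: treat the server as hung
    except FileNotFoundError:
        return False  # client binary is not installed at that path
    return result.returncode == 0


# Hypothetical probe: a trivial query through the local socket.
# Adjust the client name and options to match the installation.
PROBE = ["mysql", "--connect-timeout=5", "-N", "-B", "-e", "SELECT 1"]
```

Called periodically from a monitoring job, server_responds(PROBE) returning False would mark the moment to collect diagnostics before the process is killed.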

      It is not always the same node. Each of the nodes has this problem once in a while. It usually happens at night, but not at the same time (sometimes 9pm, sometimes 4am), and there is little load, memory use, or network traffic when it happens. The MariaDB server process also shows very little load while I try to run the client and nothing happens. Today it was about 0.7%. No other processes show a higher load.

      We have been running this cluster for a couple of years and have gone through quite a few MariaDB versions. The performance is very good. We have had this particular problem since 10.5.8 and still have it with 10.6.5.

      I have no clue how to investigate further or solve this problem. I would expect the cluster to exclude the hanging node and resume normal operation until that node rejoins. Having no hanging node at all would be even better.
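One way to investigate further is to look at the wsrep status variables on the nodes that still answer while the cluster is frozen. The sketch below assumes the output of SHOW GLOBAL STATUS LIKE 'wsrep_%' has been captured in batch mode (tab-separated name and value). Whether flow control is what freezes the cluster here is an assumption; the report does not establish it. The function names and the 0.5 threshold are illustrative choices.

```python
def parse_status(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict."""
    status = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition("\t")
        if sep:
            status[name] = value
    return status


def stall_indicators(status, paused_threshold=0.5):
    """Return a list of warnings derived from wsrep status variables.

    paused_threshold is an arbitrary illustrative cut-off, not a
    recommended value.
    """
    warnings = []
    if status.get("wsrep_cluster_status") != "Primary":
        warnings.append("node is not in the primary component")
    if status.get("wsrep_local_state_comment") != "Synced":
        warnings.append("node is not in the Synced state")
    # Fraction of time replication was paused by flow control
    # since the last FLUSH STATUS.
    paused = float(status.get("wsrep_flow_control_paused", "0"))
    if paused >= paused_threshold:
        warnings.append("replication is mostly paused by flow control")
    return warnings
```

If a healthy node reports a high wsrep_flow_control_paused during the freeze, that would suggest the hung node is still a cluster member and is holding back replication rather than being evicted.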

      This is our configuration:

      [client]
      port            = 3306
      socket          = /var/run/mysqld/mysqld.sock
       
      [mysqld_safe]
      socket          = /var/run/mysqld/mysqld.sock
      nice            = 0
       
      [mysqld]
      user            = mysql
      pid-file        = /var/run/mysqld/mysqld.pid
      socket          = /var/run/mysqld/mysqld.sock
      port            = 3306
      basedir         = /usr
      datadir         = /var/lib/mysql
      tmpdir=/data/tmp
      lc_messages_dir = /usr/share/mysql
      lc_messages     = en_US
      skip-external-locking
       
      character-set-server=utf8
      collation-server=utf8_general_ci
       
      bind-address=0.0.0.0
       
      max_connections         = 750
      connect_timeout         = 5
      wait_timeout            = 10000
      interactive_timeout     = 10000
      max_allowed_packet      = 1073741824
      thread_cache_size       = 128
      sort_buffer_size        = 4M
      bulk_insert_buffer_size = 16M
      tmp_table_size          = 64M
      max_heap_table_size     = 64M
       
      # MyIsam
      myisam_recover_options = BACKUP
      key_buffer_size         = 128M
      #open-files-limit       = 2000
      table_open_cache        = 2000
      myisam_sort_buffer_size = 512M
      concurrent_insert       = 2
      read_buffer_size        = 2M
      read_rnd_buffer_size    = 1M
       
      # Query Cache
      query_cache_limit               = 256K
      query_cache_size=0
      query_cache_type=0
       
      # Logging
      log_warnings            = 2
      slow_query_log          = 1
      slow_query_log_file     = /var/log/mysql/mariadb-slow.log
      long_query_time         = 2
      log_slow_verbosity      = query_plan
      log_queries_not_using_indexes = 0
       
      log_bin                 = /data/log/mysql/mariadb-bin
      log_bin_index           = /data/log/mysql/mariadb-bin.index
      binlog_expire_logs_seconds        = 10000
      max_binlog_size         = 100M
      binlog_format=ROW
       
      #Engine
      sql_mode                = NO_ENGINE_SUBSTITUTION
       
      #InnoDB
      default-storage-engine=innodb
      innodb_autoinc_lock_mode=2
      innodb_log_file_size    = 5G
      innodb_buffer_pool_size = 56G
      innodb_flush_log_at_trx_commit = 0
      innodb_log_buffer_size  = 512M
      innodb_file_per_table   = 1
      innodb_open_files       = 10000
      innodb_flush_method     = O_DIRECT
      innodb_table_locks      = 0
      innodb_lock_wait_timeout= 300
      skip-innodb-doublewrite
       
      [galera]
      bind-address=0.0.0.0
      wsrep_on=ON
      wsrep_provider=/usr/lib/galera/libgalera_smm.so
      wsrep_cluster_name="my_wsrep_cluster"
      wsrep_cluster_address=gcomm://10.200.0.7,10.200.0.6,10.200.0.9
      wsrep_node_address="10.200.0.6"
      wsrep_node_name="node_3"
      wsrep_sst_donor="node_2,node_3,"
      wsrep_sst_method=mariabackup
      wsrep_sst_auth=XXXXX
      wsrep_slave_threads=4
      wsrep_provider_options="gcache.size=2G"
       
      [mysqldump]
      quick
      quote-names
      max_allowed_packet      = 16M
       
      [mysql]
       
       
      [isamchk]
      key_buffer              = 16M
      
      
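When comparing the three nodes, it can help to extract the Galera settings from each configuration file programmatically. The following is a minimal sketch using Python's configparser; the helper names (load_cnf, unquote, galera_summary) are hypothetical, and it assumes the file is not indented and contains no !include directives.

```python
import configparser


def load_cnf(text):
    """Parse my.cnf-style text; bare options such as 'quick' get value None."""
    parser = configparser.ConfigParser(
        allow_no_value=True,   # options like skip-external-locking
        interpolation=None,    # do not treat '%' specially
        delimiters=("=",),     # keep ':' inside values such as gcomm://
    )
    parser.read_string(text)
    return parser


def unquote(value):
    """Strip surrounding whitespace and double quotes from a value."""
    return value.strip().strip('"')


def galera_summary(parser):
    """Collect the cluster node list and provider options from [galera]."""
    section = parser["galera"]
    address = unquote(section["wsrep_cluster_address"])
    prefix = "gcomm://"
    if address.startswith(prefix):
        address = address[len(prefix):]
    nodes = [node for node in address.split(",") if node]

    options = {}
    raw_options = unquote(section.get("wsrep_provider_options", fallback=""))
    for item in raw_options.split(";"):
        key, sep, value = item.strip().partition("=")
        if sep:
            options[key.strip()] = value.strip()

    node_address = unquote(section.get("wsrep_node_address", fallback=""))
    return {
        "nodes": nodes,
        "node_address": node_address,
        "self_in_cluster": node_address in nodes,
        "provider_options": options,
    }
```

Running galera_summary on each node's file and comparing the results would quickly show whether the node lists or provider options differ between nodes.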

            People

              Assignee: Jan Lindström (janlindstrom)
              Reporter: Ulrich Abelmann (uabelmann)
              Votes: 2
              Watchers: 6
