MariaDB Server / MDEV-32261

Galera Cluster does not mark lagging node as non-primary, wsrep_local_state_comment shows synced status. Entire cluster hangs with TOI.

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.6.11
    • Fix Version/s: None
    • Component/s: Galera
    • Labels: None
    • Environment: Prod

    Description

      We have a 3-node Galera cluster at the primary site and another 3-node Galera cluster at a DR site. Binlog replication runs from node 1 (the master node) of the primary cluster to node 1 of the DR cluster. In wsrep_provider_options, pc.weight is set to 2 on node 1, 1 on node 2, and 0 on node 3.
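
For reference, a minimal sketch of what these weights mean for quorum. It assumes Galera's documented weighted-quorum rule: a partition stays Primary when the sum of its members' weights is greater than half the total weight of the previous Primary Component, not counting nodes that left gracefully. The function name and node labels are illustrative and not part of Galera.

```python
def has_quorum(present_weights, previous_weights, left_gracefully_weights=()):
    """Weighted-quorum check (sketch of the documented Galera rule)."""
    # Weight of the previous Primary Component, minus nodes that left cleanly.
    baseline = sum(previous_weights) - sum(left_gracefully_weights)
    # Quorum requires strictly more than half of that baseline.
    return sum(present_weights) > baseline / 2


# pc.weight values from this report; the node labels are placeholders.
weights = {"node1": 2, "node2": 1, "node3": 0}
all_weights = list(weights.values())

# Node 1 isolated from nodes 2 and 3.
node1_alone = has_quorum([weights["node1"]], all_weights)
# Nodes 2 and 3 isolated from node 1.
node2_and_node3 = has_quorum([weights["node2"], weights["node3"]], all_weights)

print(node1_alone, node2_and_node3)
```

Under this rule node 1 keeps quorum on its own while nodes 2 and 3 together do not. Quorum is only recalculated when cluster membership changes, so a node that stays connected but stops applying write sets would not be affected by it.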

      Sometimes one of the nodes (even one with pc.weight = 1 or 0) lags behind the rest of the cluster: its wsrep_last_committed value is lower than on the other two nodes and its wsrep_local_recv_queue value is high, yet it is NOT moved to a non-Primary component. The other nodes wait on the lagging node, and all 3 nodes hang, with transactions waiting forever either on commit or on "acquiring total order isolation" (sometimes for a TRUNCATE that is not the original offender).

      Surprisingly, wsrep_cluster_status shows Primary on all nodes, wsrep_cluster_size shows 3, wsrep_local_state_comment shows "synced" on all nodes, and every node reports wsrep_ready=yes and wsrep_connected=yes. On the lagging node, wsrep_local_recv_queue stays above 1 while wsrep_last_committed remains frozen. No errors appear in the mysqld log. The issue is not resolved until we bounce the problematic node and, in some cases, the entire cluster.
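
The condition described above can be sketched as a check over per-node status snapshots. This is only an illustration: it assumes each snapshot is a dict of the wsrep status variables named above (for example collected with SHOW GLOBAL STATUS LIKE 'wsrep_%'), the function name is hypothetical, and the sample numbers are made up rather than taken from the attached logs.

```python
def find_stalled_nodes(status_by_node):
    """Return nodes that report a healthy state but trail the cluster."""
    newest = max(int(s["wsrep_last_committed"]) for s in status_by_node.values())
    stalled = []
    for node, s in status_by_node.items():
        behind = int(s["wsrep_last_committed"]) < newest
        queued = int(s["wsrep_local_recv_queue"]) > 0
        looks_healthy = (
            s["wsrep_cluster_status"] == "Primary"
            and s["wsrep_local_state_comment"] == "Synced"
            and s["wsrep_ready"] == "ON"
        )
        # Flag only the contradictory case: behind and queued, yet "healthy".
        if behind and queued and looks_healthy:
            stalled.append(node)
    return stalled


# Made-up values for illustration only.
healthy = {
    "wsrep_cluster_status": "Primary",
    "wsrep_local_state_comment": "Synced",
    "wsrep_ready": "ON",
}
sample = {
    "node1": {**healthy, "wsrep_last_committed": "1500", "wsrep_local_recv_queue": "0"},
    "node2": {**healthy, "wsrep_last_committed": "1500", "wsrep_local_recv_queue": "0"},
    "node3": {**healthy, "wsrep_last_committed": "1420", "wsrep_local_recv_queue": "37"},
}

print(find_stalled_nodes(sample))
```

A single snapshot only shows that a node is behind. To confirm that wsrep_last_committed is actually frozen, the same check would need to be repeated a few seconds apart and the values compared.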

      Also, DML replication across cluster nodes (especially deletes and updates) is very slow: a delete of 10k rows takes 2 minutes and an update takes 4 minutes to sync across all the nodes. We tried higher values for evs.send_window and wsrep_slave_threads, with no change in performance.

      All the servers involved have 4 CPUs and 32 GB RAM. RTT between nodes is under 0.3 ms:
      rtt min/avg/max/mdev = 0.231/0.262/0.275/0.027 ms
      rtt min/avg/max/mdev = 0.198/0.250/0.291/0.043 ms
      rtt min/avg/max/mdev = 0.206/0.233/0.250/0.019 ms

      my.cnf values for node 1:

      node1

      # this is only for the mysqld standalone daemon
      [mysqld]
      datadir=/data/mariadata/mysql
      socket=/data/mariadata/mysql/mysql.sock
      log_error=/data/mariadata/log/mysqld.log
      lower_case_table_names = 1
      log_bin_trust_function_creators = ON
      max_connections=1000
       
      binlog_format=row
      default_storage_engine=InnoDB
      innodb_autoinc_lock_mode=2
      innodb_flush_log_at_trx_commit=0
      innodb_buffer_pool_size=24G
       
      #replication
      server_id=1
      gtid_domain_id =21
      log_bin=/data/mariadata/binlogs/mariadb-bin
      relay_log=/data/mariadata/relaylogs/relay-bin
      log_slave_updates=ON
      expire_logs_days = 7
       
      #
      # * Galera-related settings
      #
      [galera]
      # Mandatory settings
       
      #Galera provider Configuration
      wsrep_on=ON
      wsrep_provider=/usr/lib64/galera-4/libgalera_smm.so
      # Optional settings
      wsrep_slave_threads=4
      wsrep_provider_options="gcache.size=500M;gcache.page_size=500M;pc.weight=2"
       
      #Galera cluster configuration
      wsrep_cluster_name="galera-dev"
      wsrep_cluster_address="gcomm://ip1, ip2, ip3"
       
      #Galera Node Configuration
      wsrep_node_name="hostname1"
      wsrep_node_address="ip1"
       
      #
      # Allow server to accept connections on all interfaces.
      #
      bind-address=0.0.0.0
      #
      #Galera sst configuration
      wsrep_sst_method=rsync
       
      #replication
      wsrep_gtid_mode=ON
      wsrep_gtid_domain_id=5
       
      # this is only for embedded server
      [embedded]
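
To double-check which provider options a node is actually configured with, the wsrep_provider_options string can be split into key/value pairs. A minimal sketch, assuming the semicolon-separated key=value form shown in the listing above; the helper name is hypothetical.

```python
def parse_provider_options(value):
    """Split a wsrep_provider_options string into a dict of strings."""
    options = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing or doubled semicolon
        key, _, val = item.partition("=")
        options[key.strip()] = val.strip()
    return options


# String copied from the node 1 configuration above.
node1_options = parse_provider_options(
    "gcache.size=500M;gcache.page_size=500M;pc.weight=2"
)

print(node1_options["pc.weight"])
print("evs.send_window" in node1_options)
```

For the configuration as posted, this shows pc.weight=2 and no evs.send_window entry, so the listing reflects the settings before the evs.send_window experiments mentioned earlier.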
      

      Attaching the session logs taken from Galera nodes and the mysqld log files.

          People

            Assignee: Unassigned
            Reporter: Neelima PITTA NEELIMA