  1. MariaDB Server
  2. MDEV-9598

Donor's rsync SST script hangs if FTWRL fails

Details

    Description

      When a joiner requests an rsync SST, wsrep_sst_rsync on the donor node executes FLUSH TABLES WITH READ LOCK (FTWRL) before donating the SST. If FLUSH TABLES WITH READ LOCK fails, the wsrep_sst_rsync process does not exit. Instead, it lingers on the donor.

      The leftover script often seems to hold locks in the database, which can cause strange problems, such as the node being stuck in the DONOR/DESYNCED state.
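
To confirm that a donor is still in the DONOR/DESYNCED state, the wsrep_local_state_comment status variable can be read with the mysql client. The helper below is a minimal sketch: the function name wsrep_state is hypothetical, and it assumes the client is run with -N -B so that the output is a single tab-separated line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MariaDB): extract the Galera state from the
# tab-separated output of
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'"
# which is read from stdin.
wsrep_state() {
  # Field 1 is the variable name, field 2 is its value.
  awk -F '\t' '$1 == "wsrep_local_state_comment" { print $2 }'
}
```

A possible invocation on the donor, assuming the client can connect without extra credentials:

    mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state_comment'" | wsrep_state

A healthy node normally reports Synced; a donor that never leaves Donor/Desynced after a failed SST matches the symptom described here.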

      To reproduce, let's say that we have a 2-node cluster: one will act as the donor, and one as the joiner.

      Let's first create and populate a table:

      CREATE DATABASE test_db;
      USE test_db;
       
      CREATE TABLE test_table (
        id int primary key,
        str varchar(50)
      );
       
      DELIMITER $$
      CREATE PROCEDURE insert_test_data()
      BEGIN
        DECLARE i INT DEFAULT 1;
       
        WHILE i < 100000 DO
          INSERT INTO `test_table` (id, str)
          VALUES (i, CONCAT('str', i));
          SET i = i + 1;
        END WHILE;
      END$$
      DELIMITER ;
       
      CALL insert_test_data();
       
      DROP PROCEDURE insert_test_data;

      Then let's stop one of the nodes and delete the datadir:

      sudo systemctl stop mysql
      sudo rm -fr /var/lib/mysql/*

      And then on the donor node, let's start some DDL that will take a long time:

      CREATE TABLE test_table_copy AS SELECT t1.str AS str1, t2.str AS str2 FROM test_table t1 JOIN test_table t2 ON t1.id != t2.id;

      Once the DDL is started, let's start the SST on the joiner:

      sudo systemctl start mysql

      The donor's log will show errors like these:

      Feb 19 12:35:35 ip-172-31-22-174 mysqld: 2016-02-19 12:35:35 139654577776384 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '172.31.19.192:4444/rsync_sst' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/'     '' --gtid '474f3f92-d723-11e5-8da3-d3e87bd5db9a:1' --gtid-domain-id '0''
      Feb 19 12:35:35 ip-172-31-22-174 mysqld: 2016-02-19 12:35:35 139655842556672 [Note] WSREP: sst_donor_thread signaled with 0
      Feb 19 12:35:35 ip-172-31-22-174 mysqld: 2016-02-19 12:35:35 139654577776384 [Note] WSREP: Flushing tables for SST...
      Feb 19 12:35:36 ip-172-31-22-174 mysqld: 2016-02-19 12:35:36 139654577776384 [Warning] WSREP: error executing 'FLUSH TABLES WITH READ LOCK': 1205 (Lock wait timeout exceeded; try restarting transaction)
      Feb 19 12:35:36 ip-172-31-22-174 mysqld: 2016-02-19 12:35:36 139654577776384 [ERROR] WSREP: Failed to flush and lock tables
      Feb 19 12:35:36 ip-172-31-22-174 mysqld: 2016-02-19 12:35:36 139654577776384 [ERROR] WSREP: Failed to flush tables: -1 (Unknown error -1)
      Feb 19 12:35:36 ip-172-31-22-174 mysqld: 2016-02-19 12:35:36 139655555049216 [Warning] WSREP: 1.0 (): State transfer to 0.0 () failed: -78 (Remote address changed)
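
The failure pattern above can be checked for mechanically. The following is a rough sketch: the function name count_flush_failures is hypothetical, and it only looks for the exact "Failed to flush and lock tables" message quoted in this report. It reads log text from stdin, so it does not assume where the log is stored.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count donor-side FTWRL failures in log text on stdin.
# It matches the error message quoted in this report and prints the count.
count_flush_failures() {
  # grep -c prints 0 but exits non-zero when nothing matches,
  # so "|| true" keeps the function usable under "set -e".
  grep -c 'WSREP: Failed to flush and lock tables' || true
}
```

For example, if the donor's messages end up in a file, `count_flush_failures < /path/to/log` (the path is a placeholder) prints the number of failed flush attempts; a non-zero count is a hint to look for leftover wsrep_sst_rsync processes.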

      The wsrep_sst_rsync process does not exit. Each additional SST attempt leaves another leftover process behind:

      $ ps -elf | grep "wsrep_sst_rsync" | wc -l
      5
      $ ps -elf | grep "wsrep_sst_rsync"
      0 S mysql     2210     1  0  80   0 - 28837 wait   12:35 ?        00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role donor --address 172.31.19.192:4444/rsync_sst --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/  --gtid 474f3f92-d723-11e5-8da3-d3e87bd5db9a:1 --gtid-domain-id 0
      0 S mysql     2309     1  0  80   0 - 28837 wait   12:35 ?        00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role donor --address 172.31.19.192:4444/rsync_sst --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/  --gtid 474f3f92-d723-11e5-8da3-d3e87bd5db9a:1 --gtid-domain-id 0
      0 S mysql     2487     1  0  80   0 - 28837 wait   12:35 ?        00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role donor --address 172.31.19.192:4444/rsync_sst --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/  --gtid 474f3f92-d723-11e5-8da3-d3e87bd5db9a:1 --gtid-domain-id 0
      0 S mysql     2747     1  0  80   0 - 28837 wait   12:35 ?        00:00:00 /bin/bash -ue /usr//bin/wsrep_sst_rsync --role donor --address 172.31.19.192:4444/rsync_sst --socket /var/lib/mysql/mysql.sock --datadir /var/lib/mysql/  --gtid 474f3f92-d723-11e5-8da3-d3e87bd5db9a:1 --gtid-domain-id 0
      0 R ec2-user 14138  1915  0  80   0 - 28160 -      12:40 pts/0    00:00:00 grep --color=auto wsrep_sst_rsync
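
Note that the count of 5 appears to include the grep process itself, so the listing shows four leftover donor scripts. A small filter can count only the scripts; this is a sketch, and the function name count_donor_scripts is hypothetical. It reads `ps` output from stdin and matches only lines that invoke the script in the donor role.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count wsrep_sst_rsync donor scripts in ps output (stdin).
count_donor_scripts() {
  # The [w] bracket keeps this awk process's own command line from matching
  # when the input comes from a live "ps". Lines from a plain
  # "grep wsrep_sst_rsync" do not contain "--role donor", so they are skipped.
  awk '/[w]srep_sst_rsync --role donor/ { n++ } END { print n + 0 }'
}
```

Usage: `ps -elf | count_donor_scripts`. Applied to the listing above, it would print 4.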

          People

            nirbhay_c Nirbhay Choubey (Inactive)
            GeoffMontee Geoff Montee (Inactive)