Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7808

FLUSH TABLES WITH READ LOCK has strange locking behavior

    XMLWordPrintable

Details

    Description

      It looks like if wsrep_OSU_method set to TOI, DDL is transmitted to other nodes even when the provider is paused, and the DDL on the originating node shouldn't run the DDL because it is supposedly waiting on a lock due to FLUSH TABLES WITH READ LOCK. The DDL runs on the originating node anyway. The DDL runs on all nodes except the originating node, if there's some DML running on that node to block it.

      According to the logs, it also looks like DDL may be getting sent to the other nodes more than once.

      DDL + no DML

      I have 3 Galera nodes in a cluster.

      On all 3 nodes, let's execute:

      SET GLOBAL wsrep_debug=ON;

      First, let's do some setup on one of the nodes:

      CREATE TABLE `tmp` (
        `a` int(11) NOT NULL,
        `b` varchar(20) DEFAULT NULL,
        PRIMARY KEY (`a`)
      );

      Now let's open up two MySQL shells on the same node.

      Execute this on the first shell:

      MariaDB [tmp]> FLUSH TABLES WITH READ LOCK;
      Query OK, 0 rows affected (0.14 sec)

      The log on this node shows that the provider has been paused:

      150320 11:43:04 [Note] WSREP: Provider paused at 5edc131e-ae0b-11e4-85d1-c2c0512e9e7e:2896 (495)

      In the second shell, let's try to drop the table:

      MariaDB [tmp]> DROP TABLE tmp;

      This hangs:

      MariaDB [tmp]> SHOW PROCESSLIST;
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      | Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      |  1 | system user |           | NULL | Sleep   | 1744 | NULL                 | NULL             |    0.000 |
      |  2 | system user |           | NULL | Sleep   | 1744 | wsrep aborter idle   | NULL             |    0.000 |
      |  8 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |
      | 18 | root        | localhost | tmp  | Query   |   14 | checking permissions | DROP TABLE tmp   |    0.000 |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      4 rows in set (0.00 sec)

      Let's attempt to kill the query once:

      MariaDB [tmp]> DROP TABLE tmp;
      ^CCtrl-C -- query killed. Continuing normally.

      The query is now stuck in the "killed" state:

      MariaDB [tmp]> SHOW PROCESSLIST;
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      | Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      |  1 | system user |           | NULL | Sleep   | 1815 | NULL                 | NULL             |    0.000 |
      |  2 | system user |           | NULL | Sleep   | 1815 | wsrep aborter idle   | NULL             |    0.000 |
      |  8 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |
      | 18 | root        | localhost | tmp  | Killed  |   85 | checking permissions | DROP TABLE tmp   |    0.000 |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      4 rows in set (0.00 sec)

      The table is already missing from all nodes:

      MariaDB [tmp]> SHOW TABLES;
      Empty set (0.00 sec)

      Logs on the other nodes suggest that the DROP TABLE statement may have been sent to other nodes twice, since we get an error saying the table doesn't exist:

      150320 13:51:39 [ERROR] Slave SQL: Error 'Unknown table 'tmp'' on query. Default database: 'tmp'. Query: 'DROP TABLE tmp', Error_code: 1051
      150320 13:51:39 [Warning] WSREP: RBR event 1 Query apply warning: 1, 5068
      150320 13:51:39 [Warning] WSREP: Ignoring error for TO isolated action: source: bafbdc15-cf25-11e4-9ace-2fc1846ad321 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 18 trx_id: -1 seqnos (l: 5111, g: 5068, s: 5067, d: 5067, ts: 184510656454187)

      If we try to cancel the query again, we lose connection to the server:

      MariaDB [tmp]> DROP TABLE tmp;
      ^CCtrl-C -- query killed. Continuing normally.
      ^CCtrl-C -- query killed. Continuing normally.
      ERROR 2013 (HY000): Lost connection to MySQL server during query

      However, the node is still up, and the query is stuck in the "killed" state:

      MariaDB [tmp]> SHOW PROCESSLIST;
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      | Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      |  1 | system user |           | NULL | Sleep   |  620 | NULL                 | NULL             |    0.000 |
      |  2 | system user |           | NULL | Sleep   |  620 | wsrep aborter idle   | NULL             |    0.000 |
      |  4 | root        | localhost | tmp  | Killed  |  174 | checking permissions | DROP TABLE tmp   |    0.000 |
      |  5 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |
      |  6 | root        | localhost | tmp  | Sleep   |  249 |                      | NULL             |    0.000 |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      5 rows in set (0.00 sec)

      The node is still in the cluster:

      MariaDB [tmp]> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
      +--------------------+-------+
      | Variable_name      | Value |
      +--------------------+-------+
      | wsrep_cluster_size | 3     |
      +--------------------+-------+
      1 row in set (0.00 sec)

      As soon as we UNLOCK TABLES on the original node, the query is killed.

      MariaDB [tmp]> UNLOCK TABLES;
      Query OK, 0 rows affected (0.00 sec)
       
      MariaDB [tmp]> SHOW PROCESSLIST;
      +----+-------------+-----------+------+---------+------+--------------------+------------------+----------+
      | Id | User        | Host      | db   | Command | Time | State              | Info             | Progress |
      +----+-------------+-----------+------+---------+------+--------------------+------------------+----------+
      |  1 | system user |           | NULL | Sleep   | 2305 | NULL               | NULL             |    0.000 |
      |  2 | system user |           | NULL | Sleep   | 2305 | wsrep aborter idle | NULL             |    0.000 |
      |  8 | root        | localhost | tmp  | Query   |    0 | sleeping           | SHOW PROCESSLIST |    0.000 |
      | 18 | root        | localhost | tmp  | Sleep   |    1 |                    | NULL             |    0.000 |
      +----+-------------+-----------+------+---------+------+--------------------+------------------+----------+
      4 rows in set (0.00 sec)

      DDL + DML

      Let's add some DML to the mix.

      Some setup:

      CREATE TABLE `tmp` (
        `a` int(11) NOT NULL,
        `b` varchar(20) DEFAULT NULL,
        PRIMARY KEY (`a`)
      );
      DELIMITER $$
      CREATE PROCEDURE insert_test_data()
      BEGIN
        DECLARE i INT DEFAULT 1;
       
        WHILE i < 1000000 DO
          INSERT INTO `tmp` (`a`, `b`)
          VALUES (i, i);
          SET i = i + 1;
        END WHILE;
      END$$
      DELIMITER ;

      Now let's have three shells on the originating node:

      On the first shell, execute:

      CALL insert_test_data();

      On the second shell, execute:

      MariaDB [tmp]> FLUSH TABLES WITH READ LOCK;
      Query OK, 0 rows affected (0.00 sec)

      On the third shell, try to drop the table:

      MariaDB [tmp]> DROP TABLE tmp;

      Process list looks as expected:

      MariaDB [tmp]> SHOW PROCESSLIST;
      +----+-------------+-----------+------+---------+------+------------------------------+----------------------------------------------------------------------------------------+----------+
      | Id | User        | Host      | db   | Command | Time | State                        | Info                                                                                   | Progress |
      +----+-------------+-----------+------+---------+------+------------------------------+----------------------------------------------------------------------------------------+----------+
      |  1 | system user |           | NULL | Sleep   | 2648 | NULL                         | NULL                                                                                   |    0.000 |
      |  2 | system user |           | NULL | Sleep   | 2648 | wsrep aborter idle           | NULL                                                                                   |    0.000 |
      |  8 | root        | localhost | tmp  | Query   |    0 | sleeping                     | SHOW PROCESSLIST                                                                       |    0.000 |
      | 18 | root        | localhost | tmp  | Query   |   59 | Waiting for global read lock | INSERT INTO `tmp` (`a`, `b`)
          VALUES ( NAME_CONST('i',6601),  NAME_CONST('i',6601)) |    0.000 |
      | 20 | root        | localhost | tmp  | Query   |   20 | checking permissions         | DROP TABLE tmp                                                                         |    0.000 |
      +----+-------------+-----------+------+---------+------+------------------------------+----------------------------------------------------------------------------------------+----------+
      5 rows in set (0.00 sec)

      On other nodes, the table is gone:

      MariaDB [tmp]> SHOW TABLES;
      Empty set (0.00 sec)

      On the originating node, the table still exists:

      MariaDB [tmp]> SHOW TABLES;
      +---------------+
      | Tables_in_tmp |
      +---------------+
      | tmp           |
      +---------------+
      1 row in set (0.00 sec)

      Let's cancel the DML:

      MariaDB [tmp]> CALL insert_test_data();
      ^CCtrl-C -- query killed. Continuing normally.
      ERROR 1317 (70100): Query execution was interrupted

      The table still exists on the originating node and the query is dead:

      MariaDB [tmp]> SHOW TABLES;
      +---------------+
      | Tables_in_tmp |
      +---------------+
      | tmp           |
      +---------------+
      1 row in set (0.00 sec)
       
      MariaDB [tmp]> SHOW PROCESSLIST;
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      | Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      |  1 | system user |           | NULL | Sleep   | 2768 | NULL                 | NULL             |    0.000 |
      |  2 | system user |           | NULL | Sleep   | 2768 | wsrep aborter idle   | NULL             |    0.000 |
      |  8 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |
      | 18 | root        | localhost | tmp  | Sleep   |   13 |                      | NULL             |    0.000 |
      | 20 | root        | localhost | tmp  | Query   |  140 | checking permissions | DROP TABLE tmp   |    0.000 |
      +----+-------------+-----------+------+---------+------+----------------------+------------------+----------+
      5 rows in set (0.00 sec)

      It goes away when we UNLOCK TABLES:

      MariaDB [tmp]> UNLOCK TABLES;
      Query OK, 0 rows affected (0.00 sec)
       
      MariaDB [tmp]> SHOW TABLES;
      Empty set (0.00 sec)

      Again, the other nodes have an error about the missing table, suggesting it may have been executed twice:

      150320 14:04:56 [Warning] WSREP: RBR event 1 Query apply warning: 1, 5070
      150320 14:04:56 [Warning] WSREP: Ignoring error for TO isolated action: source: bafbdc15-cf25-11e4-9ace-2fc1846ad321 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 18 trx_id: -1 seqnos (l: 5113, g: 5070, s: 5069, d: 5069, ts: 185308382405439)

      Attachments

        Activity

          People

            ramesh Ramesh Sivaraman
            GeoffMontee Geoff Montee (Inactive)
            Votes:
            3 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.