[MDEV-7808] FLUSH TABLES WITH READ LOCK has strange locking behavior - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Won't Fix
Affects Version/s: 5.5.42-galera, 10.0.17-galera
Fix Version/s: N/A
Component/s: Galera, wsrep
Labels:
- galera
- need_verification

Description

It looks like if wsrep_OSU_method set to TOI, DDL is transmitted to other nodes even when the provider is paused, and the DDL on the originating node shouldn't run the DDL because it is supposedly waiting on a lock due to FLUSH TABLES WITH READ LOCK. The DDL runs on the originating node anyway. The DDL runs on all nodes except the originating node, if there's some DML running on that node to block it.

According to the logs, it also looks like DDL may be getting sent to the other nodes more than once.

DDL + no DML

I have 3 Galera nodes in a cluster.

On all 3 nodes, let's execute:

SET GLOBAL wsrep_debug=ON;

First, let's do some setup on one of the nodes:

CREATE TABLE `tmp` (

  `a` int(11) NOT NULL,

  `b` varchar(20) DEFAULT NULL,

  PRIMARY KEY (`a`)

);

Now let's open up two MySQL shells on the same node.

Execute this on the first shell:

MariaDB [tmp]> FLUSH TABLES WITH READ LOCK;

Query OK, 0 rows affected (0.14 sec)

The log on this node shows that the provider has been paused:

150320 11:43:04 [Note] WSREP: Provider paused at 5edc131e-ae0b-11e4-85d1-c2c0512e9e7e:2896 (495)

In the second shell, let's try to drop the table:

MariaDB [tmp]> DROP TABLE tmp;

This hangs:

MariaDB [tmp]> SHOW PROCESSLIST;

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

| Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

|  1 | system user |           | NULL | Sleep   | 1744 | NULL                 | NULL             |    0.000 |

|  2 | system user |           | NULL | Sleep   | 1744 | wsrep aborter idle   | NULL             |    0.000 |

|  8 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |

| 18 | root        | localhost | tmp  | Query   |   14 | checking permissions | DROP TABLE tmp   |    0.000 |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

4 rows in set (0.00 sec)

Let's attempt to kill the query once:

MariaDB [tmp]> DROP TABLE tmp;

^CCtrl-C -- query killed. Continuing normally.

The query is now stuck in the "killed" state:

MariaDB [tmp]> SHOW PROCESSLIST;

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

| Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

|  1 | system user |           | NULL | Sleep   | 1815 | NULL                 | NULL             |    0.000 |

|  2 | system user |           | NULL | Sleep   | 1815 | wsrep aborter idle   | NULL             |    0.000 |

|  8 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |

| 18 | root        | localhost | tmp  | Killed  |   85 | checking permissions | DROP TABLE tmp   |    0.000 |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

4 rows in set (0.00 sec)

The table is already missing from all nodes:

MariaDB [tmp]> SHOW TABLES;

Empty set (0.00 sec)

Logs on the other nodes suggest that the DROP TABLE statement may have been sent to other nodes twice, since we get an error saying the table doesn't exist:

150320 13:51:39 [ERROR] Slave SQL: Error 'Unknown table 'tmp'' on query. Default database: 'tmp'. Query: 'DROP TABLE tmp', Error_code: 1051

150320 13:51:39 [Warning] WSREP: RBR event 1 Query apply warning: 1, 5068

150320 13:51:39 [Warning] WSREP: Ignoring error for TO isolated action: source: bafbdc15-cf25-11e4-9ace-2fc1846ad321 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 18 trx_id: -1 seqnos (l: 5111, g: 5068, s: 5067, d: 5067, ts: 184510656454187)

If we try to cancel the query again, we lose connection to the server:

MariaDB [tmp]> DROP TABLE tmp;

^CCtrl-C -- query killed. Continuing normally.

^CCtrl-C -- query killed. Continuing normally.

ERROR 2013 (HY000): Lost connection to MySQL server during query

However, the node is still up, and the query is stuck in the "killed" state:

MariaDB [tmp]> SHOW PROCESSLIST;

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

| Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

|  1 | system user |           | NULL | Sleep   |  620 | NULL                 | NULL             |    0.000 |

|  2 | system user |           | NULL | Sleep   |  620 | wsrep aborter idle   | NULL             |    0.000 |

|  4 | root        | localhost | tmp  | Killed  |  174 | checking permissions | DROP TABLE tmp   |    0.000 |

|  5 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |

|  6 | root        | localhost | tmp  | Sleep   |  249 |                      | NULL             |    0.000 |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

5 rows in set (0.00 sec)

The node is still in the cluster:

MariaDB [tmp]> SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';

+--------------------+-------+

| Variable_name      | Value |

+--------------------+-------+

| wsrep_cluster_size | 3     |

+--------------------+-------+

1 row in set (0.00 sec)

As soon as we UNLOCK TABLES on the original node, the query is killed.

MariaDB [tmp]> UNLOCK TABLES;

Query OK, 0 rows affected (0.00 sec)

MariaDB [tmp]> SHOW PROCESSLIST;

+----+-------------+-----------+------+---------+------+--------------------+------------------+----------+

| Id | User        | Host      | db   | Command | Time | State              | Info             | Progress |

+----+-------------+-----------+------+---------+------+--------------------+------------------+----------+

|  1 | system user |           | NULL | Sleep   | 2305 | NULL               | NULL             |    0.000 |

|  2 | system user |           | NULL | Sleep   | 2305 | wsrep aborter idle | NULL             |    0.000 |

|  8 | root        | localhost | tmp  | Query   |    0 | sleeping           | SHOW PROCESSLIST |    0.000 |

| 18 | root        | localhost | tmp  | Sleep   |    1 |                    | NULL             |    0.000 |

+----+-------------+-----------+------+---------+------+--------------------+------------------+----------+

4 rows in set (0.00 sec)

DDL + DML

Let's add some DML to the mix.

Some setup:

CREATE TABLE `tmp` (

  `a` int(11) NOT NULL,

  `b` varchar(20) DEFAULT NULL,

  PRIMARY KEY (`a`)

);

DELIMITER $$

CREATE PROCEDURE insert_test_data()

BEGIN

  DECLARE i INT DEFAULT 1;

  WHILE i < 1000000 DO

    INSERT INTO `tmp` (`a`, `b`)

    VALUES (i, i);

    SET i = i + 1;

  END WHILE;

END$$

DELIMITER ;

Now let's have three shells on the originating node:

On the first shell, execute:

CALL insert_test_data();

On the second shell, execute:

MariaDB [tmp]> FLUSH TABLES WITH READ LOCK;

Query OK, 0 rows affected (0.00 sec)

On the third shell, try to drop the table:

MariaDB [tmp]> DROP TABLE tmp;

Process list looks as expected:

MariaDB [tmp]> SHOW PROCESSLIST;

+----+-------------+-----------+------+---------+------+------------------------------+----------------------------------------------------------------------------------------+----------+

| Id | User        | Host      | db   | Command | Time | State                        | Info                                                                                   | Progress |

+----+-------------+-----------+------+---------+------+------------------------------+----------------------------------------------------------------------------------------+----------+

|  1 | system user |           | NULL | Sleep   | 2648 | NULL                         | NULL                                                                                   |    0.000 |

|  2 | system user |           | NULL | Sleep   | 2648 | wsrep aborter idle           | NULL                                                                                   |    0.000 |

|  8 | root        | localhost | tmp  | Query   |    0 | sleeping                     | SHOW PROCESSLIST                                                                       |    0.000 |

| 18 | root        | localhost | tmp  | Query   |   59 | Waiting for global read lock | INSERT INTO `tmp` (`a`, `b`)

    VALUES ( NAME_CONST('i',6601),  NAME_CONST('i',6601)) |    0.000 |

| 20 | root        | localhost | tmp  | Query   |   20 | checking permissions         | DROP TABLE tmp                                                                         |    0.000 |

+----+-------------+-----------+------+---------+------+------------------------------+----------------------------------------------------------------------------------------+----------+

5 rows in set (0.00 sec)

On other nodes, the table is gone:

MariaDB [tmp]> SHOW TABLES;

Empty set (0.00 sec)

On the originating node, the table still exists:

MariaDB [tmp]> SHOW TABLES;

+---------------+

| Tables_in_tmp |

+---------------+

| tmp           |

+---------------+

1 row in set (0.00 sec)

Let's cancel the DML:

MariaDB [tmp]> CALL insert_test_data();

^CCtrl-C -- query killed. Continuing normally.

ERROR 1317 (70100): Query execution was interrupted

The table still exists on the originating node and the query is dead:

MariaDB [tmp]> SHOW TABLES;

+---------------+

| Tables_in_tmp |

+---------------+

| tmp           |

+---------------+

1 row in set (0.00 sec)

MariaDB [tmp]> SHOW PROCESSLIST;

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

| Id | User        | Host      | db   | Command | Time | State                | Info             | Progress |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

|  1 | system user |           | NULL | Sleep   | 2768 | NULL                 | NULL             |    0.000 |

|  2 | system user |           | NULL | Sleep   | 2768 | wsrep aborter idle   | NULL             |    0.000 |

|  8 | root        | localhost | tmp  | Query   |    0 | sleeping             | SHOW PROCESSLIST |    0.000 |

| 18 | root        | localhost | tmp  | Sleep   |   13 |                      | NULL             |    0.000 |

| 20 | root        | localhost | tmp  | Query   |  140 | checking permissions | DROP TABLE tmp   |    0.000 |

+----+-------------+-----------+------+---------+------+----------------------+------------------+----------+

5 rows in set (0.00 sec)

It goes away when we UNLOCK TABLES:

MariaDB [tmp]> UNLOCK TABLES;

Query OK, 0 rows affected (0.00 sec)

MariaDB [tmp]> SHOW TABLES;

Empty set (0.00 sec)

Again, the other nodes have an error about the missing table, suggesting it may have been executed twice:

150320 14:04:56 [Warning] WSREP: RBR event 1 Query apply warning: 1, 5070

150320 14:04:56 [Warning] WSREP: Ignoring error for TO isolated action: source: bafbdc15-cf25-11e4-9ace-2fc1846ad321 version: 3 local: 0 state: APPLYING flags: 65 conn_id: 18 trx_id: -1 seqnos (l: 5113, g: 5070, s: 5069, d: 5069, ts: 185308382405439)

Attachments

Activity

Ascending order - Click to sort in descending order

Elena Stepanova added a comment - 2015-03-20 18:07

I remember that galera used to not work with explicit table locks, not sure if it's still the case. Assigning to nirbhay_c.

Elena Stepanova added a comment - 2015-03-20 18:07 I remember that galera used to not work with explicit table locks, not sure if it's still the case. Assigning to nirbhay_c .

Jan Lindström added a comment - 2023-06-06 09:16

Both versions are EOL.

Jan Lindström added a comment - 2023-06-06 09:16 Both versions are EOL.

People

Assignee:: Ramesh Sivaraman

Reporter:: Geoff Montee (Inactive)

Votes:: 3 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 2015-03-20 17:56

Updated:: 2023-06-06 09:16

Resolved:: 2023-06-06 09:16

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server