[MDEV-13379] subsequent OPTIMIZEs on the same table may cause complete table deadlock - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Won't Fix
Affects Version/s: 10.2(EOL)
Fix Version/s: N/A
Component/s: Galera
Labels:
- need_verification

Description

I get complete table deadlock with simple test case like below:

1. setup default two-node galera cluster on the same host, e.g. on ports 3314 and 3315
2. Create table:
create table test.t select 1;
3. Start following concurrent load:

( for i1 in {1..100000} ; do mysql -h127.0.0.1 -P3314 -e "insert into t select $i1" test || : ; done ) &> /dev/null &

( for i1 in {1..100000} ; do mysql -h127.0.0.1 -P3315 -e "insert into t select $i1" test || : ; done ) &> /dev/null &

( for k1 in {1..1000}; do sleep 10; mysql -h127.0.0.1 -P3314 -e "optimize table t" test & wait ; done ) &>/dev/null &

( for k1 in {1..1000}; do sleep 10; mysql -h127.0.0.1 -P3315 -e "optimize table t" test & wait ; done ) &>/dev/null &

while true ; do

mysql -h127.0.0.1 -P3314 -e 'show processlist'

mysql -h127.0.0.1 -P3315 -e 'show processlist'

mysql -h127.0.0.1 -P3314 -e 'select count(*) from t' test

mysql -h127.0.0.1 -P3315 -e 'select count(*) from t' test

sleep 2

done

The cluster usually hangs up within 1 min, this is what observed:

(11:17 sql) > mysql -h 127.0.0.1 -P 3314 -uroot  -e 'show processlist'

+-----+-------------+-----------------+------+---------+------+-------------------------+--------------------------+----------+

| Id  | User        | Host            | db   | Command | Time | State                   | Info                     | Progress |

+-----+-------------+-----------------+------+---------+------+-------------------------+--------------------------+----------+

|   1 | system user |                 | test | Sleep   |  748 | Waiting for table flush | optimize table t         |    0.000 |

|   2 | system user |                 | NULL | Sleep   |  815 | wsrep aborter idle      | NULL                     |    0.000 |

| 518 | a           | localhost:37078 | test | Query   |  748 | query end               | insert into t select 487 |    0.000 |

| 519 | a           | localhost:37080 | test | Query   |  748 | Waiting for table flush | select count(*) from t   |    0.000 |

| 520 | a           | localhost:37122 | test | Query   |  738 | checking permissions    | optimize table t         |    0.000 |

| 526 | root        | localhost:38382 | NULL | Query   |    0 | init                    | show processlist         |    0.000 |

+-----+-------------+-----------------+------+---------+------+-------------------------+--------------------------+----------+

(11:17 sql) > mysql -h 127.0.0.1 -P 3315 -uroot -e 'show processlist'

+-----+-------------+-----------------+------+---------+------+----------------------+--------------------------+----------+

| Id  | User        | Host            | db   | Command | Time | State                | Info                     | Progress |

+-----+-------------+-----------------+------+---------+------+----------------------+--------------------------+----------+

|   1 | system user |                 | NULL | Sleep   |  799 | wsrep aborter idle   | NULL                     |    0.000 |

|   2 | system user |                 | NULL | Sleep   |  748 | committed 992        | NULL                     |    0.000 |

| 517 | a           | localhost:46558 | test | Query   |  748 | query end            | insert into t select 505 |    0.000 |

| 518 | a           | localhost:46562 | test | Query   |  738 | checking permissions | optimize table t         |    0.000 |

| 524 | root        | localhost:47822 | NULL | Query   |    0 | init                 | show processlist         |    0.000 |

+-----+-------------+-----------------+------+---------+------+----------------------+--------------------------+----------+

It is kid of strange to me that I cannot reproduce the problem with alter command below instead of OPTIMIZE:
alter table t engine=innodb, algorithm=copy

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

gdb.txt
133 kB
2017-07-25 10:08

Issue Links

relates to

MDEV-10245 Two concurrent OPTIMIZE on InnoDB table can cause a long timed wait

Closed

MDEV-13316 Unresolved conflict between OPTIMIZE TABLE and DML transaction, causing cluster hang

Closed

Activity

Ascending order - Click to sort in descending order

Elena Stepanova added a comment - 2017-07-25 10:22

See also ~~MDEV-10245~~ , which is also about concurrent OPTIMIZE causing a deadlock. Stack traces are different, though.

Elena Stepanova added a comment - 2017-07-25 10:22 See also MDEV-10245 , which is also about concurrent OPTIMIZE causing a deadlock. Stack traces are different, though.

Andrii Nikitin (Inactive) added a comment - 2017-07-25 10:40 - edited

smaller lock_wait_timeout isn't really helping as well: with lock_wait_timeout=30 client shows timeout error trying to read table, but other locked connections are not affected:

+ cluster1/sql.sh 'select count(*) from t'

m8 :944

m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction

+ sleep 2

+ true

+ m8-10.1.25/sql.sh show processlist

1       system user             NULL    Sleep   128     committed 928   NULL    0.000

2       system user             NULL    Sleep   184     wsrep aborter idle      NULL    0.000

505     root    localhost       test    Query   127     query end       insert into t select 485        0.000

506     root    localhost       test    Query   117     checking permissions    optimize table t        0.000

513     root    localhost       test    Query   0       init    show processlist        0.000

+ m9-10.1.25/sql.sh show processlist

1       system user             test    Sleep   128     Waiting for table flush optimize table t        0.000

2       system user             NULL    Sleep   168     wsrep aborter idle      NULL    0.000

487     root    localhost       test    Query   128     query end       insert into t select 465        0.000

490     root    localhost       test    Query   118     checking permissions    optimize table t        0.000

499     root    localhost       test    Query   0       init    show processlist        0.000

+ cluster1/sql.sh 'select count(*) from t'

m8 :944

m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction

+ sleep 2

+ true

+ m8-10.1.25/sql.sh show processlist

1       system user             NULL    Sleep   160     committed 928   NULL    0.000

2       system user             NULL    Sleep   216     wsrep aborter idle      NULL    0.000

505     root    localhost       test    Query   159     query end       insert into t select 485        0.000

506     root    localhost       test    Query   149     checking permissions    optimize table t        0.000

515     root    localhost       test    Query   0       init    show processlist        0.000

+ m9-10.1.25/sql.sh show processlist

1       system user             test    Sleep   160     Waiting for table flush optimize table t        0.000

2       system user             NULL    Sleep   200     wsrep aborter idle      NULL    0.000

487     root    localhost       test    Query   160     query end       insert into t select 465        0.000

490     root    localhost       test    Query   150     checking permissions    optimize table t        0.000

501     root    localhost       test    Query   0       init    show processlist        0.000

+ cluster1/sql.sh 'select count(*) from t'

m8 :944

m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction

+ sleep 2

+ true

+ m8-10.1.25/sql.sh show processlist

1       system user             NULL    Sleep   192     committed 928   NULL    0.000

2       system user             NULL    Sleep   248     wsrep aborter idle      NULL    0.000

505     root    localhost       test    Query   191     query end       insert into t select 485        0.000

506     root    localhost       test    Query   181     checking permissions    optimize table t        0.000

517     root    localhost       test    Query   0       init    show processlist        0.000

+ m9-10.1.25/sql.sh show processlist

1       system user             test    Sleep   192     Waiting for table flush optimize table t        0.000

2       system user             NULL    Sleep   232     wsrep aborter idle      NULL    0.000

487     root    localhost       test    Query   192     query end       insert into t select 465        0.000

490     root    localhost       test    Query   182     checking permissions    optimize table t        0.000

503     root    localhost       test    Query   0       init    show processlist        0.000

+ cluster1/sql.sh 'select count(*) from t'

m8 :944

m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction

+ sleep 2

+ true

+ m8-10.1.25/sql.sh show processlist

1       system user             NULL    Sleep   224     committed 928   NULL    0.000

2       system user             NULL    Sleep   280     wsrep aborter idle      NULL    0.000

505     root    localhost       test    Query   223     query end       insert into t select 485        0.000

506     root    localhost       test    Query   213     checking permissions    optimize table t        0.000

519     root    localhost       test    Query   0       init    show processlist        0.000

+ m9-10.1.25/sql.sh show processlist

1       system user             test    Sleep   224     Waiting for table flush optimize table t        0.000

2       system user             NULL    Sleep   264     wsrep aborter idle      NULL    0.000

487     root    localhost       test    Query   224     query end       insert into t select 465        0.000

490     root    localhost       test    Query   214     checking permissions    optimize table t        0.000

505     root    localhost       test    Query   0       init    show processlist        0.000

Andrii Nikitin (Inactive) added a comment - 2017-07-25 10:40 - edited smaller lock_wait_timeout isn't really helping as well: with lock_wait_timeout=30 client shows timeout error trying to read table, but other locked connections are not affected: + cluster1/sql.sh 'select count(*) from t' m8 :944 m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction + sleep 2 + true + m8-10.1.25/sql.sh show processlist 1 system user NULL Sleep 128 committed 928 NULL 0.000 2 system user NULL Sleep 184 wsrep aborter idle NULL 0.000 505 root localhost test Query 127 query end insert into t select 485 0.000 506 root localhost test Query 117 checking permissions optimize table t 0.000 513 root localhost test Query 0 init show processlist 0.000 + m9-10.1.25/sql.sh show processlist 1 system user test Sleep 128 Waiting for table flush optimize table t 0.000 2 system user NULL Sleep 168 wsrep aborter idle NULL 0.000 487 root localhost test Query 128 query end insert into t select 465 0.000 490 root localhost test Query 118 checking permissions optimize table t 0.000 499 root localhost test Query 0 init show processlist 0.000 + cluster1/sql.sh 'select count(*) from t' m8 :944 m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction + sleep 2 + true + m8-10.1.25/sql.sh show processlist 1 system user NULL Sleep 160 committed 928 NULL 0.000 2 system user NULL Sleep 216 wsrep aborter idle NULL 0.000 505 root localhost test Query 159 query end insert into t select 485 0.000 506 root localhost test Query 149 checking permissions optimize table t 0.000 515 root localhost test Query 0 init show processlist 0.000 + m9-10.1.25/sql.sh show processlist 1 system user test Sleep 160 Waiting for table flush optimize table t 0.000 2 system user NULL Sleep 200 wsrep aborter idle NULL 0.000 487 root localhost test Query 160 query end insert into t select 465 0.000 490 root localhost test Query 150 checking permissions optimize table t 0.000 501 root localhost test Query 0 init show processlist 0.000 + cluster1/sql.sh 'select count(*) from t' m8 :944 m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction + sleep 2 + true + m8-10.1.25/sql.sh show processlist 1 system user NULL Sleep 192 committed 928 NULL 0.000 2 system user NULL Sleep 248 wsrep aborter idle NULL 0.000 505 root localhost test Query 191 query end insert into t select 485 0.000 506 root localhost test Query 181 checking permissions optimize table t 0.000 517 root localhost test Query 0 init show processlist 0.000 + m9-10.1.25/sql.sh show processlist 1 system user test Sleep 192 Waiting for table flush optimize table t 0.000 2 system user NULL Sleep 232 wsrep aborter idle NULL 0.000 487 root localhost test Query 192 query end insert into t select 465 0.000 490 root localhost test Query 182 checking permissions optimize table t 0.000 503 root localhost test Query 0 init show processlist 0.000 + cluster1/sql.sh 'select count(*) from t' m8 :944 m9 :ERROR 1205 (HY000) at line 1: Lock wait timeout exceeded; try restarting transaction + sleep 2 + true + m8-10.1.25/sql.sh show processlist 1 system user NULL Sleep 224 committed 928 NULL 0.000 2 system user NULL Sleep 280 wsrep aborter idle NULL 0.000 505 root localhost test Query 223 query end insert into t select 485 0.000 506 root localhost test Query 213 checking permissions optimize table t 0.000 519 root localhost test Query 0 init show processlist 0.000 + m9-10.1.25/sql.sh show processlist 1 system user test Sleep 224 Waiting for table flush optimize table t 0.000 2 system user NULL Sleep 264 wsrep aborter idle NULL 0.000 487 root localhost test Query 224 query end insert into t select 465 0.000 490 root localhost test Query 214 checking permissions optimize table t 0.000 505 root localhost test Query 0 init show processlist 0.000

Jan Lindström added a comment - 2023-06-06 09:15

10.2 is EOL.

Jan Lindström added a comment - 2023-06-06 09:15 10.2 is EOL.

MariaDB Server

subsequent OPTIMIZEs on the same table may cause complete table deadlock

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration