[MDEV-10104] Table lock race condition with replication - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.24
Fix Version/s: 10.0.31
Component/s: Storage Engine - Aria
Labels:
None
Environment:
Debian Linux

Description

We have a MariaDB server that sometimes seems to get an indefinite table lock. The only way to solve it is to kill MariaDB and restart it. It "randomly happens", so I suspect it may be a race condition. The replication thread get's stuck in state "After opening tables". The lock seems to be caused by a query on an unrelated database with 3 million MyISAM tables (not receiving data via replication) in state "Opening tables". These databases are completely separated, so they shouldn't cause a lock between each other.

Processlist shows:

root@server ~ # mysql -e "show full processlist"

+----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+

| Id       | User                 | Host                  | db                   | Command | Time    | State                            | Info                                                  | Progress |

+----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+

|        2 | system user          |                       | NULL                 | Connect | 1116538 | Waiting for master to send event | NULL                                                  |    0.000 |

|        3 | system user          |                       | replication_database               | Connect |  264121 | After opening tables             | COMMIT                                                |    0.000 |

| 37332419 | user1                 | localhost             | large_database          | Killed  |  264121 | Opening tables                   | DELETE FROM `my_domain` WHERE date<'2016-04-29' |    0.000 |

| 48692447 | root                 | localhost             | NULL                 | Query   |       0 | init                             | show full processlist                                 |    0.000 |

+----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+

root@server ~ #

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

global_status.txt
2016-12-01 10:46
26 kB
Dreas van Donselaar
strace.txt
2017-03-01 11:32
608 kB
Dreas van Donselaar
strace_170419_1.txt
2017-04-19 12:04
71 kB
Dreas van Donselaar
strace_170419_2.txt
2017-04-19 12:04
68 kB
Dreas van Donselaar
strace_170425.txt
2017-04-25 10:54
605 kB
Dreas van Donselaar
patch
2017-05-05 12:44
6 kB
Michael Widenius

Activity

Ascending order - Click to sort in descending order

View 11 older comments

Dreas van Donselaar added a comment - 2017-04-25 11:13

New backtrace attached. Unfortunately mysqld started "hanging" shortly afterwards and became unresponsive to "kill -6". In the end I had to force kill it with "kill -9" which didn't produce a core dump. I did get a core dump from another slave that had the exact same issue (same position), however not running with the special binary. Let me know if that should be uploaded, hopefully the new backtrace provides sufficient information.

Dreas van Donselaar added a comment - 2017-04-25 11:13 New backtrace attached. Unfortunately mysqld started "hanging" shortly afterwards and became unresponsive to "kill -6". In the end I had to force kill it with "kill -9" which didn't produce a core dump. I did get a core dump from another slave that had the exact same issue (same position), however not running with the special binary. Let me know if that should be uploaded, hopefully the new backtrace provides sufficient information.

Dreas van Donselaar added a comment - 2017-04-26 08:27 - edited

I presume the relevant bit is:

Thread 5 (Thread 0x7f5d1fcb9700 (LWP 19725)):

#0  0x00007f5d3d58e18f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0

#1  0x000055d7a7d935f9 in inline_mysql_cond_wait (src_file=0x55d7a80fee80 "/root/mariadb-10.0.29/storage/maria/ma_pagecache.c", src_line=1991, mutex=0x55d7a90ae0a8 <maria_pagecache_var+200>, that=0x7f5d080008c8) at /root/mariadb-10.0.29/include/mysql/psi/mysql_thread.h:1165

#2  find_block (pagecache=pagecache@entry=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=pageno@entry=1, init_hits_left=init_hits_left@entry=3, wrmode=1 '\001', block_is_copied=1 '\001', reg_req=1 '\001', page_st=0x7f5d1fcb4d10) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:1991

#3  0x000055d7a7d94d2f in pagecache_read (pagecache=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=1, level=level@entry=3, buff=buff@entry=0x7f5d1fcb4e30 "", type=PAGECACHE_LSN_PAGE, lock=PAGECACHE_LOCK_WRITE, page_link=0x7f5d1fcb4dc0) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:3391

#4  0x000055d7a7da9b03 in _ma_fetch_keypage (page=page@entry=0x7f5d1fcb77d0, info=info@entry=0x7f5d083464b8, keyinfo=keyinfo@entry=0x7f5d080113f8, pos=pos@entry=8192, lock=lock@entry=PAGECACHE_LOCK_WRITE, level=level@entry=3, buff=0x7f5d1fcb4e30 "", return_buffer=0 '\000') at /root/mariadb-10.0.29/storage/maria/ma_page.c:109

#5  0x000055d7a7dc4033 in _ma_ck_real_delete (info=info@entry=0x7f5d083464b8, key=key@entry=0x7f5d1fcb7da0, root=root@entry=0x7f5d1fcb7858) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:222

#6  0x000055d7a7dc4339 in _ma_ck_delete (info=info@entry=0x7f5d083464b8, key=0x7f5d1fcb7da0) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:178

#7  0x000055d7a7dc1da0 in maria_delete (info=0x7f5d083464b8, record=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/storage/maria/ma_delete.c:88

#8  0x000055d7a7bc968f in handler::ha_delete_row (this=0x7f5d080d8880, buf=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/sql/handler.cc:6076

#9  0x000055d7a7b6ca38 in rpl_slave_state::record_gtid (this=0x55d7aa384f50, thd=0x7f5d08001278, gtid=gtid@entry=0x7f5d1fcb8610, sub_id=sub_id@entry=461048395, in_transaction=in_transaction@entry=true, in_statement=in_statement@entry=false) at /root/mariadb-10.0.29/sql/rpl_gtid.cc:659

#10 0x000055d7a7c6d868 in Query_log_event::do_apply_event (this=0x7f5d080a9718, rgi=0x7f5d08000990, query_arg=0x7f5d080f4c85 "COMMIT", q_len_arg=6) at /root/mariadb-10.0.29/sql/log_event.cc:4312

#11 0x000055d7a7a20345 in Log_event::apply_event (rgi=0x7f5d08000990, this=0x7f5d080a9718) at /root/mariadb-10.0.29/sql/log_event.h:1360

#12 apply_event_and_update_pos (ev=ev@entry=0x7f5d080a9718, thd=thd@entry=0x7f5d08001278, rgi=rgi@entry=0x7f5d08000990, rpt=rpt@entry=0x0) at /root/mariadb-10.0.29/sql/slave.cc:3300

#13 0x000055d7a7a229aa in exec_relay_log_event (serial_rgi=0x7f5d08000990, rli=0x55d7aa6e9b28, thd=0x7f5d08001278) at /root/mariadb-10.0.29/sql/slave.cc:3630

#14 handle_slave_sql (arg=0x55d7aa6e8160) at /root/mariadb-10.0.29/sql/slave.cc:4731

#15 0x00007f5d3d588464 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0

#16 0x00007f5d3b9519df in clone () from /lib/x86_64-linux-gnu/libc.so.6

Would be great to know if this is sufficient information to solve the issue, or if a core-dump is still required.

Dreas van Donselaar added a comment - 2017-04-26 08:27 - edited I presume the relevant bit is: Thread 5 (Thread 0x7f5d1fcb9700 (LWP 19725)): #0 0x00007f5d3d58e18f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x000055d7a7d935f9 in inline_mysql_cond_wait (src_file=0x55d7a80fee80 "/root/mariadb-10.0.29/storage/maria/ma_pagecache.c", src_line=1991, mutex=0x55d7a90ae0a8 <maria_pagecache_var+200>, that=0x7f5d080008c8) at /root/mariadb-10.0.29/include/mysql/psi/mysql_thread.h:1165 #2 find_block (pagecache=pagecache@entry=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=pageno@entry=1, init_hits_left=init_hits_left@entry=3, wrmode=1 '\001', block_is_copied=1 '\001', reg_req=1 '\001', page_st=0x7f5d1fcb4d10) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:1991 #3 0x000055d7a7d94d2f in pagecache_read (pagecache=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=1, level=level@entry=3, buff=buff@entry=0x7f5d1fcb4e30 "", type=PAGECACHE_LSN_PAGE, lock=PAGECACHE_LOCK_WRITE, page_link=0x7f5d1fcb4dc0) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:3391 #4 0x000055d7a7da9b03 in _ma_fetch_keypage (page=page@entry=0x7f5d1fcb77d0, info=info@entry=0x7f5d083464b8, keyinfo=keyinfo@entry=0x7f5d080113f8, pos=pos@entry=8192, lock=lock@entry=PAGECACHE_LOCK_WRITE, level=level@entry=3, buff=0x7f5d1fcb4e30 "", return_buffer=0 '\000') at /root/mariadb-10.0.29/storage/maria/ma_page.c:109 #5 0x000055d7a7dc4033 in _ma_ck_real_delete (info=info@entry=0x7f5d083464b8, key=key@entry=0x7f5d1fcb7da0, root=root@entry=0x7f5d1fcb7858) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:222 #6 0x000055d7a7dc4339 in _ma_ck_delete (info=info@entry=0x7f5d083464b8, key=0x7f5d1fcb7da0) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:178 #7 0x000055d7a7dc1da0 in maria_delete (info=0x7f5d083464b8, record=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/storage/maria/ma_delete.c:88 #8 0x000055d7a7bc968f in handler::ha_delete_row (this=0x7f5d080d8880, buf=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/sql/handler.cc:6076 #9 0x000055d7a7b6ca38 in rpl_slave_state::record_gtid (this=0x55d7aa384f50, thd=0x7f5d08001278, gtid=gtid@entry=0x7f5d1fcb8610, sub_id=sub_id@entry=461048395, in_transaction=in_transaction@entry=true, in_statement=in_statement@entry=false) at /root/mariadb-10.0.29/sql/rpl_gtid.cc:659 #10 0x000055d7a7c6d868 in Query_log_event::do_apply_event (this=0x7f5d080a9718, rgi=0x7f5d08000990, query_arg=0x7f5d080f4c85 "COMMIT", q_len_arg=6) at /root/mariadb-10.0.29/sql/log_event.cc:4312 #11 0x000055d7a7a20345 in Log_event::apply_event (rgi=0x7f5d08000990, this=0x7f5d080a9718) at /root/mariadb-10.0.29/sql/log_event.h:1360 #12 apply_event_and_update_pos (ev=ev@entry=0x7f5d080a9718, thd=thd@entry=0x7f5d08001278, rgi=rgi@entry=0x7f5d08000990, rpt=rpt@entry=0x0) at /root/mariadb-10.0.29/sql/slave.cc:3300 #13 0x000055d7a7a229aa in exec_relay_log_event (serial_rgi=0x7f5d08000990, rli=0x55d7aa6e9b28, thd=0x7f5d08001278) at /root/mariadb-10.0.29/sql/slave.cc:3630 #14 handle_slave_sql (arg=0x55d7aa6e8160) at /root/mariadb-10.0.29/sql/slave.cc:4731 #15 0x00007f5d3d588464 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #16 0x00007f5d3b9519df in clone () from /lib/x86_64-linux-gnu/libc.so.6 Would be great to know if this is sufficient information to solve the issue, or if a core-dump is still required.

Elena Stepanova added a comment - 2017-04-30 23:41

monty, please see above.

Elena Stepanova added a comment - 2017-04-30 23:41 monty , please see above.

Michael Widenius added a comment - 2017-05-04 16:45 - edited

Here is a patch that should fix the above hang (bug in aria page cache under very rare circumstances when things are evicted from table cache):

— a/storage/maria/ma_pagecache.c
+++ b/storage/maria/ma_pagecache.c
@@ -1975,8 +1975,8 @@ static PAGECACHE_BLOCK_LINK *find_block(PAGECACHE *pagecache,
}
else
{

DBUG_ASSERT(hash_link->requests > 0);
hash_link->requests--;
+ DBUG_ASSERT(block->hash_link == hash_link);
+ remove_reader(block);
KEYCACHE_DBUG_PRINT("find_block",
("request waiting for old page to be saved"));
{

See the attached 'patch' for a full patch for this and similar failures

Michael Widenius added a comment - 2017-05-04 16:45 - edited Here is a patch that should fix the above hang (bug in aria page cache under very rare circumstances when things are evicted from table cache): — a/storage/maria/ma_pagecache.c +++ b/storage/maria/ma_pagecache.c @@ -1975,8 +1975,8 @@ static PAGECACHE_BLOCK_LINK *find_block(PAGECACHE *pagecache, } else { DBUG_ASSERT(hash_link->requests > 0); hash_link->requests--; + DBUG_ASSERT(block->hash_link == hash_link); + remove_reader(block); KEYCACHE_DBUG_PRINT("find_block", ("request waiting for old page to be saved")); { See the attached 'patch' for a full patch for this and similar failures

Michael Widenius added a comment - 2017-05-05 15:27

Fixed and pushed into 10.0

Michael Widenius added a comment - 2017-05-05 15:27 Fixed and pushed into 10.0

MariaDB Server

Table lock race condition with replication

Details

Description

Attachments

Attachments

Activity

People

Dates

Git Integration