[MDEV-10104] Table lock race condition with replication Created: 2016-05-23  Updated: 2017-05-05  Resolved: 2017-05-05

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - Aria
Affects Version/s: 10.0.24
Fix Version/s: 10.0.31

Type: Bug Priority: Major
Reporter: Dreas van Donselaar Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None
Environment:

Debian Linux


Attachments: Text File global_status.txt     HTML File patch     Text File strace.txt     Text File strace_170419_1.txt     Text File strace_170419_2.txt     Text File strace_170425.txt    

 Description   

We have a MariaDB server that sometimes seems to acquire an indefinite table lock. The only way to resolve it is to kill MariaDB and restart it. It "randomly happens", so I suspect it may be a race condition. The replication thread gets stuck in state "After opening tables". The lock seems to be caused by a query, in state "Opening tables", on an unrelated database with 3 million MyISAM tables (one that does not receive data via replication). These databases are completely separate, so they should not be able to block each other.

Processlist shows:

root@server ~ # mysql -e "show full processlist"
+----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+
| Id       | User                 | Host                  | db                   | Command | Time    | State                            | Info                                                  | Progress |
+----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+
|        2 | system user          |                       | NULL                 | Connect | 1116538 | Waiting for master to send event | NULL                                                  |    0.000 |
|        3 | system user          |                       | replication_database               | Connect |  264121 | After opening tables             | COMMIT                                                |    0.000 |
| 37332419 | user1                 | localhost             | large_database          | Killed  |  264121 | Opening tables                   | DELETE FROM `my_domain` WHERE date<'2016-04-29' |    0.000 |
| 48692447 | root                 | localhost             | NULL                 | Query   |       0 | init                             | show full processlist                                 |    0.000 |
+----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+
root@server ~ # 



 Comments   
Comment by Elena Stepanova [ 2016-06-03 ]

dreas, could you please upload the error log from slave startup until the moment you had to kill it? Thanks.

Comment by Elena Stepanova [ 2016-07-29 ]

Please comment if you have further information on the issue, and the report will be re-opened.

Comment by Dreas van Donselaar [ 2016-12-01 ]

So this is occurring again on a server:

~ # mysql -e "show full processlist" | head -n4
Id User Host db Command Time State Info Progress
2 system user NULL Connect 524935 Waiting for master to send event NULL 0.000
4 system user replication_database1 Connect 155648 After opening tables COMMIT 0.000
102091094 exim localhost local_database Query 155641 Opening tables SELECT CONCAT(host, ':', port) FROM replication_database2.table1 WHERE domain='example.com' ORDER BY id 0.000

Happy to collect more details, but I'm not sure how to do so. There is nothing in mysql.log around the time this issue occurred.

Comment by Dreas van Donselaar [ 2016-12-01 ]

After a kill -6:

161201 11:47:29 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.0.28-MariaDB-1~sid
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=2001
max_threads=2002
thread_count=198
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4658156 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55bde5373e1e]
/usr/sbin/mysqld(handle_fatal_signal+0x2fc)[0x55bde4edebcc]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11100)[0x7ff723bca100]
/lib/x86_64-linux-gnu/libc.so.6(__poll+0x2d)[0x7ff7220dc56d]
/usr/sbin/mysqld(_Z26handle_connections_socketsv+0x1e7)[0x55bde4cc83f7]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0xff6)[0x55bde4ccc5f6]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7ff72201d2b1]
/usr/sbin/mysqld(_start+0x2a)[0x55bde4cc041a]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
161201 11:47:33 mysqld_safe Number of processes running now: 0
161201 11:47:33 mysqld_safe mysqld restarted

Comment by Dreas van Donselaar [ 2016-12-01 ]

I'm uploading mdev10104.tgz with the core dump, MySQL error logs, my.cnf, and MariaDB binaries to the private FTP.

Comment by Dreas van Donselaar [ 2017-03-01 ]

We hit what appears to be the same bug again with some different conditions:

root@server ~ # mysql -e "SHOW FULL processlist"
+----------+-------------+-----------+--------+---------+--------+----------------------------------+-------------------------------------------------------------------------------------------------+----------+
| Id       | User        | Host      | db     | Command | Time   | State                            | Info                                                                                            | Progress |
+----------+-------------+-----------+--------+---------+--------+----------------------------------+-------------------------------------------------------------------------------------------------+----------+
|        2 | system user |           | NULL   | Connect | 320720 | Waiting for master to send event | NULL                                                                                            |    0.000 |
|        3 | system user |           | replicated_database1 | Connect |  46087 | After opening tables             | COMMIT                                                                                          |    0.000 |
| 47887750 | mysql_user        | localhost | replicated_database1 | Query   |  46086 | Opening tables                   | SELECT user FROM users WHERE a='' AND b IN ('*', '') |    0.000 |
| 53170618 | root        | localhost | NULL   | Query   |      0 | init                             | SHOW FULL processlist                                                                           |    0.000 |
+----------+-------------+-----------+--------+---------+--------+----------------------------------+-------------------------------------------------------------------------------------------------+----------+
root@server ~ # 

I can execute the SELECT without issues:

MariaDB [(none)]> use replicated_database1;
Database changed
MariaDB [replicated_database1]> SELECT user FROM users WHERE a='' AND b IN ('*', '')
Empty set (0.00 sec)
 
MariaDB [filter]>

Replication hangs on "COMMIT" in state "After opening tables" again; the SELECT query is on the same database this time. This specific server does NOT have millions of tables as in the other example, so apparently that is not a condition for this to occur. Running 10.0.29-MariaDB-1~sid.

Comment by Daniel Black [ 2017-03-01 ]

A backtrace from all threads might give a clue as to which lock is being hit:

gdb --batch --eval-command="thread apply all bt" -p $(pgrep -xn mysqld)

Comment by Dreas van Donselaar [ 2017-03-01 ]

I attached that already.

Comment by Dreas van Donselaar [ 2017-04-19 ]

A potential theory from IRC:

(14:08:50) montywi: the issue is that ha_write_row hangs, which it should never be able to do
(14:09:31) montywi: could be a hang on disk
(14:10:06) montywi: except there is a pthread_cond_wait on top of the trace. hm...
(14:12:08) montywi: dreas: ok, I have a guess what could be happening:
(14:12:37) montywi: - The open table cache gets full and it tries to eject a table that is in use (which shouldn't happen), which causes a deadlock

I was asked to run a MariaDB debug binary to collect the required details. Will try to get that done.

Comment by Elena Stepanova [ 2017-04-19 ]

Assigning to Monty to have a proper owner, but we are actually waiting for the feedback as mentioned above.

Comment by Daniel Black [ 2017-04-19 ]

10.0.25 and later have a debug-symbols package for a more descriptive backtrace: https://downloads.mariadb.com/MariaDB/mariadb-10.0.30/repo/debian/pool/main/m/mariadb-10.0/

edit: appears to be available only for sid.

Comment by Dreas van Donselaar [ 2017-04-25 ]

New backtrace attached. Unfortunately mysqld started "hanging" shortly afterwards and became unresponsive to "kill -6"; in the end I had to force-kill it with "kill -9", which didn't produce a core dump. I did get a core dump from another slave that hit the exact same issue (same position), though it was not running the special binary. Let me know if that should be uploaded; hopefully the new backtrace provides sufficient information.

Comment by Dreas van Donselaar [ 2017-04-26 ]

I presume the relevant bit is:

Thread 5 (Thread 0x7f5d1fcb9700 (LWP 19725)):
#0  0x00007f5d3d58e18f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x000055d7a7d935f9 in inline_mysql_cond_wait (src_file=0x55d7a80fee80 "/root/mariadb-10.0.29/storage/maria/ma_pagecache.c", src_line=1991, mutex=0x55d7a90ae0a8 <maria_pagecache_var+200>, that=0x7f5d080008c8) at /root/mariadb-10.0.29/include/mysql/psi/mysql_thread.h:1165
#2  find_block (pagecache=pagecache@entry=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=pageno@entry=1, init_hits_left=init_hits_left@entry=3, wrmode=1 '\001', block_is_copied=1 '\001', reg_req=1 '\001', page_st=0x7f5d1fcb4d10) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:1991
#3  0x000055d7a7d94d2f in pagecache_read (pagecache=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=1, level=level@entry=3, buff=buff@entry=0x7f5d1fcb4e30 "", type=PAGECACHE_LSN_PAGE, lock=PAGECACHE_LOCK_WRITE, page_link=0x7f5d1fcb4dc0) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:3391
#4  0x000055d7a7da9b03 in _ma_fetch_keypage (page=page@entry=0x7f5d1fcb77d0, info=info@entry=0x7f5d083464b8, keyinfo=keyinfo@entry=0x7f5d080113f8, pos=pos@entry=8192, lock=lock@entry=PAGECACHE_LOCK_WRITE, level=level@entry=3, buff=0x7f5d1fcb4e30 "", return_buffer=0 '\000') at /root/mariadb-10.0.29/storage/maria/ma_page.c:109
#5  0x000055d7a7dc4033 in _ma_ck_real_delete (info=info@entry=0x7f5d083464b8, key=key@entry=0x7f5d1fcb7da0, root=root@entry=0x7f5d1fcb7858) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:222
#6  0x000055d7a7dc4339 in _ma_ck_delete (info=info@entry=0x7f5d083464b8, key=0x7f5d1fcb7da0) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:178
#7  0x000055d7a7dc1da0 in maria_delete (info=0x7f5d083464b8, record=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/storage/maria/ma_delete.c:88
#8  0x000055d7a7bc968f in handler::ha_delete_row (this=0x7f5d080d8880, buf=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/sql/handler.cc:6076
#9  0x000055d7a7b6ca38 in rpl_slave_state::record_gtid (this=0x55d7aa384f50, thd=0x7f5d08001278, gtid=gtid@entry=0x7f5d1fcb8610, sub_id=sub_id@entry=461048395, in_transaction=in_transaction@entry=true, in_statement=in_statement@entry=false) at /root/mariadb-10.0.29/sql/rpl_gtid.cc:659
#10 0x000055d7a7c6d868 in Query_log_event::do_apply_event (this=0x7f5d080a9718, rgi=0x7f5d08000990, query_arg=0x7f5d080f4c85 "COMMIT", q_len_arg=6) at /root/mariadb-10.0.29/sql/log_event.cc:4312
#11 0x000055d7a7a20345 in Log_event::apply_event (rgi=0x7f5d08000990, this=0x7f5d080a9718) at /root/mariadb-10.0.29/sql/log_event.h:1360
#12 apply_event_and_update_pos (ev=ev@entry=0x7f5d080a9718, thd=thd@entry=0x7f5d08001278, rgi=rgi@entry=0x7f5d08000990, rpt=rpt@entry=0x0) at /root/mariadb-10.0.29/sql/slave.cc:3300
#13 0x000055d7a7a229aa in exec_relay_log_event (serial_rgi=0x7f5d08000990, rli=0x55d7aa6e9b28, thd=0x7f5d08001278) at /root/mariadb-10.0.29/sql/slave.cc:3630
#14 handle_slave_sql (arg=0x55d7aa6e8160) at /root/mariadb-10.0.29/sql/slave.cc:4731
#15 0x00007f5d3d588464 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x00007f5d3b9519df in clone () from /lib/x86_64-linux-gnu/libc.so.6

It would be great to know whether this is sufficient information to solve the issue, or if a core dump is still required.

Comment by Elena Stepanova [ 2017-04-30 ]

monty, please see above.

Comment by Michael Widenius [ 2017-05-04 ]

Here is a patch that should fix the above hang (a bug in the Aria page cache, hit under very rare circumstances when tables are evicted from the table cache):

--- a/storage/maria/ma_pagecache.c
+++ b/storage/maria/ma_pagecache.c
@@ -1975,8 +1975,8 @@ static PAGECACHE_BLOCK_LINK *find_block(PAGECACHE *pagecache,
     }
     else
     {
-      DBUG_ASSERT(hash_link->requests > 0);
-      hash_link->requests--;
+      DBUG_ASSERT(block->hash_link == hash_link);
+      remove_reader(block);
       KEYCACHE_DBUG_PRINT("find_block",
                           ("request waiting for old page to be saved"));
       {

See the attached 'patch' for the full patch covering this and similar failures.

Comment by Michael Widenius [ 2017-05-05 ]

Fixed and pushed into 10.0

Generated at Thu Feb 08 07:39:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.