Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-10104

Table lock race condition with replication

Details

    Description

      We have a MariaDB server that sometimes seems to get an indefinite table lock. The only way to solve it is to kill MariaDB and restart it. It "randomly happens", so I suspect it may be a race condition. The replication thread get's stuck in state "After opening tables". The lock seems to be caused by a query on an unrelated database with 3 million MyISAM tables (not receiving data via replication) in state "Opening tables". These databases are completely separated, so they shouldn't cause a lock between each other.

      Processlist shows:

      root@server ~ # mysql -e "show full processlist"
      +----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+
      | Id       | User                 | Host                  | db                   | Command | Time    | State                            | Info                                                  | Progress |
      +----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+
      |        2 | system user          |                       | NULL                 | Connect | 1116538 | Waiting for master to send event | NULL                                                  |    0.000 |
      |        3 | system user          |                       | replication_database               | Connect |  264121 | After opening tables             | COMMIT                                                |    0.000 |
      | 37332419 | user1                 | localhost             | large_database          | Killed  |  264121 | Opening tables                   | DELETE FROM `my_domain` WHERE date<'2016-04-29' |    0.000 |
      | 48692447 | root                 | localhost             | NULL                 | Query   |       0 | init                             | show full processlist                                 |    0.000 |
      +----------+----------------------+-----------------------+----------------------+---------+---------+----------------------------------+-------------------------------------------------------+----------+
      root@server ~ # 
      

      Attachments

        1. global_status.txt
          26 kB
          Dreas van Donselaar
        2. strace.txt
          608 kB
          Dreas van Donselaar
        3. strace_170419_1.txt
          71 kB
          Dreas van Donselaar
        4. strace_170419_2.txt
          68 kB
          Dreas van Donselaar
        5. strace_170425.txt
          605 kB
          Dreas van Donselaar
        6. patch
          6 kB
          Michael Widenius

        Activity

          New backtrace attached. Unfortunately mysqld started "hanging" shortly afterwards and became unresponsive to "kill -6". In the end I had to force kill it with "kill -9" which didn't produce a core dump. I did get a core dump from another slave that had the exact same issue (same position), however not running with the special binary. Let me know if that should be uploaded, hopefully the new backtrace provides sufficient information.

          dreas Dreas van Donselaar added a comment - New backtrace attached. Unfortunately mysqld started "hanging" shortly afterwards and became unresponsive to "kill -6". In the end I had to force kill it with "kill -9" which didn't produce a core dump. I did get a core dump from another slave that had the exact same issue (same position), however not running with the special binary. Let me know if that should be uploaded, hopefully the new backtrace provides sufficient information.
          dreas Dreas van Donselaar added a comment - - edited

          I presume the relevant bit is:

          Thread 5 (Thread 0x7f5d1fcb9700 (LWP 19725)):
          #0  0x00007f5d3d58e18f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
          #1  0x000055d7a7d935f9 in inline_mysql_cond_wait (src_file=0x55d7a80fee80 "/root/mariadb-10.0.29/storage/maria/ma_pagecache.c", src_line=1991, mutex=0x55d7a90ae0a8 <maria_pagecache_var+200>, that=0x7f5d080008c8) at /root/mariadb-10.0.29/include/mysql/psi/mysql_thread.h:1165
          #2  find_block (pagecache=pagecache@entry=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=pageno@entry=1, init_hits_left=init_hits_left@entry=3, wrmode=1 '\001', block_is_copied=1 '\001', reg_req=1 '\001', page_st=0x7f5d1fcb4d10) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:1991
          #3  0x000055d7a7d94d2f in pagecache_read (pagecache=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=1, level=level@entry=3, buff=buff@entry=0x7f5d1fcb4e30 "", type=PAGECACHE_LSN_PAGE, lock=PAGECACHE_LOCK_WRITE, page_link=0x7f5d1fcb4dc0) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:3391
          #4  0x000055d7a7da9b03 in _ma_fetch_keypage (page=page@entry=0x7f5d1fcb77d0, info=info@entry=0x7f5d083464b8, keyinfo=keyinfo@entry=0x7f5d080113f8, pos=pos@entry=8192, lock=lock@entry=PAGECACHE_LOCK_WRITE, level=level@entry=3, buff=0x7f5d1fcb4e30 "", return_buffer=0 '\000') at /root/mariadb-10.0.29/storage/maria/ma_page.c:109
          #5  0x000055d7a7dc4033 in _ma_ck_real_delete (info=info@entry=0x7f5d083464b8, key=key@entry=0x7f5d1fcb7da0, root=root@entry=0x7f5d1fcb7858) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:222
          #6  0x000055d7a7dc4339 in _ma_ck_delete (info=info@entry=0x7f5d083464b8, key=0x7f5d1fcb7da0) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:178
          #7  0x000055d7a7dc1da0 in maria_delete (info=0x7f5d083464b8, record=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/storage/maria/ma_delete.c:88
          #8  0x000055d7a7bc968f in handler::ha_delete_row (this=0x7f5d080d8880, buf=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/sql/handler.cc:6076
          #9  0x000055d7a7b6ca38 in rpl_slave_state::record_gtid (this=0x55d7aa384f50, thd=0x7f5d08001278, gtid=gtid@entry=0x7f5d1fcb8610, sub_id=sub_id@entry=461048395, in_transaction=in_transaction@entry=true, in_statement=in_statement@entry=false) at /root/mariadb-10.0.29/sql/rpl_gtid.cc:659
          #10 0x000055d7a7c6d868 in Query_log_event::do_apply_event (this=0x7f5d080a9718, rgi=0x7f5d08000990, query_arg=0x7f5d080f4c85 "COMMIT", q_len_arg=6) at /root/mariadb-10.0.29/sql/log_event.cc:4312
          #11 0x000055d7a7a20345 in Log_event::apply_event (rgi=0x7f5d08000990, this=0x7f5d080a9718) at /root/mariadb-10.0.29/sql/log_event.h:1360
          #12 apply_event_and_update_pos (ev=ev@entry=0x7f5d080a9718, thd=thd@entry=0x7f5d08001278, rgi=rgi@entry=0x7f5d08000990, rpt=rpt@entry=0x0) at /root/mariadb-10.0.29/sql/slave.cc:3300
          #13 0x000055d7a7a229aa in exec_relay_log_event (serial_rgi=0x7f5d08000990, rli=0x55d7aa6e9b28, thd=0x7f5d08001278) at /root/mariadb-10.0.29/sql/slave.cc:3630
          #14 handle_slave_sql (arg=0x55d7aa6e8160) at /root/mariadb-10.0.29/sql/slave.cc:4731
          #15 0x00007f5d3d588464 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
          #16 0x00007f5d3b9519df in clone () from /lib/x86_64-linux-gnu/libc.so.6
          

          Would be great to know if this is sufficient information to solve the issue, or if a core-dump is still required.

          dreas Dreas van Donselaar added a comment - - edited I presume the relevant bit is: Thread 5 (Thread 0x7f5d1fcb9700 (LWP 19725)): #0 0x00007f5d3d58e18f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0 #1 0x000055d7a7d935f9 in inline_mysql_cond_wait (src_file=0x55d7a80fee80 "/root/mariadb-10.0.29/storage/maria/ma_pagecache.c", src_line=1991, mutex=0x55d7a90ae0a8 <maria_pagecache_var+200>, that=0x7f5d080008c8) at /root/mariadb-10.0.29/include/mysql/psi/mysql_thread.h:1165 #2 find_block (pagecache=pagecache@entry=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=pageno@entry=1, init_hits_left=init_hits_left@entry=3, wrmode=1 '\001', block_is_copied=1 '\001', reg_req=1 '\001', page_st=0x7f5d1fcb4d10) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:1991 #3 0x000055d7a7d94d2f in pagecache_read (pagecache=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=1, level=level@entry=3, buff=buff@entry=0x7f5d1fcb4e30 "", type=PAGECACHE_LSN_PAGE, lock=PAGECACHE_LOCK_WRITE, page_link=0x7f5d1fcb4dc0) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:3391 #4 0x000055d7a7da9b03 in _ma_fetch_keypage (page=page@entry=0x7f5d1fcb77d0, info=info@entry=0x7f5d083464b8, keyinfo=keyinfo@entry=0x7f5d080113f8, pos=pos@entry=8192, lock=lock@entry=PAGECACHE_LOCK_WRITE, level=level@entry=3, buff=0x7f5d1fcb4e30 "", return_buffer=0 '\000') at /root/mariadb-10.0.29/storage/maria/ma_page.c:109 #5 0x000055d7a7dc4033 in _ma_ck_real_delete (info=info@entry=0x7f5d083464b8, key=key@entry=0x7f5d1fcb7da0, root=root@entry=0x7f5d1fcb7858) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:222 #6 0x000055d7a7dc4339 in _ma_ck_delete (info=info@entry=0x7f5d083464b8, key=0x7f5d1fcb7da0) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:178 #7 0x000055d7a7dc1da0 in maria_delete (info=0x7f5d083464b8, record=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/storage/maria/ma_delete.c:88 #8 0x000055d7a7bc968f in handler::ha_delete_row (this=0x7f5d080d8880, buf=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/sql/handler.cc:6076 #9 0x000055d7a7b6ca38 in rpl_slave_state::record_gtid (this=0x55d7aa384f50, thd=0x7f5d08001278, gtid=gtid@entry=0x7f5d1fcb8610, sub_id=sub_id@entry=461048395, in_transaction=in_transaction@entry=true, in_statement=in_statement@entry=false) at /root/mariadb-10.0.29/sql/rpl_gtid.cc:659 #10 0x000055d7a7c6d868 in Query_log_event::do_apply_event (this=0x7f5d080a9718, rgi=0x7f5d08000990, query_arg=0x7f5d080f4c85 "COMMIT", q_len_arg=6) at /root/mariadb-10.0.29/sql/log_event.cc:4312 #11 0x000055d7a7a20345 in Log_event::apply_event (rgi=0x7f5d08000990, this=0x7f5d080a9718) at /root/mariadb-10.0.29/sql/log_event.h:1360 #12 apply_event_and_update_pos (ev=ev@entry=0x7f5d080a9718, thd=thd@entry=0x7f5d08001278, rgi=rgi@entry=0x7f5d08000990, rpt=rpt@entry=0x0) at /root/mariadb-10.0.29/sql/slave.cc:3300 #13 0x000055d7a7a229aa in exec_relay_log_event (serial_rgi=0x7f5d08000990, rli=0x55d7aa6e9b28, thd=0x7f5d08001278) at /root/mariadb-10.0.29/sql/slave.cc:3630 #14 handle_slave_sql (arg=0x55d7aa6e8160) at /root/mariadb-10.0.29/sql/slave.cc:4731 #15 0x00007f5d3d588464 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #16 0x00007f5d3b9519df in clone () from /lib/x86_64-linux-gnu/libc.so.6 Would be great to know if this is sufficient information to solve the issue, or if a core-dump is still required.

          monty, please see above.

          elenst Elena Stepanova added a comment - monty , please see above.
          monty Michael Widenius added a comment - - edited

          Here is a patch that should fix the above hang (bug in aria page cache under very rare circumstances when things are evicted from table cache):

          — a/storage/maria/ma_pagecache.c
          +++ b/storage/maria/ma_pagecache.c
          @@ -1975,8 +1975,8 @@ static PAGECACHE_BLOCK_LINK *find_block(PAGECACHE *pagecache,
          }
          else
          {

          • DBUG_ASSERT(hash_link->requests > 0);
          • hash_link->requests--;
            + DBUG_ASSERT(block->hash_link == hash_link);
            + remove_reader(block);
            KEYCACHE_DBUG_PRINT("find_block",
            ("request waiting for old page to be saved"));
            {

          See the attached 'patch' for a full patch for this and similar failures

          monty Michael Widenius added a comment - - edited Here is a patch that should fix the above hang (bug in aria page cache under very rare circumstances when things are evicted from table cache): — a/storage/maria/ma_pagecache.c +++ b/storage/maria/ma_pagecache.c @@ -1975,8 +1975,8 @@ static PAGECACHE_BLOCK_LINK *find_block(PAGECACHE *pagecache, } else { DBUG_ASSERT(hash_link->requests > 0); hash_link->requests--; + DBUG_ASSERT(block->hash_link == hash_link); + remove_reader(block); KEYCACHE_DBUG_PRINT("find_block", ("request waiting for old page to be saved")); { See the attached 'patch' for a full patch for this and similar failures

          Fixed and pushed into 10.0

          monty Michael Widenius added a comment - Fixed and pushed into 10.0

          People

            monty Michael Widenius
            dreas Dreas van Donselaar
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.