MariaDB Server / MDEV-5863

"Lock wait timeout exceeded", lost events, Assertion `!table || !table->in_use || table->in_use == _current_thd()' failure upon parallel replicating

Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Fix Version: 10.0.10

    Description

      The attached binary log was created by a fair concurrent test on the master, without any manual editing.
      In case it is relevant, the master server was running with the following options:

      10.0/sql/mysqld --no-defaults --basedir=10.0 --datadir=/home/elenst/test_results/analyze2/data --lc-messages-dir=10.0/sql/share --character-sets-dir=10.0/sql/share/charsets --tmpdir=/home/elenst/test_results/analyze2/tmp --core-file --max-allowed-packet=128Mb --port=10730 --socket=/home/elenst/test_results/analyze2/mysql.sock --pid-file=/home/elenst/test_results/analyze2/mysql.pid --general-log --general-log-file=/home/elenst/test_results/analyze2/mysql.log --server_id=1 --report-host=127.0.0.1 --report_port=10730 --innodb --sql-mode=no_engine_substitution --binlog_commit_wait_count=10 --binlog_commit_wait_usec=1000000 --ignore-builtin-innodb --plugin-load=ha_innodb.so --slave-skip-errors=1054,1317,1049,1305,1539,1505 --log-output=FILE --slave_parallel_threads=16 --slave_parallel_max_queued=4194304

      Despite the slave-specific options, this server was not configured as a slave and was not replicating from anywhere.

      Now I'm feeding this binary log to a slave running with slave-parallel-threads=16 (no other special options are necessary on either the master or the slave).

      I'm getting several problems.

      Replication failure

      Initially replication breaks with

      140314 18:04:40 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'test'. Query: 'INSERT INTO `view_A` ( `pk` ) VALUES (NULL)', Internal MariaDB error code: 1205
      140314 18:04:40 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
      140314 18:04:40 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 312054

      It happens every time, in the same place, so it does not seem to be a race condition; it is probably caused by some logic specific to parallel replication.

      Data discrepancy

      After the previous failure, I simply issue START SLAVE again, and replication does not fail anymore.
      But when I check the contents of view_A, it turns out that it only has 4 rows, while the master has 30+ (either the master on the original datadir, or after feeding the binlog into a fresh one via the mysql client).
      This also happens every time.

      Assertion failure

      When the slave reports that it has finished replication, I attempt to shut down the server and get an assertion failure upon shutdown:

      mysqld: 10.0/sql/handler.cc:2381: THD* handler::ha_thd() const: Assertion `!table || !table->in_use || table->in_use == _current_thd()' failed.
      140314 18:12:47 [ERROR] mysqld got signal 6 ;

      #5  0x00007f0a9a4b46f0 in *__GI_abort () at abort.c:92
      #6  0x00007f0a9a4aa621 in *__GI___assert_fail (assertion=0xfdbc48 "!table || !table->in_use || table->in_use == _current_thd()", file=<optimized out>, line=2381, function=0xfde5c0 "THD* handler::ha_thd() const") at assert.c:81
      #7  0x000000000085f648 in handler::ha_thd (this=0x7f0a6d62f088) at 10.0/sql/handler.cc:2381
      #8  0x0000000000b74035 in ha_innobase::extra (this=0x7f0a6d62f088, operation=HA_EXTRA_DETACH_CHILDREN) at 10.0/storage/xtradb/handler/ha_innodb.cc:12762
      #9  0x000000000060fc14 in mark_tmp_table_for_reuse (table=0x7f0a7121c070) at 10.0/sql/sql_base.cc:696
      #10 0x000000000060faf2 in mark_temp_tables_as_free_for_reuse (thd=0x7f0a6bd11070) at 10.0/sql/sql_base.cc:663
      #11 0x0000000000610224 in close_thread_tables (thd=0x7f0a6bd11070) at 10.0/sql/sql_base.cc:943
      #12 0x000000000077d1e1 in rpl_group_info::slave_close_thread_tables (this=0x7f0a6bcde870, thd=0x7f0a6bd11070) at 10.0/sql/rpl_rli.cc:1699
      #13 0x000000000077ced0 in rpl_group_info::cleanup_context (this=0x7f0a6bcde870, thd=0x7f0a6bd11070, error=true) at 10.0/sql/rpl_rli.cc:1611
      #14 0x00000000005e3381 in handle_slave_sql (arg=0x7f0a7c3ad070) at 10.0/sql/slave.cc:4594
      #15 0x0000000000a2da58 in pfs_spawn_thread (arg=0x7f0a72887ef0) at 10.0/storage/perfschema/pfs.cc:1853
      #16 0x00007f0a9c04eb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
      #17 0x00007f0a9a559a7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

      Stack trace from:

      revision-id: monty@mariadb.org-20140313144311-hmphz2gup3qdjlp9
      date: 2014-03-13 16:43:11 +0200
      build-date: 2014-03-14 19:43:28 +0400
      revno: 4051
      branch-nick: 10.0

      Server where the binary log was created

      revision-id: svoj@mariadb.org-20140307074306-blf3begv8ee0tb5j
      date: 2014-03-07 11:43:06 +0400
      build-date: 2014-03-14 17:43:49 +0200
      revno: 4045
      branch-nick: 10.0

    Attachments

    Issue Links

    Activity

            elenst Elena Stepanova added a comment:

            I modified the command line in the description; it was mistakenly copied from the slave startup. All options are the same, and the disclaimer about the unused slave options applies; only server_id, port and datadir location were different – those I fixed to avoid confusion.

            knielsen Kristian Nielsen added a comment:

            I was able to reproduce this with the supplied binlog file in a mysql-test-run test case.

            svoj Sergey Vojtovich added a comment:

            Very likely this has something to do with my table cache patches.

            svoj Sergey Vojtovich added a comment:

            On the other hand, it looks like temporary tables are involved. These don't go through the table cache and must have table->in_use set to 0. Perhaps it is uninitialized?

            svoj Sergey Vojtovich added a comment:

            Hmm... not really, they should have a proper in_use. But anyway, the crash doesn't seem to be related to the table cache, because a temporary table is involved.

            knielsen Kristian Nielsen added a comment:

            There are multiple bugs visible in this report (thanks, btw!).

            The lock wait timeout is a rather serious, and complex, issue with InnoDB row
            locks. Below is a small test case that reproduces it (it is not deterministic
            and may need to be run a couple of times with --repeat to trigger the issue).

            The problem is an INSERT and a DELETE on the same table. The DELETE blocks the
            INSERT until commit, but the INSERT does not similarly block the DELETE. This
            means that the INSERT can run first and group commit with the DELETE on the
            master, but then on the slave, if the DELETE runs ahead, it will block the
            INSERT, and the INSERT in turn will block the DELETE due to commit
            ordering. The result is a deadlock.

            --source include/have_innodb.inc
            --source include/have_debug_sync.inc
            --source include/master-slave.inc

            --connection slave
            --source include/stop_slave.inc

            --connection master
            CREATE TABLE t1 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
            INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);

            # Create a group commit with INSERT and DELETE, in that order.
            --connect (con1,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
            SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued1 WAIT_FOR master_cont1';
            send INSERT INTO t1 VALUES (7, NULL);
            --connection master
            SET debug_sync='now WAIT_FOR master_queued1';
            --connect (con2,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
            SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued2';
            send DELETE FROM t1 WHERE b <= 3;

            --connection master
            SET debug_sync='now WAIT_FOR master_queued2';
            SET debug_sync='now SIGNAL master_cont1';

            --connection con1
            reap;
            --connection con2
            reap;
            SET debug_sync='RESET';
            --save_master_pos

            --connection slave
            SET @old_parallel= @@GLOBAL.slave_parallel_threads;
            SET GLOBAL slave_parallel_threads=16;
            --source include/start_slave.inc
            --sync_with_master

            SELECT * FROM t1 ORDER BY a;

            --source include/stop_slave.inc
            SET GLOBAL slave_parallel_threads=@old_parallel;
            --source include/start_slave.inc

            --connection master
            DROP TABLE t1;
            --source include/rpl_end.inc


            knielsen Kristian Nielsen added a comment:

            Ok, so I will close this as fixed.

            There were several important issues found by this test:

            1. A temporary table bug seen during shutdown (the assertion). Monty fixed
            this in an earlier push to 10.0.

            2. MDEV-5921. This was also reported by nanyi607rao on
            maria-developers@lists.launchpad.net; I have pushed a fix for it.

            3. MDEV-5914. This bug is still open; however, I have pushed to 10.0 a
            temporary fix that should be sufficient to make this bug disappear.

            Please re-run the test to check that it is really fixed. At some point I
            saw another error in the logs about temporary tables when I applied the
            attached binlog. I cannot repeat it now, so it may have been another symptom
            of problem (1), but it would be good to get verification. I also did not
            verify that the row differences between the tables are gone, though fixing (2)
            and (3) should hopefully take care of them.

            This was a very good test, really important to get parallel replication ready
            for end-users.


            People

              Assignee: knielsen Kristian Nielsen
              Reporter: elenst Elena Stepanova
              Votes: 0
              Watchers: 3
