[MDEV-5863] "Lock wait timeout exceeded", lost events, Assertion `!table || !table->in_use || ta ble->in_use == _current_thd()' failure upon parallel replicating Created: 2014-03-14  Updated: 2014-03-21  Resolved: 2014-03-21

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: None
Fix Version/s: 10.0.10

Type: Bug Priority: Blocker
Reporter: Elena Stepanova Assignee: Kristian Nielsen
Resolution: Fixed Votes: 0
Labels: None

Attachments: File mysql-bin.000001.gz    
Issue Links:
Relates
relates to MDEV-5914 Parallel replication deadlock due to ... Closed
relates to MDEV-5921 In parallel replication, an error is ... Closed

 Description   

The attached binary log was created by a fair concurrent test on master, without any manual editing.
In case it relevant, the master server was running with the following options:

10.0/sql/mysqld --no-defaults --basedir=10.0 --datadir=/home/elenst/test_results/analyze2/data --lc-messages-dir=10.0/sql/share --character-sets-dir=10.0/sql/share/charsets --tmpdir=/home/elenst/test_results/analyze2/tmp --core-file --max-allowed-packet=128Mb --port=10730 --socket=/home/elenst/test_results/analyze2/mysql.sock --pid-file=/home/elenst/test_results/analyze2/mysql.pid --general-log --general-log-file=/home/elenst/test_results/analyze2/mysql.log --server_id=1 --report-host=127.0.0.1 --report_port=10730 --innodb --sql-mode=no_engine_substitution --binlog_commit_wait_count=10 --binlog_commit_wait_usec=1000000 --ignore-builtin-innodb --plugin-load=ha_innodb.so --slave-skip-errors=1054,1317,1049,1305,1539,1505 --log-output=FILE --slave_parallel_threads=16 --slave_parallel_max_queued=4194304

Despite slave-specific options, it was not configured as a slave and was not replicating from anywhere.

Now I'm feeding this binary log to a slave running with slave-parallel-threads=16 (no other special options either on master or on slave is necessary).

I'm getting several problems.

Replication failure

Initially replication breaks with

140314 18:04:40 [ERROR] Slave SQL: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'test'. Query: 'INSERT INTO `view_A` ( `pk` ) VALUES (NULL)', Internal MariaDB error code: 1205
140314 18:04:40 [Warning] Slave: Lock wait timeout exceeded; try restarting transaction Error_code: 1205
140314 18:04:40 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000001' position 312054

It happens every time, in the same place, does not seem to be a race condition, probably some logic specific to parallel replication.

Data discrepancy

After the previous failure, I simply issue START SLAVE again, and replication does not fail anymore.
But when I check the contents of view_A, it turns out that it only has 4 rows while the master has 30+ (either the master on the original datadir, or after feeding the binlog into a new one via MySQL client).
It also happens every time.

Assertion failure

When slave reports that it has finished replication, I attempt to shut down the server, and get the assertion failure upon shutdown:

mysqld: 10.0/sql/handler.cc:2381: THD* handler::ha_thd() const: Assertion `!table || !table->in_use || table->in_use == _current_thd()' failed.
140314 18:12:47 [ERROR] mysqld got signal 6 ;

#5  0x00007f0a9a4b46f0 in *__GI_abort () at abort.c:92
#6  0x00007f0a9a4aa621 in *__GI___assert_fail (assertion=0xfdbc48 "!table || !table->in_use || table->in_use == _current_thd()", file=<optimized out>, line=2381, function=0xfde5c0 "THD* handler::ha_thd() const") at assert.c:81
#7  0x000000000085f648 in handler::ha_thd (this=0x7f0a6d62f088) at 10.0/sql/handler.cc:2381
#8  0x0000000000b74035 in ha_innobase::extra (this=0x7f0a6d62f088, operation=HA_EXTRA_DETACH_CHILDREN) at 10.0/storage/xtradb/handler/ha_innodb.cc:12762
#9  0x000000000060fc14 in mark_tmp_table_for_reuse (table=0x7f0a7121c070) at 10.0/sql/sql_base.cc:696
#10 0x000000000060faf2 in mark_temp_tables_as_free_for_reuse (thd=0x7f0a6bd11070) at 10.0/sql/sql_base.cc:663
#11 0x0000000000610224 in close_thread_tables (thd=0x7f0a6bd11070) at 10.0/sql/sql_base.cc:943
#12 0x000000000077d1e1 in rpl_group_info::slave_close_thread_tables (this=0x7f0a6bcde870, thd=0x7f0a6bd11070) at 10.0/sql/rpl_rli.cc:1699
#13 0x000000000077ced0 in rpl_group_info::cleanup_context (this=0x7f0a6bcde870, thd=0x7f0a6bd11070, error=true) at 10.0/sql/rpl_rli.cc:1611
#14 0x00000000005e3381 in handle_slave_sql (arg=0x7f0a7c3ad070) at 10.0/sql/slave.cc:4594
#15 0x0000000000a2da58 in pfs_spawn_thread (arg=0x7f0a72887ef0) at 10.0/storage/perfschema/pfs.cc:1853
#16 0x00007f0a9c04eb50 in start_thread (arg=<optimized out>) at pthread_create.c:304
#17 0x00007f0a9a559a7d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112

Stack trace from:

revision-id: monty@mariadb.org-20140313144311-hmphz2gup3qdjlp9
date: 2014-03-13 16:43:11 +0200
build-date: 2014-03-14 19:43:28 +0400
revno: 4051
branch-nick: 10.0

Server where the binary log was created

revision-id: svoj@mariadb.org-20140307074306-blf3begv8ee0tb5j
date: 2014-03-07 11:43:06 +0400
build-date: 2014-03-14 17:43:49 +0200
revno: 4045
branch-nick: 10.0



 Comments   
Comment by Elena Stepanova [ 2014-03-17 ]

I modified the command line in description, it was mistakenly copied from the slave startup. All options are the same, and the disclaimer about useless slave options applies, only server_id, port and datadir location were different – those I fixed to avoid confusion.

Comment by Kristian Nielsen [ 2014-03-18 ]

I was able to reproduce with the supplied binlog file in a mysql-test-run test case

Comment by Sergey Vojtovich [ 2014-03-18 ]

Very likely has something to do with my table cache patches.

Comment by Sergey Vojtovich [ 2014-03-18 ]

OTOH it looks like temporary tables are in question. These don't go through table cache and must have table->in_use set to 0. Probably it is uninitialized?

Comment by Sergey Vojtovich [ 2014-03-18 ]

Hmm... not really they should have proper in_use, but anyway crash doesn't seem to be relevant to table cache because it is temporary table.

Comment by Kristian Nielsen [ 2014-03-19 ]

There are multiple bugs visible from this report (thanks, btw!).

The lock wait timeout is a rather serious, and complex, issue with InnoDB row
locks. Below a small test case that reproduces it (not deterministic, may need
to be run a couple of times with --repeat to trigger the issue).

The problem is an INSERT and a DELETE on the same table. The DELETE blocks the
INSERT until commit, but the INSERT does not similarly block the DELETE. This
means that the INSERT can run first and group commit with the DELETE on the
master, but then on the slave, if the DELETE runs ahead, it will block the
INSERT, and the INSERT in turn will block the DELETE due to commit
ordering. The result is a deadlock.

--source include/have_innodb.inc
--source include/have_debug_sync.inc
--source include/have_debug_sync.inc
--source include/master-slave.inc

--connection slave
--source include/stop_slave.inc

--connection master
CREATE TABLE t1 (a INT PRIMARY KEY, b INT, KEY b_idx(b)) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1,NULL), (2,2), (3,NULL), (4,4), (5, NULL), (6, 6);

  1. Create a group commit with INSERT and DELETE, in that order.
    --connect (con1,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
    SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued1 WAIT_FOR master_cont1';
    send INSERT INTO t1 VALUES (7, NULL);
    --connection master
    SET debug_sync='now WAIT_FOR master_queued1';

--connect (con2,127.0.0.1,root,,test,$SERVER_MYPORT_1,)
SET debug_sync='commit_after_release_LOCK_prepare_ordered SIGNAL master_queued2';
send DELETE FROM t1 WHERE b <= 3;

--connection master
SET debug_sync='now WAIT_FOR master_queued2';
SET debug_sync='now SIGNAL master_cont1';

--connection con1
REAP;
--connection con2
REAP;
SET debug_sync='RESET';
--save_master_pos

--connection slave
SET @old_parallel= @@GLOBAL.slave_parallel_threads;
SET GLOBAL slave_parallel_threads=16;
--source include/start_slave.inc
--sync_with_master

SELECT * FROM t1 ORDER BY a;

--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=@old_parallel;
--source include/start_slave.inc

--connection master
DROP TABLE t1;
--source include/rpl_end.inc

Comment by Kristian Nielsen [ 2014-03-21 ]

Ok, so I will close this as fixed.

There were several important issues found by this test:

1. A temporary table bug seen during shutdown (the assertion). Monty fixed
this in an earlier push to 10.0.

2. The MDEV-5921. This was reported also by nanyi607rao on
maria-developers@lists.launchpad.net, I have pushed a fix for this.

3. The MDEV-5914. This bug is still open, however I have pushed to 10.0 a
temporary fix that should be sufficient to make this bug disappear.

Please re-run the test to check that it is really fixed. I did at some point
see another error in the logs about temporary tables when I applied the
attached binlog. I cannot repeat it now, so it may have been another symptom
of the problem (1), but it would be good to get verification. I also did not
verify that the differences in rows between tables are gone, though fixing (2)
and (3) should fix it, hopefully.

This was a very good test, really important to get parallel replication ready
for end-users.

Generated at Thu Feb 08 07:07:37 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.