[MDEV-9027] [PATCH] A customer reported strange locking behavior with TokuDB Created: 2015-10-27  Updated: 2022-09-08

Status: Confirmed
Project: MariaDB Server
Component/s: Storage Engine - TokuDB
Affects Version/s: 10.0.22, 10.1.8
Fix Version/s: 10.1

Type: Bug Priority: Minor
Reporter: Jonathan Day (Inactive) Assignee: Sergei Golubchik
Resolution: Unresolved Votes: 2
Labels: tokudb, upstream
Environment:

Issue related to TokuDB


Attachments: File insert-select-concurrent.test    
Issue Links:
Blocks
is blocked by MDEV-9252 10.0.23 merge Closed
is blocked by MDEV-9537 10.0.24 merge Closed
is blocked by MDEV-10224 10.0.26 merge Closed
Sprint: 10.0.23, 10.0.24, 10.0.26

 Description   

Customer reports the following:

"We have uncovered a locking issue that doesn't "look right" to me. Here's how to reproduce it (I can send you a VM with all this setup if needed).

1) Create and populate the sysbench sbtest1 table as a TokuDB table.
2) Run a benchmark (any kind is fine...even a single-threaded one will do).
3) Issue "create temporary table temp1 as select id from sbtest1;" or "insert into temp1 select id from sbtest1;" in a different session..
4) Check the processlist in a 3rd session...you'll see that sbtest1 is locked from any insert/update/delete activity while the "select id from sbtest1" in step #3 runs.

This only happens if log-bin is set to a value (ie. binlogging is turned on).

This behavior happens whether the default transaction isolation is read-comitted, read-uncommitted, repeatable-read, or serializable....it should only happen with serializable. This behavior does NOT happen if the selected table is InnoDB, only if it's TokuDB. And again, this only happens if binlogging is on (which it must be for replication and disaster recovery)...if we turn binlogging off, the locks don't appear."

The customer has a VM that duplicates issue if that is of help. I can be point of contact with them as required. - Jon



 Comments   
Comment by SalsaLabs - Dave Sisk [ 2015-10-28 ]

For a high-volume web-site, this is a crippling bug. One long-running batch job or long-running report, and the web-site screeches to a halt for all customers. Not good. The locking behavior should reflect the transaction isolation level, and work consistently regardless of whether binlogging is enabled or not, or the table storage engine is InnoDB or TokuDB.

Comment by Elena Stepanova [ 2015-10-29 ]

jonathan.day@mariadb.com,

Which binlog format does the customer use, and you used while reproducing?
Is it reproducible for you with row format?

Comment by Elena Stepanova [ 2015-10-29 ]

Reproducible easily with a crude MTR test below.
The test does the following:

  • create a TokuDB table;
  • populate it with some data;
  • start inserting into it one row at a time;
  • in a different connection, run create temporaray table as select from the tokudb table;
  • in a third connection, monitor the processlist

=> with mixed (or statement) binlog format, the processlist shows that for the whole duration of CREATE, INSERTs are stuck in 'Waiting for table level lock', they are not moving forward at all; as soon as CREATE finishes, INSERTs start progressing.

show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',1))	0.000
5	root	localhost	test	Query	0	creating table	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	2	Waiting for table level lock	insert into t1 values (null,  NAME_CONST('i',2))	0.000
5	root	localhost	test	Query	2	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
 
...
 
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	18	Waiting for table level lock	insert into t1 values (null,  NAME_CONST('i',2))	0.000
5	root	localhost	test	Query	18	Queried about 20000 rows	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',12))	0.000
5	root	localhost	test	Query	0	init	show processlist	0.000

With row format, the situation is a bit different. INSERTs stop for a little while, and then fail due to lock timeout.

show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',1))	0.000
5	root	localhost	test	Query	0	creating table	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	1	update	insert into t1 values (null,  NAME_CONST('i',6))	0.000
5	root	localhost	test	Query	2	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	3	update	insert into t1 values (null,  NAME_CONST('i',6))	0.000
5	root	localhost	test	Query	4	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Sleep	1		NULL	0.000
5	root	localhost	test	Query	6	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Sleep	3		NULL	0.000
5	root	localhost	test	Query	8	Queried about 10000 rows	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
 
...
 
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Sleep	13		NULL	0.000
5	root	localhost	test	Query	18	Queried about 20000 rows	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Sleep	15		NULL	0.000
5	root	localhost	test	Query	0	init	show processlist	0.000

If we increase the lock timeout (tokudb_lock_timeout), we'll see that INSERTs also don't move here, just wait in a different state (apparently on a different lock):

Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	1	query end	insert into t1 values (null,  NAME_CONST('i',1))	0.000
5	root	localhost	test	Query	0	creating table	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	2	update	insert into t1 values (null,  NAME_CONST('i',5))	0.000
5	root	localhost	test	Query	2	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
 
...
 
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	18	update	insert into t1 values (null,  NAME_CONST('i',5))	0.000
5	root	localhost	test	Query	18	Queried about 20000 rows	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	1	query end	insert into t1 values (null,  NAME_CONST('i',28))	0.000
5	root	localhost	test	Query	0	init	show processlist	0.000

With InnoDB, the picture is essentially different. Regardless the binlog format, INSERTs keep progressing over the whole duration of CREATE, although slowly:

Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',1))	0.000
5	root	localhost	test	Query	0	creating table	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',26))	0.000
5	root	localhost	test	Query	2	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	1	query end	insert into t1 values (null,  NAME_CONST('i',49))	0.000
5	root	localhost	test	Query	4	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	update	insert into t1 values (null,  NAME_CONST('i',60))	0.000
5	root	localhost	test	Query	6	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	update	insert into t1 values (null,  NAME_CONST('i',78))	0.000
5	root	localhost	test	Query	8	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	update	insert into t1 values (null,  NAME_CONST('i',98))	0.000
5	root	localhost	test	Query	10	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',110))	0.000
5	root	localhost	test	Query	12	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	1	update	insert into t1 values (null,  NAME_CONST('i',121))	0.000
5	root	localhost	test	Query	14	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',125))	0.000
5	root	localhost	test	Query	16	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',150))	0.000
5	root	localhost	test	Query	18	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',170))	0.000
5	root	localhost	test	Query	20	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',179))	0.000
5	root	localhost	test	Query	22	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',201))	0.000
5	root	localhost	test	Query	24	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	update	insert into t1 values (null,  NAME_CONST('i',236))	0.000
5	root	localhost	test	Query	26	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',256))	0.000
5	root	localhost	test	Query	28	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',271))	0.000
5	root	localhost	test	Query	30	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Query	0	query end	insert into t1 values (null,  NAME_CONST('i',292))	0.000
5	root	localhost	test	Query	32	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Sleep	2		NULL	0.000
5	root	localhost	test	Query	34	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Sleep	4		NULL	0.000
5	root	localhost	test	Query	36	Sending data	create temporary table tmp engine=InnoDB as select * from t1	0.000
6	root	localhost	test	Query	0	init	show processlist	0.000
show processlist;
Id	User	Host	db	Command	Time	State	Info	Progress
4	root	localhost	test	Sleep	6		NULL	0.000
5	root	localhost	test	Query	0	init	show processlist	0.000

Test case

# Attention: run with --mysqld=--plugin-load-add=ha_tokudb, if it's not enabled by default.
# The test case is written for MariaDB server, it might require modifications to be run with other servers
 
--source include/have_log_bin.inc
--source include/have_binlog_format_mixed.inc
--source include/have_innodb.inc
 
--let $initial_row_count= 30000
--let $insert_count= 300
 
create table t1 (pk int auto_increment primary key, i int) engine=TokuDB;
--delimiter ||
create procedure pr(ins_count int)
begin
  declare i int default 1;
  while i <= ins_count
  do
    insert into t1 values (null, i);
    set i = i+1;
  end while;
end ||
--delimiter ;
 
# Pre-populate the table with $initial_row_count rows. Without autocommit it will be faster
set autocommit=0;
eval call pr($initial_row_count);
set autocommit=1;
 
# Start inserting row by row
--send
eval call pr($insert_count);
 
--connect (con1,localhost,root,,)
 
# Wait till inserts start working.
# wait_show_condition does not work due to quote marks, have to do it half-manually
--let $wait_condition= SELECT COUNT(*) FROM information_schema.PROCESSLIST WHERE INFO LIKE 'insert into t1 values%'
--source include/wait_condition.inc
 
# Start CREATE .. AS SELECT
--send create temporary table tmp engine=InnoDB as select * from t1
 
--connect (con2,localhost,root,,)
 
# Monitor processlist 
# pay attention to 'State' -- when we hit the problem, it shows 'Waiting for table level lock',
#           and to 'Info' -- it shows that the inserts do not move, it is the same value that keeps waiting
# Do it until CREATE finishes
 
while (`SELECT COUNT(*) FROM information_schema.PROCESSLIST WHERE INFO LIKE 'create temporary table%'`)
{
  show processlist;
  --sleep 2
}
--disconnect con2
 
--connection con1
--reap
 
# Check that INSERTs started moving
show processlist;
 
--disconnect con1
 
--connection default
--reap
 
# Cleanup
drop table t1;
drop procedure pr;

Comment by Elena Stepanova [ 2015-10-29 ]

Re-filed in Tokutek bugtracker as https://tokutek.atlassian.net/browse/DB-891

Comment by Rich Prohaska [ 2015-11-06 ]

This wiki explains the concurrency model of tokudb for various isolation levels.
https://github.com/percona/tokudb-engine/wiki/Transactions-and-Concurrency

As you can see, the 'create select' statement grabs a range lock on the entire range of the source tables in repeatable read isolation level. In addition, this range is prelocked when the statement starts execution. You can see this with the lock visualization information schema tables for tokudb.
https://github.com/percona/tokudb-engine/wiki/Lock-Visualization-in-TokuDB

In contrast, tokudb will not grab any locks on the source tables when executing a 'create select' statement in read committed isolation level. Perhaps read committed isolation level is acceptable for your application.

Tokudb could be modified to turn off prelocking in repeatable read isolation level for certain queries like the 'create select' above. This could be implemented with a session variable.

Comment by Elena Stepanova [ 2015-11-06 ]

prohaska7,

But as the description says (and my tests also confirmed it), the problem is the same no matter which transaction isolation level is used; e.g. with my test, I'm getting the same lock for the same time even if I run the test with transaction-isolation=read-committed, and even if I also set it explicitly for each session, just in case TokuDB somehow ignored the startup option.

Comment by Rich Prohaska [ 2015-11-06 ]

Hello Elena,
Thanks for writing the test.

This combination of settings allows concurrency: the 'create select' client must use ROW format and read committed or uncomitted ISO. The 'insert' client can be any binlog format and any ISO.

The problem is occurs due to the interaction between MariaDB and tokudb's store_lock method which supplies a lock level for the execution of a SQL statement. For the 'create select' statement, the lock level is TL_READ_NO_INSERT for mixed or statement binlog format. In contrast, the lock level is TL_READ for row binlog format. TL_READ allows concurrent inserts while TL_READ_NO_INSERT does not. This explains the case when the insert is stuck in the 'Waiting for table level lock' state.

When row binlog format is used, tokudb's lock manager comes into play, and the tokudb row locking depends on the transaction ISO. Repeatable read ISO locks the source tokudb tables which prohibits concurrent insertions. Thats why you see tokudb lock timeouts on inserts. Read committed does not lock the source tables, so there are no conflicts for the insert client, so the inserts work.

Comment by Elena Stepanova [ 2015-11-06 ]

prohaska7, thank you, it all makes sense now.
I can confirm that with row binlog format + read-committed isolation level, inserts keep happening steadily.
Do I understand you correctly that it is an intended behavior, and we should not expect any fix/change in this regard?

jonathan.day@mariadb.com, would it be an acceptable workaround for the customer – using row binary logging + read-committed?

Comment by Rich Prohaska [ 2015-11-06 ]

innodb::store_lock has logic that maps TL_READ_NO_INSERT to TL_READ to handle 'create select' and 'insert select' scenarios. tokudb could do something similar, and that would solve the first problem which is getting concurrency for binlog mixed and statement format. the second problem of repeatable read locking is not any easy change.

Comment by SalsaLabs - Dave Sisk [ 2015-11-07 ]

"In contrast, tokudb will not grab any locks on the source tables when executing a 'create select' statement in read committed isolation level. Perhaps read committed isolation level is acceptable for your application."

Just to be clear, TokuDB IS grabbing locks on the source tables when executing a 'create select' or an 'insert select' even with the transaction isolation level set to read-committed or even read-uncommitted. Read-committed would indeed be acceptable for our application...but TokuDB is not obeying the transaction isolation levels as it should.

Comment by Elena Stepanova [ 2015-11-07 ]

SalsaLabs:
Please note further clarification from Rich – it was said about using read-committed isolation level together with row-based format. Only read-committed won't solve the problem, and only row-based format won't help either, but using them simultaneously should (and it is true at least for my small test above).

Comment by Rich Prohaska [ 2015-11-08 ]

This commit 'https://github.com/prohaska7/mariadb-server/commit/4677765e38253e84f102d93e48cdb841372a78ff' allows concurrent writes to source tables in 'create select' and 'insert select' statements for ANY binlog format. Still needs read committed ISO so that tokudb locks are not taken on the source tables. Should I include some license in a pull request for this change?

Comment by Elena Stepanova [ 2015-11-08 ]

prohaska7, thanks a lot!

serg will know what's needed in regard to the license.

The priority is increased as there is a contributed patch.

Comment by SalsaLabs - Dave Sisk [ 2015-11-10 ]

Thanks folks. Will this fix be in the next maintenance release of MariaDB, or do I have to get it some other way? I can wait a while if it'll be in the next release...

Comment by Sergei Golubchik [ 2016-02-17 ]

It will be automatically merged when we get a new TokuDB.

Unfortunately, TokuDB (as taken from the Percona Server source tarball) is currently unusable.
When TokuDB will be fixed, I'll push it into the main tree and this bug will go away.

Comment by Phil Sweeney [ 2016-06-22 ]

serg I can't see this suggested patch in the Percona repository so I don't think a TokuDB merge will help here. I believe this patch is just suggested by Rich for MariaDB.

Comment by Sergei Golubchik [ 2016-06-22 ]

Oh. I see. Thanks. Sorry for confusion.

Comment by Sergei Golubchik [ 2016-06-22 ]

This cannot be fixed with the richprohaska patch.

Generally, If you do INSERT ... SELECT in a statement-based replication mode, you do want to prevent concurrent inserts into the source table. Otherwise a concurrent insert might finish before the long INSERT ... SELECT and it will be written to the binary log before the INSERT ... SELECT. That's why the server uses a strong read lock on the source table — TokuDB is not at fault here, it merely respects the default server behavior and leaves the lock level as is.

You should use RBR if you want INSERT ... SELECT to allow concurrent inserts and also want your slaves to be consistent with the master.

The suggested patch weakens the table lock, concurrent inserts become possible, and that breaks replication.

InnoDB allows concurrent inserts, indeed, but only into the “not yet selected” part of the source table, so INSERT ... SELECT will see the inserted row, as if it was inserted before INSERT ... SELECT has started. InnoDB uses gap locks to prevent concurrent inserts into the “already selected” part of the source table.

TokuDB needs to change its internal locking to be able to use weak read locks for INSERT ... SELECT without breaking the statement based replication.

Comment by Manuel Arostegui [ 2017-04-06 ]

I have also been affected by this bug on a few tables with INSERT SELECT and CREATE SELECT
As soon as we converted the tables to InnoDB the issue stopped.

Generated at Thu Feb 08 07:31:35 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.