[MDEV-24813] Locking full table scan fails to use table-level locking Created: 2021-02-09  Updated: 2024-01-31

Status: Confirmed
Project: MariaDB Server
Component/s: Locking, Optimizer, Storage Engine - InnoDB
Affects Version/s: 10.5, 10.11
Fix Version/s: 10.5

Type: Bug Priority: Critical
Reporter: Marko Mäkelä Assignee: Yuchen Pei
Resolution: Unresolved Votes: 2
Labels: optimizer-feature, performance

Issue Links:
Relates
relates to MDEV-10719 'create temporary table as select' ge... Open
relates to MDEV-17512 Deadlock after locking all rows in table Open
relates to MDEV-27992 DELETE fails to delete record after b... Closed
relates to MDEV-14479 Do not acquire InnoDB record locks wh... Closed
relates to MDEV-16232 Use fewer mini-transactions Stalled
relates to MDEV-17918 InnoDB sometimes prints incorrect buf... Open
relates to MDEV-20612 Improve InnoDB lock_sys scalability Closed
relates to MDEV-28800 SIGABRT due to running out of memory ... Stalled
relates to MDEV-28804 Increased lock objects in 10.6+ leadi... Stalled
relates to MDEV-29565 Inconsistent read and write, which us... Confirmed
relates to MDEV-30368 Locking behaviour is missing in docum... Open

 Description   

The following test case shows that table locks are not being passed to the storage engine in any of the locking-read cases.

--source include/have_innodb.inc
--source include/have_sequence.inc
--source include/not_debug.inc
 
SET @save_freq=@@GLOBAL.innodb_purge_rseg_truncate_frequency;
SET GLOBAL innodb_purge_rseg_truncate_frequency=1;
 
CREATE TABLE t (a INT PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t(a) SELECT seq FROM seq_0_to_1000000;
 
--source include/wait_all_purged.inc
SET GLOBAL innodb_purge_rseg_truncate_frequency=@save_freq;
 
SET profiling_history_size=100;
SET profiling = 1;
 
START TRANSACTION;
LOCK TABLE t WRITE;
SELECT COUNT(*) FROM t FOR UPDATE;
UNLOCK TABLES;
COMMIT;
SELECT COUNT(*) FROM t FOR UPDATE;
 
START TRANSACTION;
LOCK TABLE t READ;
SELECT COUNT(*) FROM t LOCK IN SHARE MODE;
UNLOCK TABLES;
COMMIT;
SELECT COUNT(*) FROM t LOCK IN SHARE MODE;
 
SHOW PROFILES;
SET profiling = 0;
DROP TABLE t;

I would have expected an explicit LOCK TABLE to be passed down to the storage engine, as seems to have been the case during the development of MDEV-14479.

I would also expect the optimizer to automatically pass information to the storage engine so that the table can be locked up front when we know that the entire table will be scanned in a locking operation. This should cover the following statements:

  • CREATE…SELECT
  • INSERT…SELECT
  • SELECT…LOCK IN SHARE MODE
  • SELECT…FOR UPDATE
  • SELECT executed at TRANSACTION ISOLATION LEVEL SERIALIZABLE
  • UPDATE
  • DELETE

If no WHERE or JOIN condition applies to the table and no LIMIT is present, then we should expect the operation to scan the entire table, and the storage engine should be requested to lock the entire table.
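A minimal sketch of that decision could look as follows. All names here are illustrative, not the actual MariaDB API:

```cpp
#include <cassert>

// Hypothetical sketch: the SQL layer could ask the storage engine for a
// table-level lock when a locking statement is known to visit every row.
struct scan_plan
{
  bool locking_read;    // SELECT ... FOR UPDATE, UPDATE, DELETE, etc.
  bool has_where_cond;  // a WHERE condition applies to this table
  bool has_join_cond;   // a JOIN condition applies to this table
  bool has_limit;       // a LIMIT clause may stop the scan early
};

// Request one table lock instead of a lock object per visited record.
inline bool prefer_table_lock(const scan_plan &p)
{
  return p.locking_read &&
         !p.has_where_cond && !p.has_join_cond && !p.has_limit;
}
```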

MDEV-14479 fixed the InnoDB record locking in such a way that when the table has already been locked in the corresponding mode, no individual record locks will be allocated. If we set a breakpoint in lock_rec_create_low() after ha_innobase::create() has finished in the above test case, we will observe the following:

10.5 ffc5d064895cadbc42711efd7dbb6ae1b323f050

Thread 15 "mariadbd" hit Breakpoint 1, lock_rec_create_low (c_lock=0x0, 
    thr=0x0, type_mode=3, page_id={m_id = 21474836484}, 
    page=0x7ffff0b70000 "", heap_no=2, index=0x7fffbc01af40, 
    trx=0x7ffff17260d8, holds_trx_mutex=false)
    at /mariadb/10.5m/storage/innobase/lock/lock0lock.cc:1294
(gdb) p index.table.locks
$7 = {count = 1, start = 0x7ffff1726cb0, end = 0x7ffff1726cb0, 
  node = &lock_table_t::locks}
(gdb) p *index.table.locks.start
$8 = {trx = 0x7ffff17260d8, trx_locks = {prev = 0x0, next = 0x0}, 
  index = 0x0, hash = 0x0, requested_time = 0, wait_time = 0, un_member = {
    tab_lock = {table = 0x7fffbc019e10, locks = {prev = 0x0, next = 0x0}}, 
    rec_lock = {page_id = {m_id = 140736347610640}, n_bits = 0}}, 
  type_mode = 17}

Note: type_mode == 17 == LOCK_TABLE | LOCK_IX. For the LOCK IN SHARE MODE, it would be 16 == LOCK_TABLE | LOCK_IS. We would expect the table lock to exist in LOCK_X or LOCK_S mode, so that no locks will have to be allocated for each individual visited row.
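The bit arithmetic can be checked with a few lines; the constant values below match the InnoDB source (low nibble = lock mode, higher bits = flags such as LOCK_TABLE):

```cpp
#include <cassert>

// InnoDB encodes a lock's type_mode as a bit field: the low nibble holds
// the lock mode, higher bits hold flags such as LOCK_TABLE.
enum lock_mode { LOCK_IS = 0, LOCK_IX = 1, LOCK_S = 2, LOCK_X = 3 };
constexpr unsigned LOCK_MODE_MASK = 0xF;
constexpr unsigned LOCK_TABLE = 16;

// Extract the lock mode from a type_mode value.
constexpr lock_mode get_mode(unsigned type_mode)
{
  return lock_mode(type_mode & LOCK_MODE_MASK);
}
```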

If we change the CREATE TABLE to CREATE TEMPORARY TABLE, we can get an accurate estimate of how fast the execution would be without the row-level locking.

On my system, with locking, each full table scan takes between 0.49 and 0.66 seconds (the first run being the slowest). This can probably be attributed to dynamic memory allocation.

With a temporary table, each locking full table scan takes between 0.28 and 0.30 seconds. The memory operations related to row-level locking roughly double the execution time (and seriously increase the memory footprint)!



 Comments   
Comment by Sergei Petrunia [ 2021-02-10 ]

I think this should be implemented in steps. How about defining the first step as follows:

If
1. this is a DML statement (which does locking reads in InnoDB)
2. and the query is doing a full table scan without row filtering
then:
3. Use a table-level lock.

InnoDB already knows whether #1 is true.
The only open question is #2.

#2 can be implemented as follows:
before (or right after) the handler->rnd_init(scan=false) call, the SQL layer makes an extra call to inform the InnoDB storage engine which condition is attached to the table.

(TODO: check if this can be unified with Table Condition Pushdown).

Comment by Sergei Petrunia [ 2021-02-10 ]

An implementation suggestion is below; it still needs to be thought through.

== Use Table Condition Pushdown for WHERE ==

There is already a handler::cond_push() call which informs the storage engine
about the condition. InnoDB could implement it and thereby see whether
a condition is present.

Possible issues:
Issue 1:
setting optimizer_switch='table_condition_pushdown=off' would cause
table-level locks to be acquired! Is this OK?
It is common practice among support staff and others to disable optimizer
switch flags to see what happens...

Issue 2:
InnoDB will take note of whether there is a pushed condition, but it
will not check the condition.

That is, we'll have:

  COND *ha_innobase::cond_push(COND *arg) {
     ...
     return arg;
  }

What should be in handler::pushed_cond, then? The code checks this member and interprets NULL as "nothing is pushed". Is this OK? If it is not, should we put an "Item(1)" there?
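A mock of that approach (stand-in types; the real handler API differs) would only record whether something was pushed, without consuming the condition:

```cpp
#include <cassert>

struct COND {};  // stand-in for the SQL-layer condition tree

// Mock of the suggested ha_innobase::cond_push(): the engine does not
// evaluate the condition, it only notes that one exists.
struct mock_innobase_handler
{
  bool cond_was_pushed = false;

  COND *cond_push(COND *cond)
  {
    cond_was_pushed = (cond != nullptr);
    return cond;  // "not consumed": the SQL layer keeps filtering rows
  }
};
```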

== Introduce push_limit ==

To inform the storage engine about the LIMIT clause, we introduce

void handler::push_limit(ha_rows soft_limit)

which serves as a hint to the storage engine that the SQL layer is not going to read more than soft_limit rows in a scan.

The idea is that this is generally useful for storage engines that do batching - they should not batch much more than soft_limit rows.
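A sketch of how a batching engine might use such a hint (illustrative names; push_limit() does not exist in the handler API today):

```cpp
#include <algorithm>
#include <cassert>

typedef unsigned long long ha_rows;

// Hypothetical batching storage engine that caps its prefetch batch at
// the soft limit hinted by the SQL layer via push_limit().
struct batching_engine
{
  ha_rows soft_limit = ~0ULL;  // default: no limit known

  void push_limit(ha_rows limit) { soft_limit = limit; }

  // Never batch much more than the rows the SQL layer will actually read.
  ha_rows batch_size(ha_rows preferred) const
  {
    return std::min(preferred, soft_limit);
  }
};
```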

Comment by Roel Van de Paar [ 2022-08-30 ]

Marko informed me that this is the same bug:

CREATE TABLE IF NOT EXISTS v0 ( v1 VARCHAR ( 82 ) ) ;
INSERT INTO v0 ( v1 ) VALUES ( 25 ) , ( ( SELECT -1 WHERE ( v1 NOT IN ( v1 NOT IN ( SELECT v1 ) ) AND v1 NOT IN ( 80 ) ) ) ) , ( 89215105.000000 ) , ( 0 ) , ( 2147483647 ) , ( 127 ) , ( 74 ) , ( -128 ) ;
INSERT INTO v0 ( v1 ) SELECT v1 FROM ( SELECT 'x' FROM v0 AS v10 , v0 AS v9 NATURAL JOIN v0 AS v8 , v0 ) AS v2 , ( SELECT 40561621.000000 FROM v0 AS v7 , v0 AS v6 , v0 AS v5 JOIN v0 ) AS v3 JOIN v0 AS v4 NATURAL JOIN v0 ORDER BY v1 ;

Leads to:

10.11.0-opt>CREATE TABLE IF NOT EXISTS v0 ( v1 VARCHAR ( 82 ) ) ;
Query OK, 0 rows affected (0.015 sec)
 
10.11.0-opt> INSERT INTO v0 ( v1 ) VALUES ( 25 ) , ( ( SELECT -1 WHERE ( v1 NOT IN ( v1 NOT IN ( SELECT v1 ) ) AND v1 NOT IN ( 80 ) ) ) ) , ( 89215105.000000 ) , ( 0 ) , ( 2147483647 ) , ( 127 ) , ( 74 ) , ( -128 ) ;
Query OK, 8 rows affected (0.002 sec)
Records: 8  Duplicates: 0  Warnings: 0
 
10.11.0-opt> INSERT INTO v0 ( v1 ) SELECT v1 FROM ( SELECT 'x' FROM v0 AS v10 , v0 AS v9 NATURAL JOIN v0 AS v8 , v0 ) AS v2 , ( SELECT 40561621.000000 FROM v0 AS v7 , v0 AS v6 , v0 AS v5 JOIN v0 ) AS v3 JOIN v0 AS v4 NATURAL JOIN v0 ORDER BY v1 ;
ERROR 1206 (HY000): The total number of locks exceeds the lock table size

And

10.11.0 bc563f1a4b0b38de3b41fd0f0d3d8b7f1aacbd8b (Optimized)

2022-08-30  8:43:42 0 [Note] /test/MD190822-mariadb-10.11.0-linux-x86_64-opt/bin/mysqld: ready for connections.
Version: '10.11.0-MariaDB'  socket: '/test/MD190822-mariadb-10.11.0-linux-x86_64-opt/socket.sock'  port: 10942  MariaDB Server
2022-08-30  8:44:35 4 [Warning] InnoDB: Over 67 percent of the buffer pool is occupied by lock heaps or the adaptive hash index! Check that your transactions do not set too many row locks. innodb_buffer_pool_size=126M. Starting the InnoDB Monitor to print diagnostics.
 
=====================================
2022-08-30 08:44:35 0x1466fd360700 INNODB MONITOR OUTPUT
...

If it would help, I can reduce this testcase further.

Comment by Sergei Golubchik [ 2023-08-31 ]

Maybe InnoDB should automatically upgrade row locks to a table lock when there are too many of them?
This is fairly standard behavior among database engines.
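A hypothetical sketch of such row-to-table lock escalation (InnoDB does not currently do this; the threshold is tiny here for illustration, a real one would be in the thousands and likely configurable):

```cpp
#include <cassert>

// Once a transaction holds more than a threshold of row locks on one
// table, replace them with a single table lock.
struct table_locker
{
  static constexpr unsigned ESCALATION_THRESHOLD = 3;  // tiny, for demo
  unsigned row_locks = 0;
  bool table_locked = false;

  void lock_row()
  {
    if (table_locked)
      return;  // the table lock already covers every row
    if (++row_locks > ESCALATION_THRESHOLD)
    {
      row_locks = 0;        // drop the individual row locks...
      table_locked = true;  // ...and hold a single table lock instead
    }
  }
};
```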

Generated at Thu Feb 08 09:32:52 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.