Analysis:
=========
During shutdown, InnoDB calls ibuf_merge_or_delete_for_page() for the problematic page (0, 739). But the desired bit for a given
page in the bitmap page is already set to IBUF_BITMAP_FREE. So we fail to remove the entry (0, 739) from change buffer index. So
jumped to understand ibuf_delete_recs() where the problematic page (0, 739) was deleted. During that, InnoDB change
buffer index has only root, leaf pages and there are no internal nodes.
mtr_t mtr;
loop:
btr_pcur_t pcur;
pcur.btr_cur.page_cur.index= ibuf.index;
ibuf_mtr_start(&mtr);
if (btr_pcur_open(&tuple, PAGE_CUR_GE, BTR_MODIFY_LEAF, &pcur, &mtr))
goto func_exit;
if (!btr_pcur_is_on_user_rec(&pcur))
{
ut_ad(btr_pcur_is_after_last_on_page(&pcur));
goto func_exit;
}
for (;;)
{
ut_ad(btr_pcur_is_on_user_rec(&pcur));
const rec_t* ibuf_rec = btr_pcur_get_rec(&pcur);
if (ibuf_rec_get_space(&mtr, ibuf_rec) != page_id.space()
if (ibuf_delete_rec(page_id, &pcur, &tuple, &mtr))
{
/* Deletion was pessimistic and mtr was committed:
we start from the beginning again */
ut_ad(mtr.has_committed());
goto loop;
}
if (btr_pcur_is_after_last_on_page(&pcur))
{
ibuf_mtr_commit(&mtr);
btr_pcur_close(&pcur);
goto loop;
}
btr_cur_open() searches with tuple only page_id (0, 739). Root page has this following (..(0, 563)(child 60), (0, 739)(child 63)..) since the mode is
PAGE_CUR_L for non-leaf node. It leads to child page 60 and deletes the record of (0, 739). once we reached the end of the page, again we do
open the change buffer index and we end up in page 60. Fail to find the record on page 60. ibuf_delete_recs() fails to delete the entries completely.
Even page 62 which is next to child page 60 has the record (0, 739)
Since change buffer index is in 5.5+ format, primary key for the index is
{space, 0, page_no, counter}
. But we fail to use the counter field for searching the tuple.
Thanks to vlad.lesin for helping me in analysing this issue.
Attachments
Issue Links
relates to
MDEV-33699write statements gets stuck during IO-bound insert benchmark with InnoDB
> Since change buffer index is in 5.5+ format, primary key for the index is
{space, 0, page_no, counter}
I am not sure about this, because when we get offsets of the record on root page, there are 7 fields. The first 4 fields is {space, 0, page_no, counter}
records on page N, and then, when we open cursor again, there will be the same problem as we have now, i.e. the cursor will point to the supremum of page N, and we will miss records on page N+1.
Vladislav Lesin
added a comment - > Since change buffer index is in 5.5+ format, primary key for the index is
{space, 0, page_no, counter}
I am not sure about this, because when we get offsets of the record on root page, there are 7 fields. The first 4 fields is {space, 0, page_no, counter}
, the last field is a child page id. There are 2 more fields in the key. And, if so, https://github.com/MariaDB/server/commit/015ab499696382b0e3b8d70118beefafd328a779 commit just shifts the problem from "counter" field to the next to the "counter" one.
I.e. what if we have
{a, 0, b, count, c, ...}
record on leaf page N and
{a, 0, b, count, c+1, ...}
record on page N+1. And we use
{a, 0, b, count}
search tuple. We will delete all
{a, 0, b, count, ...}
records on page N, and then, when we open cursor again, there will be the same problem as we have now, i.e. the cursor will point to the supremum of page N, and we will miss records on page N+1.
Change buffer records may be in the pre-MySQL 5.5 format in case innodb_change_buffering=inserts is being used. We must not assume anything about the format.
I think that we should perform a forward scan and first delete-mark all records in the same mini-transaction that modifies the change buffer bitmap. Then, start another mini-transaction that will lock the entire change buffer while doing a forward scan and "pessimistic" delete of all matching change buffer records.
Marko Mäkelä
added a comment - Change buffer records may be in the pre-MySQL 5.5 format in case innodb_change_buffering=inserts is being used. We must not assume anything about the format.
I think that we should perform a forward scan and first delete-mark all records in the same mini-transaction that modifies the change buffer bitmap. Then, start another mini-transaction that will lock the entire change buffer while doing a forward scan and "pessimistic" delete of all matching change buffer records.
I could analyze the issue from the RR trace. The internal cursor routines and modes (PAGE_CUR_GE) are behaving well and as per design. The IBUF B-tree state is also correct. Also, verified that there is no issue with the format or key comparison.
The reason why we are not able to delete all the records is in the implementation of ibuf_delete_recs function. When the cursor is positioned in supremum record in previous page, which is required for correct operation of GE and LT mode, we are exiting from the function. Landing in previous non-user record is not any general issue and the correct way is to proceed further to the next leaf page to get back all the records. The easiest way is to use the btr_pcur_open_on_user_rec() interface which takes care of positioning the cursor in first available user record.
It is somewhat hard to repeat the issue as not all paths use ibuf_delete_recs and we need to simulate the case where IBUF records for that page spans across multiple leaf pages of the IBUF tree. I could repeat the issue with 4k page size and some code tweak. I am attaching the code modifications and the test here. This test can be used to understand the issue better and I don't intend to add it to our mtr test suite as it could be hard to maintain it over time.
1. Code tweak to force ibuf_delete_recs while allocating page after deleting and re-creating index.
# MDEV 32489: Change buffer index fails to delete the records
#
-- source include/have_innodb.inc
SET GLOBAL innodb_change_buffering = none;
SET GLOBAL innodb_file_per_table = OFF;
let $loop=10;
while ($loop)
{
eval CREATE TABLE t$loop(c1 INT AUTO_INCREMENT PRIMARY KEY,c2 CHAR(100))ENGINE=InnoDB;
eval CREATE INDEX i1 ON t$loop (c2);
eval INSERT INTO t$loop(c2) VALUES('mariadb');
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop LIMIT 4;
dec $loop;
}
# -- vertical_results
# select BLOCK_ID, SPACE, PAGE_NUMBER, PAGE_TYPE , TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
SET GLOBAL innodb_change_buffering = all;
--echo # Force change buffering when possible in debug build.
--disable_query_log
--error 0,ER_UNKNOWN_SYSTEM_VARIABLE
SET GLOBAL innodb_change_buffering_debug = 1;
--enable_query_log
let $loop=10;
eval INSERT INTO t10(c2) SELECT c2 FROM t8 IGNORE INDEX (i1) LIMIT 1;
dec $loop;
while ($loop)
{
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop IGNORE INDEX (i1) LIMIT 4;
dec $loop;
}
# select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
# select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;
--disable_query_log
--error 0,ER_UNKNOWN_SYSTEM_VARIABLE
SET GLOBAL innodb_change_buffering_debug = default;
--enable_query_log
SET GLOBAL innodb_change_buffering = default;
DROP INDEX i1 ON t7;
CREATE INDEX i1 ON t7 (c2);
# select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
# select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;
Debarun Banerjee
added a comment - I could analyze the issue from the RR trace. The internal cursor routines and modes (PAGE_CUR_GE) are behaving well and as per design. The IBUF B-tree state is also correct. Also, verified that there is no issue with the format or key comparison.
The reason why we are not able to delete all the records is in the implementation of ibuf_delete_recs function. When the cursor is positioned in supremum record in previous page, which is required for correct operation of GE and LT mode, we are exiting from the function. Landing in previous non-user record is not any general issue and the correct way is to proceed further to the next leaf page to get back all the records. The easiest way is to use the btr_pcur_open_on_user_rec() interface which takes care of positioning the cursor in first available user record.
It is somewhat hard to repeat the issue as not all paths use ibuf_delete_recs and we need to simulate the case where IBUF records for that page spans across multiple leaf pages of the IBUF tree. I could repeat the issue with 4k page size and some code tweak. I am attaching the code modifications and the test here. This test can be used to understand the issue better and I don't intend to add it to our mtr test suite as it could be hard to maintain it over time.
1. Code tweak to force ibuf_delete_recs while allocating page after deleting and re-creating index.
diff --git a/storage/innobase/fsp/fsp0fsp.cc b/storage/innobase/fsp/fsp0fsp.cc
index c5d04a19662..2d18edd2211 100644
--- a/storage/innobase/fsp/fsp0fsp.cc
+++ b/storage/innobase/fsp/fsp0fsp.cc
@@ -1215,8 +1215,9 @@ fsp_alloc_free_page(
}
}
+ auto ret_block = fsp_page_create(space, page_no, init_mtr);
fsp_alloc_from_free_frag(block, xdes, descr, free, mtr);
- return fsp_page_create(space, page_no, init_mtr);
+ return ret_block;
}
2. mtr Test: mysql-test/suite/innodb/t/change_buffer_delete.test
#
# MDEV 32489: Change buffer index fails to delete the records
#
-- source include/have_innodb.inc
SET GLOBAL innodb_change_buffering = none;
SET GLOBAL innodb_file_per_table = OFF;
let $loop=10;
while ($loop)
{
eval CREATE TABLE t$loop(c1 INT AUTO_INCREMENT PRIMARY KEY,c2 CHAR(100))ENGINE=InnoDB;
eval CREATE INDEX i1 ON t$loop (c2);
eval INSERT INTO t$loop(c2) VALUES('mariadb');
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop;
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop LIMIT 4;
dec $loop;
}
# -- vertical_results
# select BLOCK_ID, SPACE, PAGE_NUMBER, PAGE_TYPE , TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
SET GLOBAL innodb_change_buffering = all;
--echo # Force change buffering when possible in debug build.
--disable_query_log
--error 0,ER_UNKNOWN_SYSTEM_VARIABLE
SET GLOBAL innodb_change_buffering_debug = 1;
--enable_query_log
let $loop=10;
eval INSERT INTO t10(c2) SELECT c2 FROM t8 IGNORE INDEX (i1) LIMIT 1;
dec $loop;
while ($loop)
{
eval INSERT INTO t$loop(c2) SELECT c2 FROM t$loop IGNORE INDEX (i1) LIMIT 4;
dec $loop;
}
# select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
# select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;
--disable_query_log
--error 0,ER_UNKNOWN_SYSTEM_VARIABLE
SET GLOBAL innodb_change_buffering_debug = default;
--enable_query_log
SET GLOBAL innodb_change_buffering = default;
DROP INDEX i1 ON t7;
CREATE INDEX i1 ON t7 (c2);
# select PAGE_NUMBER, TABLE_NAME, INDEX_NAME, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where TABLE_NAME like '%t1%' AND INDEX_NAME = "i1";
# select PAGE_NUMBER, NUMBER_RECORDS, DATA_SIZE from INFORMATION_SCHEMA.INNODB_BUFFER_PAGE where PAGE_TYPE LIKE 'IBUF_INDEX%' AND SPACE=0;
SET GLOBAL innodb_fast_shutdown=0;
--source include/restart_mysqld.inc
let $loop=10;
while ($loop)
{
eval DROP TABLE t$loop;
dec $loop;
}
SET GLOBAL innodb_file_per_table = default;
3. Test run with 4k page size
./mtr --mem --mysqld=--innodb_page_size=4k innodb.change_buffer_delete
> Since change buffer index is in 5.5+ format, primary key for the index is
{space, 0, page_no, counter}I am not sure about this, because when we get offsets of the record on root page, there are 7 fields. The first 4 fields is {space, 0, page_no, counter}
, the last field is a child page id. There are 2 more fields in the key. And, if so, https://github.com/MariaDB/server/commit/015ab499696382b0e3b8d70118beefafd328a779 commit just shifts the problem from "counter" field to the next to the "counter" one.
I.e. what if we have
{a, 0, b, count, c, ...}record on leaf page N and
{a, 0, b, count, c+1, ...}record on page N+1. And we use
{a, 0, b, count}search tuple. We will delete all
{a, 0, b, count, ...}records on page N, and then, when we open cursor again, there will be the same problem as we have now, i.e. the cursor will point to the supremum of page N, and we will miss records on page N+1.