ERROR HY000: Lost connection to MySQL server during query
disconnect con1;
connection default;
-FOUND 1 /Wrote log record for ibuf update in place operation/ in my_restart.err
+FOUND 2 /Wrote log record for ibuf update in place operation/ in my_restart.err
CHECK TABLE t1;
Table Op Msg_type Msg_text
test.t1 check status OK
mysqltest: Result content mismatch
- saving '/home/kevg/work/debug/mysql-test/var/log/innodb.innodb-change-buffer-recovery-innodb/' to '/home/kevg/work/debug/mysql-test/var/log/innodb.innodb-change-buffer-recovery-innodb/'
In a debug build with -O2 it sometimes crashes like this:
#9 page_cur_delete_rec (cursor=0x7f40106e8d01, index=<optimized out>, offsets=0x7ffe97abb990, mtr=0x7ffe97abc230) at page0cur.cc:2417
#10 0x0000000000b79fab in btr_cur_optimistic_delete_func (cursor=0x7ffe97abc760, flags=<optimized out>, mtr=0x7ffe97abc230) at btr0cur.cc:5163
#11 0x00000000009c4387 in ibuf_delete_rec (space=4, page_no=27, pcur=0x7ffe97abc760, search_tuple=0x26faf98, mtr=0x7ffe97abc230) at ibuf0ibuf.cc:4261
#12 0x00000000009c11de in ibuf_merge_or_delete_for_page (block=0x7f40102526a8, page_id=..., page_size=0x7f40102526b8, update_ibuf_bitmap=1) at ibuf0ibuf.cc:4645
#13 0x00000000009fcd6f in mlog_init_t::ibuf_merge (this=<optimized out>, mtr=...) at log0recv.cc:330
#14 0x00000000009f44c5 in recv_apply_hashed_log_recs (last_batch=<optimized out>) at log0recv.cc:2516
#15 0x0000000000ae989b in innobase_start_or_create_for_mysql () at srv0start.cc:2224
#16 0x000000000097d77f in innobase_init (p=<optimized out>) at ha_innodb.cc:4377
#17 0x00000000007af955 in ha_initialize_handlerton (plugin=0x2367ca8) at handler.cc:521
#18 0x000000000060cc59 in plugin_initialize (tmp_root=0x7ffe97ac26b8, plugin=0x2367ca8, argc=0x15d1f68 <remaining_argc>, argv=0x2225428, options_only=<optimized out>) at sql_plugin.cc:1416
#19 0x000000000060c77e in plugin_init (argc=0x15d1f68 <remaining_argc>, argv=0x2225428, flags=2) at sql_plugin.cc:1697
#20 0x00000000005525b5 in init_server_components () at mysqld.cc:5311
#21 mysqld_main (argc=<optimized out>, argv=0x2225428) at mysqld.cc:5907
Adding --mysqld=--debug=d,ib_log_checkpoint_avoid makes the debug version fail with the same backtrace.
Marko Mäkelä
added a comment - kevg, in MDEV-19346 I had to disable the assertion, because it would now fail by default, due to fewer redo log checkpoints being executed. I hope that you can now more easily reduce the test case and narrow down the cause. Please remember --mysqld=--debug=d,ib_log_checkpoint_avoid just in case.
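For example, an invocation with that debug injection could look like this (illustrative; the test name is the one discussed later in this thread):

./mtr --mysqld=--debug=d,ib_log_checkpoint_avoid innodb.innodb-change-buffer-recovery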
Marko Mäkelä
added a comment - We could trivially fix this bug by removing the ability of the purge of history to submit work to the change buffer, but first I would like to see some results of performance tests.
Marko Mäkelä
added a comment - I believe that the test failure shares the root cause with MySQL Bug #61104, which was never fixed (the assertion was merely changed to a warning).
Marko Mäkelä
added a comment - Based on the benchmarks that were run by axel, the change buffering of purge operations significantly improves performance when the buffer pool size is slightly smaller than the working set.
Hence, we should try to spend some more effort to correct the error rather than to disable the feature altogether.
Axel Schwenke
added a comment - edited I attach the spreadsheet MDEV-19344.ods with the results of my benchmark. The workload is OLTP-ish with SELECT using the secondary index (and thus realizing buffered changes). Three different mixes of writes are used additionally:
DELETE + INSERT
UPDATE on the indexed column
a mix of the former two
I did 3 complete runs with different buffer pool sizes. The working set is ~40GB and the buffer pool was set to 32G, 20G and 40G respectively. Relevant for this task are the runs with 32G and 20G.
It turns out that workloads with UPDATE suffer quite a lot from removing the change buffering. Performance drops by ~5% on average, more than 10% for certain thread counts.
Marko Mäkelä
added a comment - mleich provided an rr record trace of a related failure for a 10.5-based tree that I spent all day analyzing today. I think that I may have found the cause of this corruption. I do not know this for sure yet; I would have to go back to an older version to reproduce the failure stated in the Description, so that I can check it against the fix once I have one.
In short, the function ibuf_get_volume_buffered_hash() is calculating the hash value for garbage, and thus it will consider two equal keys to be distinct. The key is (NULL,520) in our case.
len = ibuf_rec_get_size(
	rec, types,
	rec_get_n_fields_old(rec) - IBUF_REC_FIELD_USER,
	FALSE, comp);

fold = ut_fold_binary(data, len);
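As a standalone illustration of the failure mode (toy code, not InnoDB's; fold() is a stand-in for ut_fold_binary()): when len overshoots the actual key bytes, the fold also covers whatever garbage happens to follow, so two equal keys can produce different fold values.

#include <cstdio>
#include <cstring>

/* toy byte fold, standing in for ut_fold_binary() */
static unsigned long fold(const unsigned char *p, unsigned long len)
{
	unsigned long h = 0;
	while (len--) h = h * 31 + *p++;
	return h;
}

int main()
{
	unsigned char rec1[510], rec2[510];
	memset(rec1, 0xaa, sizeof rec1);	/* different trailing garbage */
	memset(rec2, 0x55, sizeof rec2);
	static const unsigned char key[4] = {0x80, 0x00, 0x02, 0x08};	/* (NULL,520) */
	memcpy(rec1, key, 4);
	memcpy(rec2, key, 4);
	/* with the correct length, the equal keys fold equal */
	printf("len=4: %s\n", fold(rec1, 4) == fold(rec2, 4) ? "equal" : "distinct");
	/* with the overlong length, as in the bug, they fold distinct */
	printf("len=510: %s\n", fold(rec1, 510) == fold(rec2, 510) ? "equal" : "distinct");
	return 0;
}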
For the two occurrences of (NULL,520), we got len=510 even though the payload (0x80, 0x0, 0x2, 0x8) seems to be in the first 4 bytes. The more complete scenario was like this:
SET @r=REPEAT(_ucs2 0x0008, 152);
Secondary index page contents at eviction:
(@r,34),(@r,35),(@r,36)
Buffered changes:
866171 INSERT (NULL, 518)
page contents after applying: (NULL,518),(@r,34),(@r,35),(@r,36)
885859 DELETE_MARK (@r, 34)
page contents after applying: (NULL,518),d(@r,34),(@r,35),(@r,36)
886807 DELETE (@r, 34)
page contents after applying: (NULL,518),(@r,35),(@r,36)
890308 DELETE_MARK (NULL, 518)
page contents after applying: d(NULL,518),(@r,35),(@r,36)
890961 DELETE (NULL, 518)
page contents after applying: (@r,35),(@r,36)
903869 DELETE_MARK (@r, 35)
page contents after applying: d(@r,35),(@r,36)
904861 DELETE (@r, 35)
page contents after applying: (@r,36)
912808 DELETE_MARK (@r, 36)
page contents after applying: d(@r,36)
922143 INSERT (NULL, 519)
page contents after applying: (NULL, 519), d(@r,36)
923229 DELETE (@r, 36)
page contents after applying: (NULL, 519)
923233 DELETE (@r, 36) # duplicating the previous op
page contents after applying: (NULL, 519)
929226 DELETE_MARK (NULL, 519)
page contents after applying: d(NULL, 519)
931121 INSERT (NULL, 520)
page contents after applying: d(NULL, 519), (NULL, 520)
931299 DELETE (NULL, 519)
page contents after applying: (NULL, 520)
934641 DELETE_MARK (NULL, 520)
page contents after applying: (NULL, 520)
935534 DELETE (NULL, 520) # ERROR! page would become empty!
page contents after applying: empty (not allowed!)
945413 INSERT (NULL, 521)
946546 DELETE_MARK (NULL, 521)
946550 DELETE_MARK (NULL, 521) # duplicating the previous operation
950322 INSERT (REPEAT (_ucs2 0x0003, 4), 521) # (NULL, 521) was never deleted
For the wrongly buffered DELETE operation, we miscalculated that there would be at least 4 records in the page based on the change buffer contents, in addition to the 3 records that we cannot see because the page is not in the buffer pool. Finally, after all these records were buffered, we would merge the changes to the page and issue a message like this:
2020-05-05 16:29:00 15 [ERROR] InnoDB: Unable to purge a record
InnoDB: tuple DATA TUPLE: 2 fields;
 0: SQL NULL;
 1: len 4; hex 80000208; asc     ;;
InnoDB: record PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: SQL NULL;
 1: len 4; hex 80000208; asc     ;;
space 5 offset 1109 (1 records, index id 31)
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
More anomalies that I did not check yet:
The above log is from ibuf_insert_low() calls, the number being the 'when' counter of rr replay. There are 15 calls before the wrongly buffered DELETE operation, but only 14 buffered operations exist in the change buffer at that time.
The function ibuf_get_volume_buffered() processes the records backwards. Could this be a problem if some of the first buffered operations are not considered?
There is no underflow protection when min_n_recs is decremented. If we only saw previously buffered DELETE operations, we might wrongly consider the page to be nonempty.
The keys for DELETE operations will not be added to ibuf_get_volume_buffered_hash(). This might actually be OK.
The primary fix will be to compute the hash only on the index record that would be applied to the secondary index leaf page in the table. While doing that, we could replace the slow ut_fold_binary() with ut_crc32().
Marko Mäkelä
added a comment - edited The ‘extra’ call to ibuf_insert_low() is explained by the following:
err = ibuf_insert_low(BTR_MODIFY_PREV, op, no_counter,
		      entry, entry_size,
		      index, page_id, zip_size, thr);
if (err == DB_FAIL) {
	err = ibuf_insert_low(BTR_MODIFY_TREE | BTR_LATCH_FOR_INSERT,
			      op, no_counter, entry, entry_size,
			      index, page_id, zip_size, thr);
}
That is, no duplicate record was inserted into the change buffer for either the DELETE (@r, 36) or the DELETE_MARK (NULL, 521). Only 14 records were present when the DELETE (NULL, 520) was wrongly buffered.
In ibuf_get_volume_buffered() we also check the preceding change buffer page. This worries me a little, because normally a backward B-tree scan in InnoDB could lead to a deadlock of threads, and we normally save the cursor position and search for the preceding record to move to the preceding page. Only if the caller ibuf_insert_low() was invoked with mode = BTR_MODIFY_TREE | BTR_LATCH_FOR_INSERT would there likely be no concern, because we would be holding ibuf_mutex and ibuf_pessimistic_insert_mutex. However, that call has been there since the inclusion of InnoDB in MySQL. If I remember correctly the discussions between 2005 and 2007, going back one more page could lead to deadlocks. So, I guess that we are fine with regard to deadlocks here. (Also, I am not aware of any actual deadlocks in this code.)
The following hash function ought to work. False positives are not a correctness problem. For example, ('fu','bar'), (NULL,'fubar'), ('fub','ar') would be treated as equal, and we would increment min_n_recs by at most 1 for those 3 distinct keys.
/** Determine if a change buffer record has been encountered already.
@param rec   change buffer record in the MySQL 5.5 format
@param hash  hash table of encountered records
@param size  number of elements in hash
@retval true if a distinct record
@retval false if this may be duplicating an earlier record */
static bool ibuf_get_volume_buffered_hash(const rec_t *rec, ulint *hash,
                                          ulint size)
{
  ut_ad(rec_get_n_fields_old(rec) > IBUF_REC_FIELD_USER);
  const ulint start= rec_get_field_start_offs(rec, IBUF_REC_FIELD_USER);
  const ulint len= rec_get_data_size_old(rec) - start;
  const uint32_t fold= ut_crc32(rec + start, len);
  hash+= (fold / (CHAR_BIT * sizeof *hash)) % size;
  const ulint bitmask= static_cast<ulint>(1) << (fold % (CHAR_BIT * sizeof *hash));
  if (*hash & bitmask)
    return false;
  /* We have not seen this record yet. Remember it. */
  *hash|= bitmask;
  return true;
}
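A minimal standalone sketch (hypothetical names, not InnoDB code; the real caller keeps a hash_bitmap array on the stack) of how such a bitmap turns duplicate folds into at most one increment of min_n_recs:

#include <climits>
#include <cstdio>

typedef unsigned long ulint;

/* returns true the first time a fold value is seen, false afterwards */
static bool is_distinct(unsigned fold, ulint *hash, ulint size)
{
	hash += (fold / (CHAR_BIT * sizeof *hash)) % size;
	const ulint bitmask = (ulint) 1 << (fold % (CHAR_BIT * sizeof *hash));
	if (*hash & bitmask)
		return false;	/* possibly a duplicate, or a false positive */
	*hash |= bitmask;	/* remember this fold */
	return true;
}

int main()
{
	ulint bitmap[128 / sizeof(ulint)] = {0};	/* 128 bytes = 1024 bits */
	ulint min_n_recs = 0;
	const unsigned folds[] = {42, 42, 7};	/* two equal keys, one distinct */
	for (unsigned i = 0; i < 3; i++)
		if (is_distinct(folds[i], bitmap, sizeof bitmap / sizeof *bitmap))
			min_n_recs++;
	printf("min_n_recs=%lu\n", min_n_recs);	/* prints 2 */
	return 0;
}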
The comparison min_n_recs < 2 in ibuf_insert_low() actually does check for underflow, because the variable is a signed type (lint, not ulint).
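Sketched from memory rather than quoted from the source, the guard in ibuf_insert_low() is along these lines; because lint is signed, the comparison still holds after the counter has been decremented past zero:

	lint min_n_recs = 0;	/* conservative: the unseen page may hold 0 records */
	/* ... each previously buffered IBUF_OP_DELETE decrements min_n_recs,
	possibly below zero; distinct buffered inserts increment it ... */
	if (op == IBUF_OP_DELETE && min_n_recs < 2) {
		/* refuse to buffer the purge; the page has to be read
		and the record deleted in place */
		err = DB_STRONG_FAIL;
	}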
With the corrected hash function, the estimate of min_n_recs looks conservative to me, even if there were multiple change buffer pages of buffered changes and the estimate did not consider them all. In the worst case, we missed many preceding pages filled with IBUF_DELETE_OP records that had been buffered for the user page earlier. But, it would have been an error to buffer those operations if the records did not exist in the first place. So, it is OK to ignore any preceding pages of buffered operations and start with a conservative estimate that there exist at least min_n_recs=0 records in the page.
Update: I tried 10.2 with the following patch:
diff --git a/storage/innobase/include/page0page.ic b/storage/innobase/include/page0page.ic
index 4efb92f7182..276f85df69a 100644
--- a/storage/innobase/include/page0page.ic
+++ b/storage/innobase/include/page0page.ic
@@ -173,9 +173,7 @@ page_header_set_field(
 {
 	ut_ad(page);
 	ut_ad(field <= PAGE_N_RECS);
-#if 0 /* FIXME: MDEV-19344 hits this */
 	ut_ad(field != PAGE_N_RECS || val);
-#endif
 	ut_ad(field == PAGE_N_HEAP || val < srv_page_size);
 	ut_ad(field != PAGE_N_HEAP || (val & 0x7fff) < srv_page_size);
but the test innodb.innodb-change-buffer-recovery did not fail in 70 runs even without my fix. I will try harder, possibly with an older version, so that I can test my fix.
Marko Mäkelä
added a comment - So far, I have been unable to repeat the assertion failure on 10.2. It looks like the result difference was caused by the test not removing the file my_restart.err between runs, causing a failure on ./mtr --repeat. That test problem was fixed in 10.2.28, 10.3.19, 10.4.10, 10.5.0.
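That is, before those releases the result difference could be triggered simply by running the test twice in a row, for example:

./mtr --repeat=2 innodb.innodb-change-buffer-recovery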
Matthias Leich
added a comment - The rr trace mentioned above belongs to some test run on a 10.5 development tree where the server aborts with (snip from server error log, slightly edited):
[ERROR] InnoDB: Unable to purge a record
mysqld: /home/mleich/bb-10.5-marko/storage/innobase/ibuf/ibuf0ibuf.cc:3994: void ibuf_delete(const dtuple_t*, buf_block_t*, dict_index_t*, mtr_t*): Assertion `0' failed.
[ERROR] mysqld got signal 6 ;
Marko Mäkelä
added a comment - mleich reported the assertion failure during crash recovery. It turns out that the assertion was missing || recv_recovery_is_on() at the end. The reason for this is the following condition, which has been removed in 10.5 (MDEV-12353):
if (page_get_n_recs(page) == 1 && !recv_recovery_is_on()) {
	/* Empty the page, unless we are applying the redo log
	during crash recovery. During normal operation, the
	page_create_empty() gets logged as one of MLOG_PAGE_CREATE,
	MLOG_COMP_PAGE_CREATE, MLOG_ZIP_PAGE_COMPRESS. */
	ut_ad(page_is_leaf(page));
	/* Usually, this should be the root page,
	and the whole index tree should become empty.
	However, this could also be a call in
	btr_cur_pessimistic_update() to delete the only
	record in the page and to insert another one. */
	page_cur_move_to_next(cursor);
	ut_ad(page_cur_is_after_last(cursor));
	page_create_empty(page_cur_get_block(cursor),
			  const_cast<dict_index_t*>(index), mtr);
	return;
}
Only during crash recovery can we decrement PAGE_N_RECS from 1 to 0. In all other cases, we would invoke page_create_empty(). The assertion failed on the change buffer tree, because the change buffer was emptied during crash recovery.
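A hedged sketch of the implied correction (not necessarily the committed change): with the condition above gone, the debug assertion in page_header_set_field() has to tolerate the recovery case, along the lines of

	ut_ad(field != PAGE_N_RECS || val || recv_recovery_is_on());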
Marko Mäkelä
added a comment - Summary: The test was not tolerant of ./mtr --repeat=2, and the assertion that was added in the hope of catching the bug that I have now filed and fixed as MDEV-22497 was bogus. I removed the assertion that was already bracketed inside #if 0.