[MDEV-19344] innodb.innodb-change-buffer-recovery-innodb fails with incorrect results or crash Created: 2019-04-26 Updated: 2020-08-11 Resolved: 2020-05-07 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.2, 10.3, 10.4 |
| Fix Version/s: | 10.2.33, 10.3.24, 10.4.14 |
| Type: | Bug | Priority: | Major |
| Reporter: | Eugene Kosov (Inactive) | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | corruption, not-10.5, upstream-5.5 | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Description |
|
In debug mode running a test with --repeat results in an every second run failed like this:
In debug mode with -O2 it sometime crashes like this:
Adding - |
| Comments |
| Comment by Marko Mäkelä [ 2019-05-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
kevg, in | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-03 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
We could trivially fix this bug by removing the ability of the purge of history to submit work to the change buffer, but first I would like to see some results of performance tests. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I ported to 10.2 my patch to cease the purge buffering. It also re-enables the assertion. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-04 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I believe that the test failure shares the root cause with MySQL Bug #61104, which was never fixed (the assertion was merely changed to a warning). | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-10-08 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Based on the benchmarks that were run by axel, the change buffering of purge operations is significantly improving performance when the buffer pool size is slightly smaller than the working set. Hence, we should try to spend some more effort to correct the error rather than to disable the feature altogether. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Axel Schwenke [ 2019-10-08 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I attach the spread sheet MDEV-19344.ods
I did 3 complete runs with different buffer pool size. The working set is ~40GB and the buffer pool was set to 32G, 20G and 40G respectively. Relevant for this task are the runs with 32 and 20G. It turns out that workloads with UPDATE suffer quite a lot from removing the change buffering. Performance drops by ~5% on average, more than 10% for certain thread counts. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-05-06 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
mleich provided a rr record trace of a related failure for a 10.5-based tree that I spent all day analyzing today. I think that I may have found the cause of this corruption. I do not know this for sure yet, and I would have to go back to an older version to reproduce the failure stated in the Description, to check it with the fix, once I have a fix. In short, the function ibuf_get_volume_buffered_hash() is calculating the hash value for garbage, and thus it will consider two equal keys to be distinct. The key is (NULL,520) in our case.
For the two occurrences of (NULL,520), we got len=510 even though the payload (0x80, 0x0, 0x2, 0x8) seems to be in the first 4 bytes. The more complete scenario was like this:
For the wrongly buffered DELETE operation, we miscalculated that there would be at least 4 records in the page based on the change buffer contents, in addition to the 3 records that we cannot see because the page is not in the buffer pool. Finally, after all these records were buffered, we would merge the changes to the page and issue a message like this:
More anomalies that I did not check yet:
The primary fix will be to compute the hash only on the index record that would be applied to the secondary index leaf page in the table. While doing that, we could replace the slow ut_fold_binary() with ut_crc32(). | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-05-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
The ‘extra’ call to ibuf_insert_low() is explained by the following:
That is, no duplicate record was inserted into the change buffer for either the DELETE (@r, 36) or the DELETE_MARK (NULL, 521). Only 14 records were present when the DELETE (NULL, 520) was wrongly buffered. In ibuf_get_volume_buffered() we check also the preceding change buffer page. This worries me a little, because normally, a backward B-tree scan in InnoDB could lead to a deadlock of threads, and we normally save the cursor position and search for the preceding record to move to the preceding page. Only if the caller ibuf_insert_low() was invoked with mode = BTR_MODIFY_TREE | BTR_LATCH_FOR_INSERT would there likely be no concern, because we would be holding ibuf_mutex and ibuf_pessimistic_insert_mutex. However, that call has been there since the inclusion of InnoDB in MySQL. If I remember correctly the discussions between 2005 and 2007, going back one more page could lead to deadlocks. So, I guess that we are fine with regard to deadlocks here. (Also, I am not aware of any actual deadlocks in this code.) The following hash function ought to work. False positives are not a correctness problem. For example, ('fu','bar'), (NULL,'fubar'),('fub','ar') would be treated as equal, and we would increment min_n_rec by at most 1 for those 3 distinct keys.
The comparison min_n_recs < 2 in ibuf_insert_low() actually does check for underflow, because the variable is a signed type (lint, not ulint). With the corrected hash function, the estimate of min_n_recs looks conservative to me, even if there were multiple change buffer pages of buffered changes and the estimate did not consider them all. In the worst case, we missed many preceding pages filled with IBUF_DELETE_OP records that had been buffered for the user page earlier. But, it would have been an error to buffer those operations if the records did not exist in the first place. So, it is OK to ignore any preceding pages of buffered operations and start with a conservative estimate that there exist at least min_n_recs=0 records in the page. Update: I tried 10.2 with the following patch:
but the test innodb.innodb-change-buffer-recovery did not fail in 70 runs even without my fix. I will try harder, possibly with an older version, so that I can test my fix. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-05-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
So far, I have been unable to repeat the assertion failure on 10.2. It looks like the result difference was a caused by the test not removing the file my_restart.err between runs, causing a failure on ./mtr --repeat. That test problem was fixed in 10.2.28, 10.3.19, 10.4.10, 10.5.0. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Matthias Leich [ 2020-05-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-05-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
mleich reported the assertion on crash recovery. It turns out that the assertion was missing || recv_recovery_is_on() at the end. The reason for this is the following condition, which has been removed in 10.5 (
Only during crash recovery, we can decrement PAGE_N_RECS from 1 to 0. In all other cases, we would invoke page_create_empty(). The assertion failed on the change buffer tree, because the change buffer was emptied during crash recovery. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-05-07 ] | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Summary: The test was not tolerant of ./mtr --repeat=2 and the assertion that was added in a hope to catch the bug that I now filed and fixed as |