[MDEV-27927] row_sel_try_search_shortcut_for_mysql() does not latch a page, violating read view isolation Created: 2022-02-22 Updated: 2023-07-26 Resolved: 2022-10-05 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - InnoDB |
| Affects Version/s: | 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11 |
| Fix Version/s: | 10.3.37, 10.4.27, 10.5.18, 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Si Liu | Assignee: | Vladislav Lesin |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | galera, innodb, isolation, repeatable_read |
| Attachments: |
|
| Description |
|
We have been testing Galera-MariaDB data consistency models. Together with a Galera contributor, we found REPEATABLE READ anomalies even with Galera disabled, so there might be something wrong with MariaDB itself. We used the default REPEATABLE READ isolation level in our testing.
The collected data (histories of transaction executions): the specific anomaly is shown at https://github.com/codership/galera/issues/609#issuecomment-1036096287; also see the attached table.
Notation: w/r(key, value, client_id, txn_id); the operations of a transaction (denoted by a column) are listed top down.
Anomaly: txn393 read value 3328 on key 0 (red cell, column 1) in between the reads of value 33324 on key 0 (yellow cells).
Setup: 3 server nodes and 3 client nodes. Galera version: 26.4.9. MariaDB version: 10.4.22. |
| Comments |
| Comment by Marko Mäkelä [ 2022-03-29 ] |
|
Can you please provide an executable test script that delivers a "pass" or "fail" result? Preferably in source code format, so that it can be adapted as needed. To actually fix this bug, I think that the server needs to be run with rr record, so that we can find out what exactly is going wrong on the server side.
| Comment by Marko Mäkelä [ 2022-03-29 ] |
|
Please state the exact version. Could this be a duplicate of an earlier report? If not, would the scenario of MDEV-26643 match this? Or, could this be explained by the anomaly that is described in MDEV-26642? Apart from that: an active InnoDB transaction is always supposed to see the uncommitted changes that it has made itself. Thus, if a single transaction comprises both reads and writes, REPEATABLE READ or READ COMMITTED is not quite what one might expect.
| Comment by Zoe Song [ 2022-04-03 ] |
|
Hi, the Repeatable Read we mention here is actually Cut Isolation (a weaker level than Repeatable Read): given a key and a transaction, if the transaction reads the key multiple times and there is no write to the key between these reads, then the read results should be consistent. As for the executable testing scripts: in the beginning we were testing Causal Consistency of Galera Cluster, and during that investigation we found this repeatable read problem. Note that the data reported here were generated by the Galera engineer, who managed to get the data from a single MariaDB node with Galera replication disabled. Our original testing scripts cannot detect this Repeatable Read issue, so we created a new testing script for it, attached here. Our testing scripts follow Jepsen's testing methodology. Since it is black-box testing, we cannot produce the exact same results; if you want to reproduce this violation, it is better to rerun the testing scripts with larger parameters, for example 20 keys, 10 clients, 100 transactions per client, and 25 operations per transaction.
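The Cut Isolation property described above can be expressed as a small checker. This is a hypothetical sketch (not part of the attached scripts): within one transaction, two reads of the same key with no own write in between must return the same value.

```python
# Hypothetical sketch of the "Cut Isolation" check described above.
# ops: operations of ONE transaction in execution order,
# each a (kind, key, value) tuple with kind 'r' (read) or 'w' (write).
def find_cut_isolation_violations(ops):
    last_read = {}      # key -> last value read, reset by the txn's own writes
    violations = []
    for kind, key, value in ops:
        if kind == 'w':
            # the transaction's own write legitimately changes what it sees
            last_read.pop(key, None)
        else:
            if key in last_read and last_read[key] != value:
                violations.append((key, last_read[key], value))
            last_read[key] = value
    return violations
```

For instance, the reported anomaly (a transaction reading 62636 on a key several times and then 72307, with no intervening own write) would be flagged as `(key, 62636, 72307)`.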
| Comment by Marko Mäkelä [ 2022-05-02 ] |
|
I edited the scripts to connect to 127.0.0.1 (localhost would use Unix domain sockets) so that they connect to a manually started server:
I started the server, and then started the test harness as follows:
I was not able to reproduce anything on 10.9 in this way, except Too many connections if I started a further 100 client processes a few more times. The stack traces on the server process would show a large number of blocked lock_wait() calls. nobiplusplus or sciascid, can you please give me an exact invocation? I also tried a 10.4 branch, and I saw the same behaviour: 100% CPU load for a short while after starting the clients, and then about 1000 threads would be stuck in lock_wait_suspend_thread(). What is the correct way to invoke the test? I tested this on a machine with 40 hardware threads.
| Comment by Daniele Sciascia [ 2022-05-13 ] |
|
marko, I managed to reproduce as follows:
At the end, the last script, check_rr.py, will sometimes report a violation of RR: "find RR violation in trace: 0". You can find more about the violation in output/0/rr_violation.txt. For example, in one case I found the following: (var: 0, old_val: 62636, new_val: 72307, tra_id: 385), meaning that transaction 385 supposedly violated RR. Checking the trace in output/0/result.txt, I can see the following operations for transaction 385: r(5,62632,6,385) … It read row 0 a bunch of times. It always read value 62636, except for one time where it read 72307.
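A hypothetical parser for the trace notation quoted above (assuming each operation is printed as r(key,value,client_id,txn_id) or w(key,value,client_id,txn_id)) could group read values per transaction and key, and flag candidates such as transaction 385:

```python
import re
from collections import defaultdict

# Assumed trace format: r(key,value,client_id,txn_id) / w(key,value,client_id,txn_id)
OP_RE = re.compile(r'([rw])\((\d+),(\d+),(\d+),(\d+)\)')

def reads_per_txn_key(trace_lines):
    """Return {(txn_id, key): [values read]} for every (txn, key) pair
    where more than one distinct value was read -- a candidate RR
    violation (the transaction's own writes are ignored in this sketch)."""
    seen = defaultdict(list)
    for line in trace_lines:
        for kind, key, value, client, txn in OP_RE.findall(line):
            if kind == 'r':
                seen[(int(txn), int(key))].append(int(value))
    return {k: v for k, v in seen.items() if len(set(v)) > 1}
```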
| Comment by Daniele Sciascia [ 2022-05-13 ] |
|
I forgot to add that I reproduced this in 10.4 (at commit 16cebed54065ad9e18953aa86d48f6007d53c2d3). Build type: debug.
| Comment by Si Liu [ 2022-05-13 ] |
|
@sciascid Thank you for answering @marko's question on setting up the reproduction! Indeed, your interpretation of the Repeatable Read anomaly makes sense: "It read row 0 a bunch of times. It always read value 62636, except for one time where it read 72307."
| Comment by Zoe Song [ 2022-05-13 ] |
|
@Daniele Sciascia Hi, did you run it on pure MariaDB, or Galera-MariaDB with Galera disabled?
| Comment by Daniele Sciascia [ 2022-05-14 ] |
|
zoooesong Hi, there are no separate distributions for MariaDB / Galera-MariaDB. I reproduced with MariaDB with Galera disabled.
| Comment by Marko Mäkelä [ 2022-05-16 ] |
|
Thank you both. For a 10.6-based branch, I got the following output for galera.py -t 100 -c 10:
The script group_data.py did not report anything, and check_rr.py did not generate any file. I am not a Python expert, but it looks like the script would always output a line like the following, because None is apparently not equal to the empty list:
I tried to patch the script as follows, to get any violations reported directly on the standard output:
I tested with both the default innodb_page_size=16k and the minimum innodb_page_size=4k. To diagnose and fix this, I would need an rr record trace (https://rr-project.org/) of a server that generated an invalid run. Preferably the server should be built with cmake -DPLUGIN_PERFSCHEMA=NO -DWITH_DBUG_TRACE=OFF, because conditional branches make rr much slower. The -DPLUGIN_PERFSCHEMA=NO is useful also for -DCMAKE_BUILD_TYPE=RelWithDebInfo. I tested with debug builds.
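The pitfall Marko points out is easy to reproduce in isolation: a Python function that returns None instead of an empty list makes an `!= []` check fire even when nothing was found. A minimal sketch (hypothetical function, not the actual check_rr.py code):

```python
def find_violations(trace):
    """Buggy sketch: returns None instead of [] when nothing is found,
    mimicking the behaviour described in the comment above."""
    violations = [op for op in trace if op.get('bad')]
    if not violations:
        return None          # should simply have been: return violations
    return violations

result = find_violations([{'bad': False}])
# None != [] evaluates to True, so this misreports "violations found":
if result != []:
    pass                     # always reached, even for a clean trace
# The robust check is plain truthiness: None and [] are both falsy.
if result:
    pass                     # only reached for a non-empty violation list
```

Returning an empty list consistently (or testing truthiness rather than comparing against `[]`) removes the ambiguity.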
| Comment by Zoe Song [ 2022-05-20 ] |
|
@Marko Mäkelä Hi, sorry for the wrong if-condition on the empty list in the Python script. I created a new repo for MariaDB data generation here: mariadb_data_generation. To set up MariaDB, I tried to build it with CMake; unfortunately, it kept returning errors, so I used the method in this script to set up the MariaDB server. This week I tried MariaDB in two different ways, both with the following parameters:
And the results are as follows: 1. Set up MariaDB locally and use 127.0.0.1 to connect and generate workloads: after 300 runs, no violation was found. 2. Set up the server and client of MariaDB on two separate machines (both from EmuLab): after 300 runs, RR violations appeared in multiple runs.
| Comment by Marko Mäkelä [ 2022-06-16 ] |
|
Sorry for the delay. For a 10.6 debug build, I initially was unable to reproduce any error. I used the scripts as is, just changing the hardcoded password:
I understand that the issue is rather sporadic. The timing characteristics of 10.4 and 10.6 could be rather different due to performance improvements. I reproduced this on a recent 10.4:
My build options included the following, which may be of importance (at least it will make the test complete much faster than with a debug build):
Finally, I also reproduced this with a 10.6 non-debug build (no failure with a debug build):
I did not double-check the validity of these results. I did not even check whether the parsing error that triggered the Python stack trace would cause a bogus report of an RR violation. For the 10.4 run, no such stack trace was shown. vlad.lesin, can you please try to generate and debug an rr record trace of a server where the failure is repeated?
| Comment by Vladislav Lesin [ 2022-08-31 ] |
|
The test case: MDEV-27927-test.diff. The problem is that, despite the following comment:
row_sel_try_search_shortcut_for_mysql() does not protect the record's page with its latch after the record was found in the table's AHI. Useful comment from Marko:
So we need to acquire the page latch before releasing the AHI latch in row_sel_try_search_shortcut_for_mysql(). The unanswered question to me is why the comment explains the latching order of row_sel_try_search_shortcut_for_mysql(), while row_sel_try_search_shortcut_for_mysql() does not follow the comment. Wasn't the page latch removed intentionally for some reason? If somebody is affected, the workaround is to disable the AHI until the bug is fixed.
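The required ordering — acquire the page latch before releasing the AHI latch — can be sketched with ordinary locks. All names below are hypothetical stand-ins; the real InnoDB code uses rw-locks and buffer-fixing, not Python mutexes:

```python
import threading

ahi_latch = threading.Lock()   # stand-in for the adaptive hash index latch
page_latch = threading.Lock()  # stand-in for the index page latch

def lookup_via_ahi(read_record):
    """Hand-over-hand sketch: the page latch is taken while the AHI
    latch is still held, so no writer can modify or relocate the record
    between the hash lookup and the record read."""
    with ahi_latch:
        # ... hash lookup yields a guessed record pointer here ...
        page_latch.acquire()   # must happen BEFORE the AHI latch is released
    try:
        return read_record()   # record access is covered by the page latch
    finally:
        page_latch.release()
```

The bug described above corresponds to releasing `ahi_latch` without taking `page_latch` at all: between the two steps, a concurrent writer can change the record, which is how a read view can observe a value it should never see.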
| Comment by Marko Mäkelä [ 2022-09-06 ] |
|
vlad.lesin, thank you for the analysis. I understood that btr_search_guess_on_hash() would only acquire an index page latch if it is invoked with ahi_latch=NULL. A possible fix would be that btr_search_guess_on_hash(…, ahi_latch=NULL, …) would only buffer-fix the page, and it would always be the caller's responsibility to acquire the page latch after releasing the ahi_latch. Also a "trylock" acquisition is possible while holding the ahi_latch, if a failure to acquire the page latch is treated as "AHI lookup failed". The adaptive hash index is completely indifferent to any non-key fields or to the delete-mark flag. In the clustered index, any fields starting with DB_TRX_ID are not part of the key, so the AHI does not have to be invalidated if they are updated (in place). An MVCC read will have to read DB_TRX_ID and possibly DB_ROLL_PTR, and it is only safe to do that while holding a latch on the index page.
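Marko's "trylock" alternative can be sketched the same way: while still holding the AHI latch, attempt a non-blocking page-latch acquisition, and treat failure as an AHI miss that falls back to a normal B-tree search. Again, all names are hypothetical and plain mutexes stand in for rw-locks:

```python
import threading

ahi_latch = threading.Lock()   # stand-in for the adaptive hash index latch
page_latch = threading.Lock()  # stand-in for the index page latch

AHI_MISS = object()            # sentinel: caller falls back to a B-tree search

def lookup_via_ahi_trylock(read_record):
    """Non-blocking variant: if the page latch cannot be taken while the
    AHI latch is held, report a miss instead of waiting, which avoids
    any risk of a latch-order deadlock."""
    with ahi_latch:
        if not page_latch.acquire(blocking=False):
            return AHI_MISS    # treated as "AHI lookup failed"
    try:
        return read_record()
    finally:
        page_latch.release()
```

The design trade-off: an occasional spurious miss (and a slower B-tree descent) is cheap, whereas blocking on the page latch while holding the AHI latch would invert the usual latching order.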
| Comment by Vladislav Lesin [ 2022-09-07 ] |
|
The ahi_latch argument of btr_cur_search_to_nth_level_func() and btr_pcur_open_with_no_init_func() is used only for row_sel_try_search_shortcut_for_mysql() (10.4). btr_cur_search_to_nth_level_func(..., ahi_latch != 0, ...) is invoked only from btr_pcur_open_with_no_init_func(..., ahi_latch != 0, ...), which, in turn, is invoked only from row_sel_try_search_shortcut_for_mysql(). I suppose that the separate case with ahi_latch != 0 was intentionally implemented to protect the row_sel_store_mysql_rec() call in row_search_mvcc() just after the row_sel_try_search_shortcut_for_mysql() call. After the ahi_latch was moved from row_search_mvcc() to row_sel_try_search_shortcut_for_mysql(), there is no need for it at all if btr_search_guess_on_hash() invokes buf_page_get_known_nowait() unconditionally. And if btr_search_guess_on_hash() latched the page, any access to the record in row_sel_try_search_shortcut_for_mysql() after the btr_pcur_open_with_no_init() call will be protected by the page latch. So the fix I propose is just to remove the ahi_latch argument from btr_pcur_open_with_no_init_func(), btr_cur_search_to_nth_level_func() and btr_search_guess_on_hash(). The test I provided will hang after the fix, because the sync point will be under the page latch acquired by the SELECT, while the UPDATE will hang trying to latch the page. So there will not be a test for the fix; at least I don't see a way to develop one.
| Comment by Marko Mäkelä [ 2022-09-08 ] |
|
Thank you vlad.lesin, your analysis is plausible and your suggestion is reasonable. Let us remove the ahi_latch argument.
| Comment by Axel Schwenke [ 2022-09-18 ] |
|
Attached benchmark results: MDEV-27927.pdf. The good news: there are no regressions in the 10.4 branch (red vs. green) nor in the 10.6 branch (pink vs. light blue). But I also tested 10.6 with the AHI turned off (the default) and it was much better for batched UPDATE workloads.
| Comment by Vladislav Lesin [ 2022-09-26 ] |
|
There were two bugs found during 10.6 fix version testing with RQG (bb-10.6-
| Comment by Vladislav Lesin [ 2022-10-03 ] |
|
Some update. Pushed bb-10.[36]-
| Comment by Marko Mäkelä [ 2022-10-04 ] |
|
Thank you, the 10.3 version looks good to me. I think that the debug assertion that you would adjust in buf_page_get_known_nowait() can be safely removed altogether, and that function signature can remain unchanged. In |