MariaDB Server / MDEV-27927

row_sel_try_search_shortcut_for_mysql() does not latch a page, violating read view isolation

Details

    Description

      We have been testing Galera-MariaDB against data consistency models. Together with a Galera contributor, we found REPEATABLE READ anomalies even with Galera disabled; hence, there might be something wrong with MariaDB itself. We used the default REPEATABLE READ isolation level in our testing.

      The collected data (or histories of transaction executions):
      https://gist.github.com/sciascid/aae12c130bdcfe1930601f19ce0f29d5/archive/6b214d707fdf0fb72b2505955c0e23d38717db6d.zip

      The specific anomaly is shown at https://github.com/codership/galera/issues/609#issuecomment-1036096287 (also see the attached table).

      Notation: w/r(key, value, client_id, txn_id); the operations of a transaction (one column in the table) are listed from top to bottom.

      Anomaly: txn393 read value 3328 on key 0 (red cell, column 1) in between the reads of value 33324 on key 0 (yellow cells).

      Setup: 3 server nodes and 3 client nodes. Galera Version: 26.4.9. MariaDB Version: 10.4.22.
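
      The invariant that is violated can be illustrated with a short sketch using the Python mariadb connector (the same connector the attached test scripts use). This is only an illustration of the expected REPEATABLE READ behaviour; the table name and schema are hypothetical, and the real anomaly is timing-dependent, so this will not reproduce it deterministically:

      import mariadb

      # Hypothetical schema: CREATE TABLE t (k INT PRIMARY KEY, v INT) ENGINE=InnoDB;
      reader = mariadb.connect(host='127.0.0.1', user='root', password='')
      writer = mariadb.connect(host='127.0.0.1', user='root', password='')
      reader.autocommit = False          # both reads run inside one explicit transaction
      writer.autocommit = True

      rc, wc = reader.cursor(), writer.cursor()

      rc.execute("SELECT v FROM t WHERE k = 0")          # first read establishes the read view
      first = rc.fetchone()[0]

      wc.execute("UPDATE t SET v = v + 1 WHERE k = 0")   # concurrent committed write

      rc.execute("SELECT v FROM t WHERE k = 0")          # second read, same transaction
      second = rc.fetchone()[0]
      reader.commit()

      # Under REPEATABLE READ both reads must return the same value; the reported
      # anomaly is first != second although the reader wrote nothing to key 0.
      assert first == second, (first, second)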

          Activity

            marko Marko Mäkelä added a comment -

            Can you please provide an executable test script that delivers a "pass" or "fail" result? Preferably in source code format, so that it can be adapted as needed.

            To actually fix this bug, I think that the server needs to be run with rr record, so that we can find out what exactly is going wrong on the server side.


            marko Marko Mäkelä added a comment -

            Please state the exact version. Could this be a duplicate of MDEV-27992, which was caused by MDEV-27025?

            If not, would the scenario of MDEV-26643 match this? Or, could this be explained by the anomaly that is described in MDEV-26642?

            Apart from that: An active InnoDB transaction is always supposed to see the uncommitted changes that it has made itself. Thus, if a single transaction comprises reads and writes, REPEATABLE READ or READ COMMITTED is not quite what one might expect.

            zoooesong Zoe Song added a comment -

            Hi, the Repeatable Read we mention here is actually Cut Isolation (a weaker level than Repeatable Read): given a key and a transaction, if the transaction reads the key multiple times and there is no write to the key between those reads, then the reads should return consistent results.
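
            A minimal sketch of such a per-key, per-transaction check (an illustration only, not the attached check_rr.py; the example values are taken from the transaction 385 trace quoted later in this ticket):

            def check_cut_isolation(ops):
                """ops: list of ('r'|'w', key, value) tuples of ONE transaction,
                in execution order. Returns the list of violations found."""
                last_read = {}    # key -> value of the latest read with no write since
                violations = []
                for kind, key, value in ops:
                    if kind == 'w':
                        # A write invalidates the "no write in between" condition.
                        last_read.pop(key, None)
                    elif kind == 'r':
                        if key in last_read and last_read[key] != value:
                            violations.append((key, last_read[key], value))
                        last_read[key] = value
                return violations

            ops = [('r', 0, 62636), ('r', 0, 62636), ('r', 0, 72307), ('r', 0, 62636)]
            print(check_cut_isolation(ops))   # [(0, 62636, 72307), (0, 72307, 62636)]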

            Regarding the executable testing scripts: in the beginning we were testing Causal Consistency of Galera Cluster, and during that investigation we found this repeatable read problem. Note that the data reported here were generated by the Galera engineer, who managed to obtain them from a single MariaDB node with Galera replication disabled. Our original testing scripts cannot detect this Repeatable Read issue, so we created a new testing script for it. I attached our testing scripts here.

            Our testing scripts follow Jepsen’s testing methodology. Since this is black-box testing, we cannot reproduce exactly the same results. To reproduce this violation, it is better to rerun the scripts with larger parameters, for example 20 keys, 10 clients, 100 transactions per client, and 25 operations per transaction.
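
            For reference, these parameters correspond to the following variables and command-line options in the scripts (the mapping is an assumption based on the getopt list and the diffs quoted later in this ticket):

            key = 20                # init.py: number of distinct keys created
            threads_num = 10        # galera.py: number of clients (-c / --client_num)
            transaction_num = 100   # galera.py: transactions per client (-t / --trans_num)
            operation_num = 25      # galera.py: operations per transaction (-o / --op_num)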


            marko Marko Mäkelä added a comment -

            I edited the scripts to connect to 127.0.0.1 (localhost would use a Unix domain socket), so that they connect to a manually started server:

            diff --git a/galera.py b/galera.py
            index a172c94..edc1b45 100644
            --- a/galera.py
            +++ b/galera.py
            @@ -21,7 +21,6 @@ operation_num = 25 # operation num in each transaction
             threads_num = 10 # client number
             node_no=1 # node No. if multiple threads is enabled
             folder_num = 0 # the output folder number
            -server_id = ['155.98.39.143','155.98.39.151','155.98.39.130'] # galera server nodes ip_address
             
             try:
                 opts, args = getopt.getopt(sys.argv[1:],"hw:r:p:t:o:c:n:f:",["help","wo_rate=","ro_rate=","w_percent=","trans_num=","op_num=","client_num=","node_no=","folder_num="])
            @@ -248,10 +247,9 @@ def generate_opt(hist_file, trans_num):
             def run_ops(list_of_ops, client_no):
                 op_num = 0
                 result_ops = []
            -    server_num = random_pick([0,1,2],[0.34,0.33,0.33])
            -    server = server_id[server_num]
            +    server = '127.0.0.1'
                 print("client_no: "+ str(client_no) + ", server_no: " + str(server))
            -    connect = mariadb.connect(host=server, user="root",password="123456")
            +    connect = mariadb.connect(host=server, user="root",password="")
                 # Disable Auto-Commit
                 connect.autocommit = False
                 t_count = 0
            @@ -306,7 +304,7 @@ def run_ops(list_of_ops, client_no):
                 connect.close()
                 if t_count < transaction_num:
                     print("UNFINISH")
            -    return result_ops, e_count, server_num
            +    return result_ops, e_count, 0
             
             def write_result(result,file_path, error_num, server_num):
                 '''
            diff --git a/init.py b/init.py
            index 5e5369e..f8c8e25 100644
            --- a/init.py
            +++ b/init.py
            @@ -1,15 +1,10 @@
            -'''
            -python3 init.py 155.98.39.143
            -'''
            -
             import os,sys
             import mariadb
             
             
            -server = sys.argv[1]
             key = 20
             
            -connect = mariadb.connect(host=server, user="root",password="123456")
            +connect = mariadb.connect(host='127.0.0.1', user="root",password="")
             # Disable Auto-Commit
             connect.autocommit = False
             
            

            I started the server, and then started the test harness as follows:

            python3 init.py
            seq  0 99 |while read; do python3 galera.py -f "$REPLY" -n 1 & done
            

            I was not able to reproduce anything on 10.9 in this way, except Too many connections errors when I started a further 100 client processes a few more times. The stack traces of the server process would show a large number of blocked lock_wait() calls. nobiplusplus or sciascid, can you please give me an exact invocation?

            I also tried a 10.4 branch, and I saw the same behaviour: 100% CPU load for a short while after starting the clients, and then about 1000 threads would be stuck in lock_wait_suspend_thread(). What is the correct way to invoke the test? I tested this on a machine with 40 hardware threads.


            sciascid Daniele Sciascia added a comment -

            marko I managed to reproduce as follows:

            rm -rf client/
            rm -rf output/
            python init.py 127.0.0.1
            python galera.py -t 100 -c 10
            python group_data.py
            python check_rr.py
            

            At the end, the last script check_rr.py will sometimes report a violation of RR:

            "find RR violation in trace: 0"

            You can find more about the violation in output/0/rr_violation.txt.

            For example, in one case I found the following:

            ( var: 0, old_val: 62636, new_val: 72307, tra_id: 385)
            ( var: 0, old_val: 72307, new_val: 62636, tra_id: 385)

            Meaning that supposedly, transaction 385 violated RR.

            Now checking the trace in output/0/result.txt, I can see the following operations for transaction 385:

            r(5,62632,6,385)
            r(0,62636,6,385)
            r(5,62632,6,385)
            r(4,32288,6,385)
            r(0,62636,6,385)
            r(16,92344,6,385)
            r(1,62623,6,385)
            r(4,32288,6,385)
            r(1,62623,6,385)
            r(11,62617,6,385
            r(12,62631,6,385
            r(0,72307,6,385)
            r(2,62637,6,385)
            r(9,32285,6,385)
            r(0,62636,6,385)
            r(1,62623,6,385)
            w(15,62638,6,385)
            w(9,62639,6,385)
            w(0,62640,6,385)
            w(1,62641,6,385)
            w(1,62642,6,385)
            w(0,62643,6,385)
            w(1,62644,6,385)
            w(8,62645,6,385)
            w(3,62646,6,385)

            It read row 0 a bunch of times. It always read value 62636, except for one time where it read 72307.
            Since the transaction did not write row 0 in between those reads, it seems that RR was indeed violated.


            sciascid Daniele Sciascia added a comment -

            I forgot to add that I reproduced this on 10.4 (at commit 16cebed54065ad9e18953aa86d48f6007d53c2d3).
            I started the server manually using

            mtr --suite innodb --start-and-exit
            

            Build type debug.

            nobiplusplus Si Liu added a comment - edited

            @sciascid Thank you for answering @marko's question on setting up the reproduction!

            Indeed, your interpretation of the Repeatable Read anomaly below makes sense!

            "It read row 0 a bunch of times. It always read value 62636, except for one time where it read 72307.
            Since the transaction did not write row 0 in between those reads, it seems that RR was indeed violated."

            zoooesong Zoe Song added a comment -

            @Daniele Sciascia Hi, did you run it on pure MariaDB, or Galera-MariaDB with Galera disabled?


            sciascid Daniele Sciascia added a comment -

            zoooesong Hi, there are no separate distributions for MariaDB and Galera-MariaDB. I reproduced it with MariaDB with Galera disabled.


            marko Marko Mäkelä added a comment -

            Thank you both. For a 10.6-based branch, I got the following output from galera.py -t 100 -c 10:

            10.6 f67d65e33100adfeae6f819e2ece422a5f07da30

            client_no: 1, server_no: 127.0.0.1
            client_no: 2, server_no: 127.0.0.1
            client_no: 0, server_no: 127.0.0.1
            client_no: 3, server_no: 127.0.0.1
            client_no: 4, server_no: 127.0.0.1
            client_no: 7, server_no: 127.0.0.1
            client_no: 6, server_no: 127.0.0.1
            client_no: 8, server_no: 127.0.0.1
            client_no: 5, server_no: 127.0.0.1
            client_no: 9, server_no: 127.0.0.1
            ./output/0/result_7server_0.txt is completed, contain error:  164
            ./output/0/result_1server_0.txt is completed, contain error:  148
            ./output/0/result_0server_0.txt is completed, contain error:  158
            ./output/0/result_8server_0.txt is completed, contain error:  174
            ./output/0/result_5server_0.txt is completed, contain error:  174
            ./output/0/result_6server_0.txt is completed, contain error:  172
            ./output/0/result_2server_0.txt is completed, contain error:  175
            ./output/0/result_4server_0.txt is completed, contain error:  181
            ./output/0/result_9server_0.txt is completed, contain error:  182
            ./output/0/result_3server_0.txt is completed, contain error:  187
            

            The script group_data.py did not report anything, and check_rr.py did not generate any file. I am not a Python expert, but it looks like the script would always output a line like the following because None is apparently not equal to the empty list:

            find RR violation in trace: 0
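
            Presumably the unpatched check was something like this (a hypothetical reconstruction, not the actual script):

            i = 0                     # trace index, as in check_repeatable_read(self, i)
            rr_violation = []         # initialised to an empty list in the unpatched script
            # ... e_txn dicts would be appended here whenever a violation is found ...
            if rr_violation != None:  # always True: a list, even an empty one, never equals None
                print("find RR violation in trace: " + str(i))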
            

            I tried to patch the script as follows, to get any violations reported directly to the standard output:

            diff --git a/check_rr.py b/check_rr.py
            index b1d5139..2348787 100644
            --- a/check_rr.py
            +++ b/check_rr.py
            @@ -39,7 +39,7 @@ class RRChecker:
                     }
             
                 def check_repeatable_read(self,i):
            -        rr_violation = []
            +        rr_violation = None
                     for var,r_t_set in self.r_nodes.items():
                         for r_tra_id in r_t_set:
                             bef_r = False
            @@ -49,10 +49,7 @@ class RRChecker:
                                     if bef_w == False and bef_r == True:
                                         if r_op['val'] != r_bef_op['val']:
                                             e_txn = {'var': r_op['var'], 'old_val': r_bef_op['val'], 'new_val': r_op['val'], 'tra_id': r_op['tra_id']}
            -                                rr_violation.append(e_txn)
            -                                file = open('output/'+str(i)+'/rr_violation.txt','a');
            -                                file.write('( var: ' + str(e_txn['var']) + ', old_val: ' + str(e_txn['old_val']) + ', new_val: ' + str(e_txn['new_val']) + ', tra_id: ' + str(e_txn['tra_id']) + ')\n')
            -                                file.close();
            +                                print('( var: ' + str(e_txn['var']) + ', old_val: ' + str(e_txn['old_val']) + ', new_val: ' + str(e_txn['new_val']) + ', tra_id: ' + str(e_txn['tra_id']) + ')\n')
                                     bef_r = True
                                     bef_w = False 
                                     r_bef_op = r_op
            

            I tested with both the default innodb_page_size=16k and the minimum innodb_page_size=4k.
            It currently seems to me that there is no problem with the revision that I tested. I see that this revision of 10.6 does not even include the fix of MDEV-28422, which I thought might have fixed this. I also tested 10.4 16cebed54065ad9e18953aa86d48f6007d53c2d3 (the same revision as sciascid), but was not able to reproduce the problem.

            To diagnose and fix this, I would need an rr record trace (https://rr-project.org/) of a server that generated an invalid run. Preferably the server should be built with cmake -DPLUGIN_PERFSCHEMA=NO -DWITH_DBUG_TRACE=OFF because conditional branches make rr much slower. The -DPLUGIN_PERFSCHEMA=NO is useful also for -DCMAKE_BUILD_TYPE=RelWithDebInfo. I tested with debug builds.

            zoooesong Zoe Song added a comment - edited

            @Marko Mäkelä Hi, sorry for the wrong if-condition on the empty list in the Python script. I created a new repo for MariaDB data generation here: mariadb_data_generation

            To set up MariaDB, I tried to build it with CMake. Unfortunately, it kept returning errors, so I used the method in this script to set up the MariaDB server.

            This week I tried MariaDB in two different ways, both with the following parameters:

            key_num = 20
            client_num = 10
            transactions_per_client = 100
            operations_per_transaction = 25
            running_times = 300
            

            And the results are as below:

            1. Set up MariaDB locally and used 127.0.0.1 to connect and generate workloads: after 300 runs, no violation was found.

            2. Set up the MariaDB server and client on two separate machines (both from EmuLab): after 300 runs, RR violations appeared in multiple runs.


            marko Marko Mäkelä added a comment -

            Sorry for the delay. For a 10.6 debug build, I was initially unable to reproduce any error. I used the scripts as is, just changing the hardcoded password:

            diff --git a/init.py b/init.py
            index 755c546..3c570dd 100644
            --- a/init.py
            +++ b/init.py
            @@ -8,7 +8,7 @@ import mariadb
             server = sys.argv[1]
             key = 20
             
            -connect = mariadb.connect(host=server, user="root",password="123456")
            +connect = mariadb.connect(host=server, user="root",password="")
             # Disable Auto-Commit
             connect.autocommit = False
             
            diff --git a/maria.py b/maria.py
            index 995c0fa..671e6c0 100644
            --- a/maria.py
            +++ b/maria.py
            @@ -240,7 +240,7 @@ def generate_opt(hist_file, trans_num):
             def run_ops(list_of_ops, client_no):
                 op_num = 0
                 result_ops = []
            -    connect = mariadb.connect(host=server, user="root",password="123456")
            +    connect = mariadb.connect(host=server, user="root",password="")
                 # Disable Auto-Commit
                 connect.autocommit = False
                 t_count = 0
            

            pip3 install mariadb
            python3 init.py 127.0.0.1
            bash run.sh 127.0.0.1 10
            

            I understand that the issue is rather sporadic. The timing characteristics between 10.4 and 10.6 could be rather different due to performance improvements. I reproduced this on a recent 10.4:

            10.4 c89e3b70a740f486db2c60270892ba2893bb5610

            …
            ./output/9/result_5.txt is completed, contain error:  277
            ./output/9/result_18.txt is completed, contain error:  276
            find RR violation in trace: 6
            find RR violation in trace: 8
            

            My build options included the following, which may be of importance (at least it will make the test complete much faster than with a debug build):

            cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DPLUGIN_PERFSCHEMA=NO …
            

            Finally, I also reproduced this for a 10.6 non-debug build (no failure with a debug build):

            10.6 5bb90cb2ac7b93f40c8ca4672ec7e2c18f0ea238

            …
            ./output/9/result_6.txt is completed, contain error:  274
            ./output/9/result_15.txt is completed, contain error:  266
            ./output/9/result_8.txt is completed, contain error:  279
            Traceback (most recent call last):
              File "/dev/shm/mariadb_data_generation/group_data.py", line 45, in <module>
                op_dict = get_op(op)
              File "/dev/shm/mariadb_data_generation/group_data.py", line 30, in get_op
                'client_id': int(arr[2]),
            ValueError: invalid literal for int() with base 10: ' new_val: 122526'
            find RR violation in trace: 8
            

            I did not double-check the validity of these results. I did not even check whether the parsing error that triggered the Python stack trace would cause a bogus report of an RR violation. For the 10.4 run, no such stack trace was shown.

            vlad.lesin, can you please try to generate and debug an rr record trace of the server where the failure is repeated?


            vlad.lesin Vladislav Lesin added a comment -

            The test case MDEV-27927-test.diff shows that the issue exists in both 10.4 and 10.6.

            The problem is that despite the following comment:

            dberr_t row_search_mvcc(...) {                                                  
            ...                                                                             
                                    switch (row_sel_try_search_shortcut_for_mysql(          
                                                    &rec, prebuilt, &offsets, &heap,        
                                                    &mtr)) {                                
                                    case SEL_FOUND:                                         
                                            /* At this point, rec is protected by           
                                            a page latch that was acquired by               
                                            row_sel_try_search_shortcut_for_mysql().        
                                            The latch will not be released until                                  
                                            mtr.commit(). */                                
            ...                                                                             
            }
            

            row_sel_try_search_shortcut_for_mysql() does not protect the record's page with a page latch after the record has been found in the table's adaptive hash index (AHI).

            Useful comment from Marko:
            "As far as I understand, updates of the AHI should be protected by a combination of a page latch (S, U or X) and the AHI partition latch. Anything that would remove something from the page (so that the AHI could point to something non-existent on the page) must drop the AHI entries first. This is why an AHI lookup should be OK without holding a page latch. But, as soon as we release the AHI latch, the lookup result could be invalid unless we were holding a page latch."

            So we need to acquire the page latch before releasing the AHI latch in row_sel_try_search_shortcut_for_mysql(). The question that remains unanswered to me is why the comment describes the latching order of row_sel_try_search_shortcut_for_mysql() while the function itself does not match that comment. Was the page latch removed intentionally for some reason?
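
            The required ordering is essentially latch coupling: the latch that validates the lookup result (the AHI latch) must be held until the latch that protects the target (the page latch) has been acquired. A minimal sketch of the pattern, using Python locks purely as an illustration (this is not InnoDB code):

            import threading

            ahi_latch = threading.Lock()    # stands in for an AHI partition latch
            page_latch = threading.Lock()   # stands in for the latch of one index page
            ahi = {}                        # "hash index": key -> record position on a page
            page = {}                       # "page": record position -> record contents

            def lookup_unsafe(key):
                with ahi_latch:
                    pos = ahi.get(key)
                # BUG: ahi_latch is already released, so before page_latch is acquired
                # the record at 'pos' may be purged, moved or updated concurrently.
                with page_latch:
                    return page.get(pos)

            def lookup_safe(key):
                ahi_latch.acquire()
                try:
                    pos = ahi.get(key)
                    if pos is None:
                        return None
                    # Acquire the page latch while still holding ahi_latch (a try-lock
                    # would also do, treating failure as "AHI lookup failed"); only then
                    # is it safe to drop ahi_latch and read the record.
                    page_latch.acquire()
                finally:
                    ahi_latch.release()
                try:
                    return page.get(pos)
                finally:
                    page_latch.release()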

            If somebody is affected, the workaround is to disable the adaptive hash index (SET GLOBAL innodb_adaptive_hash_index=OFF) until the bug is fixed.


            marko Marko Mäkelä added a comment -

            vlad.lesin, thank you for the analysis. I understood that btr_search_guess_on_hash() would only acquire an index page latch if it is invoked with ahi_latch=NULL. A possible fix would be that btr_search_guess_on_hash(…, ahi_latch=NULL, …) would only buffer-fix the page, and it would always be the caller’s responsibility to acquire the page latch after releasing the ahi_latch. Also, a "trylock" acquisition is possible while holding the ahi_latch, if a failure to acquire the page latch is treated as "AHI lookup failed".

            The adaptive hash index is completely indifferent to any non-key fields or to the delete-mark flag. In the clustered index, any fields starting with DB_TRX_ID are not part of the key, so the AHI does not have to be invalidated if they are updated (in place). An MVCC read will have to read DB_TRX_ID and possibly DB_ROLL_PTR, and it is only safe to do that while holding a latch on the index page.


            vlad.lesin Vladislav Lesin added a comment -

            The ahi_latch argument of btr_cur_search_to_nth_level_func() and btr_pcur_open_with_no_init_func() is used only for row_sel_try_search_shortcut_for_mysql() (10.4). btr_cur_search_to_nth_level_func(..., ahi_latch != 0, ...) is invoked only from btr_pcur_open_with_no_init_func(..., ahi_latch != 0, ...), which, in turn, is invoked only from row_sel_try_search_shortcut_for_mysql().

            I suppose that the separate case with ahi_latch != 0 was intentionally implemented to protect the row_sel_store_mysql_rec() call in row_search_mvcc() just after the row_sel_try_search_shortcut_for_mysql() call. After the ahi_latch was moved from row_search_mvcc() to row_sel_try_search_shortcut_for_mysql(), there is no need for it at all if btr_search_guess_on_hash() invokes buf_page_get_known_nowait() unconditionally. And if btr_search_guess_on_hash() has latched the page, any access to the record in row_sel_try_search_shortcut_for_mysql() after the btr_pcur_open_with_no_init() call will be protected by the page latch.

            So the fix I propose is simply to remove the ahi_latch argument from btr_pcur_open_with_no_init_func(), btr_cur_search_to_nth_level_func() and btr_search_guess_on_hash(). The test I provided will hang after the fix, because the sync point will be reached while the SELECT holds the page latch, and the UPDATE will hang trying to latch the page. So there will not be a test for the fix; at least I don't see a way to develop one.


            marko Marko Mäkelä added a comment -

            Thank you vlad.lesin, your analysis is plausible and your suggestion is reasonable. Let us remove the ahi_latch argument.

            axel Axel Schwenke added a comment -

            Attached benchmark results: MDEV-27927.pdf

            The good news: there are no regressions, neither in the 10.4 branch (red vs. green) nor in the 10.6 branch (pink vs. light blue). But I also tested 10.6 with the AHI turned off (the default), and it was much better for batched UPDATE workloads.


            vlad.lesin Vladislav Lesin added a comment -

            Two bugs were found during testing of the 10.6 version of the fix with RQG (branch bb-10.6-MDEV-27927-RR-anomaly): MDEV-29635 and MDEV-29622. But they are not related to this fix, so I would say the testing looks good.


            vlad.lesin Vladislav Lesin added a comment -

            Some update. Pushed bb-10.[36]-MDEV-27927-RR-anomaly for final testing and code review.


            marko Marko Mäkelä added a comment -

            Thank you, the 10.3 version looks good to me. I think that the debug assertion that you would adjust in buf_page_get_known_nowait() can be safely removed altogether, and the function signature can remain unchanged. In MDEV-19514, the assertion was removed when the function was replaced with equivalent logic. The ‘freed’ status of a file page should not change while a thread is holding a buffer-fix or a latch on the page.


            People

              vlad.lesin Vladislav Lesin
              nobiplusplus Si Liu