[MDEV-28776] rpl.rpl_mark_optimize_tbl_ddl fails with timeout on sync_with_master

Created: 2022-06-08  Updated: 2023-10-18  Resolved: 2023-08-15
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication, Tests |
| Affects Version/s: | 10.2, 10.3, 10.5, 10.6, 10.7, 10.8, 10.9 |
| Fix Version/s: | 10.4.32, 10.5.23, 10.6.16, 10.9.8, 10.10.7, 10.11.6, 11.0.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Angelique Sklavounos (Inactive) | Assignee: | Kristian Nielsen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: |
| Issue Links: |
| Description |

https://buildbot.mariadb.org/#/builders/192/builds/10092

The test occasionally fails with:
| Comments |
| Comment by Sergei Golubchik [ 2022-06-18 ] |

Fails quite often in 10.10.
| Comment by Brandon Nesterenko [ 2022-06-21 ] |

Initial problem analysis shows this to be a deadlock on the `gtid_slave_pos` table. When the IO thread starts up, `rpl_global_gtid_slave_state->load()` (called during `get_master_version_and_clock()`) first takes an MDL lock on the table and then tries to take the table locks. Meanwhile, when an SQL executor thread executes `ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB`, it first takes the table lock and then tries to take the MDL lock. If the IO thread holds the MDL lock while one of the SQL executor threads holds the table lock, the outcome is a deadlock and the test times out. This looks more prevalent in 10.10 because the online-alter work may have widened the deadlock window. For reference, the relevant thread backtraces are pasted below:
| Comment by Kristian Nielsen [ 2023-07-01 ] |

Looks like this bug was fixed as part of

So is it ok if I close this bug?

- Kristian.
| Comment by Brandon Nesterenko [ 2023-07-02 ] |

Hi knielsen!

Another observation (which aligns with
| Comment by Kristian Nielsen [ 2023-07-02 ] |

Right, that makes sense.

One is somewhat worrying; the slave fails with this error in the log:

2023-06-08 7:31:36 11 [ERROR] Slave worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable.

This is from this recent build: https://buildbot.mariadb.org/#/builders/318/builds/12630

Happens in mixed and statement mode, as you say. The test does INSERT ... SELECT on an InnoDB table without a primary key; I wonder if that uses auto-increment behind the scenes and could be related to

The other kind of failure happens exclusively on FreeBSD, and seems not restricted to this particular test case. The IO thread stops, with this error from the master:

Last_IO_Error Got fatal error 1236 from master when reading data from binary log: 'bogus data in log event; the first event 'master-bin.000001' at 614, the last event read from 'master-bin.000002' at 343, the last byte read from 'master-bin.000002' at 362.'

Apparently there is some problem on the master with reading the binlog on FreeBSD? Maybe a missing check for a short read that only occurs on FreeBSD, or something similar; it would be good to check (how does one access the FreeBSD builder host?).

I actually see a third kind of failure, only in my local tests on a slightly modified rpl_mark_optimize_tbl_ddl.test; this warning is seen in the error log:

2023-07-02 14:26:09 32102 [Warning] Slave: Out of memory during slave state maintenance. Some no longer necessary rows in table mysql.gtid_slave_pos may be left undeleted.

I doubt this is an actual out-of-memory condition; I'm trying to track down whether it could be insufficient locking on the hash being updated. A bit tricky, as it happens only around 1 in 10,000 runs (seen twice so far).
| Comment by Brandon Nesterenko [ 2023-07-02 ] |

To the first failure: I had a similar thought about a deadlock happening 10 times in a row, so locally I set the retry count to 1 to see if I could find a single deadlock, and never got a failure locally at all (though admittedly I don't remember how many repetitions I actually tried). Another thought (which I never got around to testing) was that something may put the server into a somewhat broken state that would cause any number of retries to fail, no matter what. It seems you have a test case already for

The FreeBSD failures are likely contained within another ticket. I haven't seen your "Out of memory during slave state maintenance" failure, though I'm interested to see what you dig up about it.
| Comment by Kristian Nielsen [ 2023-07-02 ] |

I found the cause of the "third kind of failure", filed as
| Comment by Kristian Nielsen [ 2023-07-04 ] |

Running with --slave-transaction-retries=1 is a good idea. I got a failure this way after 34 iterations (--mem --parallel=3 --retry=10000).

But I'm not sure how this could happen 10 times in a row. Each time, InnoDB will roll back the deadlock victim and release its locks. If the victim is ever the second transaction, the retry will wait for the first transaction to commit, preventing further InnoDB deadlocks. So the first transaction would have to restart and take some lock, after which the second transaction has to take another lock and cause a deadlock. How this can happen 10 times to the second transaction is a mystery; this is a table with just one row.

So this suggests that your thought of "a broken state that would cause any number of retries to fail" could be the explanation, though I have no concrete idea what it could be...
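The retry semantics being reasoned about in the comment above can be modeled with a small sketch (hypothetical names, not server code): after each deadlock rollback the worker retries the whole transaction, and it gives up after slave_transaction_retries attempts with the error quoted earlier in this thread:

```python
class Deadlock(Exception):
    """Stand-in for an InnoDB deadlock error (ER_LOCK_DEADLOCK)."""

def run_with_retries(txn, max_retries=10):
    # Simplified model of the worker retry loop: on deadlock, InnoDB rolls
    # back the victim and releases its locks; the worker then re-executes
    # the whole transaction, up to slave_transaction_retries times.
    for attempt in range(1, max_retries + 1):
        try:
            return txn(attempt)
        except Deadlock:
            continue  # victim rolled back, locks released; retry
    raise RuntimeError(
        f"Slave worker thread retried transaction {max_retries} time(s) "
        "in vain, giving up.")

# A transaction that is picked as the deadlock victim twice, then succeeds:
def flaky_txn(attempt):
    if attempt <= 2:
        raise Deadlock()
    return f"committed on attempt {attempt}"

print(run_with_retries(flaky_txn))  # committed on attempt 3
```

In this model a transient deadlock always resolves well within the budget; only a condition that makes the same transaction lose every single retry can exhaust all 10 attempts, which is the mystery discussed above.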
| Comment by Kristian Nielsen [ 2023-07-04 ] |

I actually found something. When the VATS scheduling was put into InnoDB, the call to thd_deadlock_victim_preference(), which selects the correct victim when two parallel replication worker threads deadlock against each other, was erroneously removed. This causes unnecessary extra deadlocks and retries. This should be fixed; I'll probably file another MDEV for it.

However, I still do not understand how it is possible to get 10 unsuccessful retries in this test; that needs to be understood.
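A rough model of what thd_deadlock_victim_preference() is meant to achieve (illustrative Python; the real function lives in the server's C++ code and its exact signature differs): parallel replication workers must commit in a fixed order, so the transaction later in commit order should be the victim, letting the earlier one proceed.

```python
def choose_deadlock_victim(trx_a, trx_b):
    # Illustrative model: if both transactions are parallel-replication
    # workers, kill the one that comes LATER in commit order. Killing the
    # earlier one is useless, since the later one must wait for it to
    # commit anyway and the deadlock would just recur.
    a_order = trx_a.get("commit_order")
    b_order = trx_b.get("commit_order")
    if a_order is not None and b_order is not None:
        return trx_a if a_order > b_order else trx_b
    # Otherwise fall back to a generic heuristic, e.g. the lighter
    # transaction (fewer locks/undo), as a lock scheduler might do.
    return min(trx_a, trx_b, key=lambda t: t.get("weight", 0))

t1 = {"gtid": "0-1-6", "commit_order": 6, "weight": 5}
t2 = {"gtid": "0-1-7", "commit_order": 7, "weight": 1}
print(choose_deadlock_victim(t1, t2)["gtid"])  # 0-1-7, the later transaction
```

Note that a generic weight-based heuristic alone would pick T2's opposite here; that is the kind of wrong choice the removed call was there to override.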
| Comment by Kristian Nielsen [ 2023-07-12 ] |

Still working on this, with (slow) progress.

- Kristian.
| Comment by Kristian Nielsen [ 2023-07-14 ] |

Ok, I believe I solved this. It's caused by

Here we clearly see that T1 (GTID 0-1-6) is repeatedly chosen as the deadlock victim (4 times in this case) until T2 (GTID 0-1-7) gets deadlock-killed, after which T1 completes ok. A similar printout from a failed run shows T1 being chosen as the victim a total of 10 times (the default of --slave-transaction-retries), causing the test to fail.

I'm still trying to see if I can come up with a testcase that doesn't require running 100k+ times to fail; we will see if I'm successful. But it looks like this bug is a duplicate of
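The failure pattern described above can be made concrete with a toy simulation (assumed behavior, not server code): with commit-order-aware victim selection T2 always loses and T1 commits at once; without it, T1 can keep losing until its retry budget is spent.

```python
import random

def simulate(retries=10, order_aware=False, seed=1):
    # Toy simulation of the printout's pattern: T1 (GTID 0-1-6) and T2
    # (GTID 0-1-7) repeatedly deadlock, and each round one of them is
    # rolled back and retried. order_aware=True models commit-order-aware
    # victim selection (T2, the later transaction, always loses); False
    # models an unaware scheduler that may pick either one.
    rng = random.Random(seed)
    for t1_attempts in range(1, retries + 1):
        victim = "T2" if order_aware else rng.choice(["T1", "T2"])
        if victim == "T2":
            return f"T1 committed after {t1_attempts} attempt(s)"
    return "T1 gave up: retried 10 time(s) in vain"

print(simulate(order_aware=True))  # T1 committed after 1 attempt(s)
```

With an unaware scheduler the all-retries-fail outcome is rare per run, which fits the observation that the test needed 100k+ repetitions to fail.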
| Comment by Kristian Nielsen [ 2023-07-16 ] |

I managed to create a (hacky, not suitable for pushing) testcase, attached to this issue.
| Comment by Kristian Nielsen [ 2023-08-15 ] |

Pushed to 10.4.