[MDEV-18464] Port kill_one_trx fixes from 10.4 to 10.1 Created: 2019-02-04 Updated: 2023-06-26 Resolved: 2020-03-11 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera, Storage Engine - InnoDB, Storage Engine - XtraDB |
| Affects Version/s: | None |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Jan Lindström (Inactive) | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Won't Do | Votes: | 2 |
| Labels: | Test_disabled, test_disabled_ES | ||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||
| Description |
|
There following issues here:
Whenever one of these actions is executed we will hold number of InnoDB internal mutexes and thd mutexes. Refs
|
| Comments |
| Comment by Jan Lindström (Inactive) [ 2019-02-05 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
No need to review Galera only parts, but could you please review kill and lock changes from below: https://github.com/MariaDB/server/commit/c889c8de46dfa1216a46656700e17ba5e1adfdc1 | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-03-18 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I posted some initial review comments. It is very hard to see what the actual change is, because the commit includes many unrelated changes. Can you please submit a minimal commit for review? While you are doing this, I would suggest replacing trx_t::abort_type with bool replication_abort. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2019-03-22 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
http://lists.askmonty.org/pipermail/commits/2019-March/013579.html | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-03-26 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Please address my review comments and push to buildbot for final validation. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-03-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I am afraid that this code is prone to race conditions:
We cannot safely read trx->victim without holding one of the mutexes. (This new field is documented to be protected by the two mutexes.) An easy fix ought to be to do lock_mutex_enter() before reading the field, but we cannot do that if we happen to hold the mutex already. It could be cleaner to pass a parameter to this function to tell that the mutexes have already been acquired. Here are a few other suggested fixes:
I think that the correct place to reset the flag would be in lock_trx_release_locks() when marking the transaction committed. The resetting in trx_rollback_to_savepoint_for_mysql_low() looks definitely wrong. Not only it is not holding either mutex, but it is also resetting the flag while performing a partial rollback (rolling back the current row operation, the current statement, or executing ROLLBACK TO SAVEPOINT). I think that this fix needs to be tested thoroughly, once it has passed my review. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-03-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
I reverted the change, because I believe that it could cause a regression for replication when Galera is not being used (in addition to suffering from some race conditions with Galera). I think that it would be cleanest to rename trx->lock.was_chosen_as_deadlock_victim to cover both replication and Galera, and to add a parameter to handlerton::kill_query() that informs innobase_kill_query() that the abort was initiated from InnoDB. This new parameter probably should not be a Boolean, but instead const handlerton*. InnoDB would pass it as innodb_hton when it is the one that initiated abort (and is holding the InnoDB mutexes). We should consider a multi-storage-engine scenario. A storage engine should not initiate asynchronous rollback for a transaction if it was already initiated by some other storage engine. Instead of adding a parameter to handlerton::kill_query(), it might also work to add a data member, something like const handlerton* THD::abort_initiator. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2019-06-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Replication issues should've been fixed in 10.2. I'd leave it up to the replication team to get this backported if needed.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2019-06-11 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
jplindst, I can suggest 3 options: Otherwise I wouldn't dare refactoring THD::awake() in 10.1. The only idea I have is to introduce THD::ref_count, which (I feel) is going to be quite intrusive. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2019-06-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Does this bug really affect MariaDB 10.4? Why is this now a Task and not a Bug? What are the affected versions? | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2019-06-19 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Affected versions are 10.1, 10.2, 10.3. Yes, this was originally "Port kill_one_trx fixes from 10.4 to 10.1". However, while I was doing that after first review I found more critical problem on same area of the code. It seems that 10.4 is not effected. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2019-06-20 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
10.4 contains dirty read to trx->lock.was_chosen_as_wsrep_victim variable suspect to race
Note that LOCK thd_data protects from two concurrent KILLs. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Jan Lindström (Inactive) [ 2020-01-16 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
https://github.com/MariaDB/server/commit/82c44f8298831ad167130741b6e48e7316ef8e47 This should fix issue on 10.2/10.3. 10.4 fix will have same idea. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Marko Mäkelä [ 2020-03-02 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Given that there already was a commit in 10.1.39, 10.2.24, 10.3.14, 10.4.4, 10.5.0 that references this ticket, I think that to avoid further confusion, a separate ticket should be filed for the remaining fix and the original title of this ticket should be restored. Furthermore, given that 10.4 introduced Galera 4 and that there are significant differences around trx_sys between 10.2 and 10.3, I think that for a proper review, we would need thoroughly tested fixes for all of 10.2, 10.3, 10.4. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Julien Fritsch [ 2020-12-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Found in disabled.def that it was disabled, also I'm adding the missing label. | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Julien Fritsch [ 2020-12-03 ] | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
Found in disabled.def for ES that it was disabled, also I'm adding the missing label. |