[MDEV-29575] Access to innodb_trx, innodb_locks and innodb_lock_waits along with detached XA's can cause SIGSEGV Created: 2022-09-20 Updated: 2022-11-21 Resolved: 2022-10-04 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Information Schema, Storage Engine - InnoDB, XA |
| Affects Version/s: | 10.4, 10.5, 10.6, 10.7, 10.8, 10.10, 10.11 |
| Fix Version/s: | 10.3.37, 10.4.27, 10.5.18, 10.6.11, 10.7.7, 10.8.6, 10.9.4, 10.10.2, 10.11.1 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Vladislav Lesin | Assignee: | Vladislav Lesin |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Description |
|
trx->mysql_thd can be zeroed-out between thd_get_thread_id() and thd_query_safe() calls in fill_trx_row(). trx_disconnect_prepared() zeroes out trx->mysql_thd. And this can cause null pointer dereferencing in fill_trx_row(). The bug can be reproduced with the following new sync point:
and the following test case:
It does not affect 10.3 as 10.3 does not detach XA on disconnection (compare THD::cleanup() in 10.3 and 10.4+ and see trans_xa_detach() in 10.4+ for details). Until |
| Comments |
| Comment by Vladislav Lesin [ 2022-09-20 ] | |||||||||
|
Useful comment from marko: A possible fix might be to acquire THD::LOCK_thd_data in a safe way, and to ensure that a disconnect would be blocked it. Currently it is not the case; see The function fill_trx_row() does check if trx_t::mysql_thd is a null pointer, but race conditions after that point are possible. Holding exclusive lock_sys.latch (or before 10.6, lock_sys.mutex) will block lock_release() during trx_t::release_locks() but not the transaction state change. When it comes to race conditions with innobase_close_connection(), there is no way to synchronize with trx_disconnect_prepared(). The innobase_rollback_trx() would be blocked by lock_sys.latch and trx_t::mutex. | |||||||||
| Comment by Vladislav Lesin [ 2022-09-30 ] | |||||||||
|
10.6 code analyses: fill_trx_row() is invoked from fetch_data_into_cache(), which, in turns, iterates transactions with trx_sys.trx_list.for_each() function, which holds trx_sys.trx_list.mutex during the iteration. At the other hand, innobase_close_connection() invokes trx->free(), which removes transaction from transactions container trx_sys.deregister_trx(this), which acquires trx_sys.trx_list.mutex too. And after this trx_t::free() zeroes out trx->mysql_thd. That's why we don't catch the bug with usual disconnection. trx_disconnect_prepared() is invoked only from innobase_close_connection(). I think trx_disconnect_prepared() must zero out trx->mysql_thd under trx_sys.trx_list.mutex(), copying the above logic of trx_t::free(). thread_safe_trx_ilist_t::freeze()/unfreeze() functions could be used for that. The same is true for 10.4. See fetch_data_into_cache() for details. | |||||||||
| Comment by Vladislav Lesin [ 2022-10-03 ] | |||||||||
|
The bug does not affect 10.3, because the following commit:
exists only for 10.4. And trx_disconnect_prepared() is just not invoked with the above test in 10.3. But we will also fix 10.3 to be sure, it will not be broken, if somebody decide to backport the above commit to 10.3. | |||||||||
| Comment by Vladislav Lesin [ 2022-10-03 ] | |||||||||
|
The bug is caused by the following commit:
Because before the fix trx_disconnect_from_mysql() contained the critical section. But after the fix, that function was replaced with trx_disconnect_prepared(), which does not contain the critical section. |