[MDEV-23536] Race condition between KILL and transaction commit - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Fixed
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6
Fix Version/s: 10.2.37, 10.3.28, 10.4.18, 10.5.9, 10.6.0
Component/s: Locking, Server
Labels:
- ASAN
- race

Description

A race condition may occur between the execution of transaction commit, and an execution of a KILL statement that would attempt to abort that transaction.

~~MDEV-17092~~ worked around this race condition by modifying InnoDB code. After that issue was closed, svoj pointed out that this race condition would better be fixed above the storage engine layer:

If you look carefully into the above, you can conclude that thd->free_connection() can be called concurrently with KILL/thd->awake(). Which is the bug. And it is partially fixed in THD::~THD(), that is destructor waits for KILL completion:

He is quoting this code of THD::~THD() in 10.5:

/*

    Other threads may have a lock on LOCK_thd_kill to ensure that this

    THD is not deleted while they access it. The following mutex_lock

    ensures that no one else is using this THD and it's now safe to delete

*/

  if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data);

  mysql_mutex_lock(&LOCK_thd_kill);

  mysql_mutex_unlock(&LOCK_thd_kill);

  if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data);

  if (!free_connection_done)

    free_connection();

And he seems to suggest that the empty critical section should be moved to THD::free_connection(). Note: in 10.2 and 10.3, that code is slightly different:

  mysql_mutex_lock(&LOCK_thd_kill);

  mysql_mutex_unlock(&LOCK_thd_kill);

#ifdef WITH_WSREP

  delete wsrep_rgi;

#endif

  if (!free_connection_done)

    free_connection();

Nevertheless, it seems that we might want to do something like

#if MYSQL_VERSION_ID >= 100400

  if (WSREP_NNULL(this)) mysql_mutex_lock(&LOCK_thd_data);

#endif

  if (!free_connection_done)

    free_connection();

  else

    mysql_mutex_lock(&LOCK_thd_kill);

    mysql_mutex_unlock(&LOCK_thd_kill);

#if MYSQL_VERSION_ID >= 100400

  if (WSREP_NNULL(this)) mysql_mutex_unlock(&LOCK_thd_data);

#elif defined WITH_WSREP

  delete wsrep_rgi;

#endif

It might turn out that the else branch is not needed. The empty lock/unlock pair would of course be added to THD::free_connection().

It might also turn out that all the Galera-specific changes need to be done in THD::free_connection(). (In that case, we would likely want to assign wsrep_rgi= NULL).

As part of this fix, the trx_t::free() instrumentation that was modified in ~~MDEV-22782~~ should be tightened: trx_t::mysql_thd and trx_t::state must be poisoned, because innobase_kill_connection() should no longer be invoked on a freed transaction of a freed connection. This should of course be validated with an RQG run similar to the one that reproduced ~~MDEV-17092~~.

Attachments

Issue Links

blocks

MDEV-23328 Server hang due to Galera lock conflict resolution

Closed

relates to

MDEV-24624 main.lock_tables_lost_commit failed in buildbot with timeout, Trying to lock mutex at sql/sql_base.cc, line 958, when the mutex was already locked

Closed

MDEV-26014 Race condition between KILL and client disconnect

Closed

MDEV-17092 ERROR: AddressSanitizer: use-after-poison around lock_trx_handle_wait_low

Closed

MDEV-22782 SUMMARY: AddressSanitizer: unknown-crash storage/innobase/trx/trx0trx.cc:566 in trx_t::commit_state()

Closed

MDEV-23328 Server hang due to Galera lock conflict resolution

Closed

MDEV-24561 rpl.rpl_semi_sync_uninstall_plugin fails in bb with "Found wrong usage of mutex 'LOCK_thd_kill' and 'LOCK_show_status'"

Closed

MDEV-29368 Assertion `trx->mysql_thd == thd' failed in innobase_kill_query from process_timers/timer_handler and use-after-poison in innobase_kill_query

Closed

(3 relates to)

Activity

Ascending order - Click to sort in descending order

View 13 older comments

Jan Lindström (Inactive) added a comment - 2021-01-08 11:46

To help merge 10.4: https://github.com/MariaDB/server/commit/545cdf433146782c088062ba3a15d50eb3ddcfdb

Jan Lindström (Inactive) added a comment - 2021-01-08 11:46 To help merge 10.4: https://github.com/MariaDB/server/commit/545cdf433146782c088062ba3a15d50eb3ddcfdb

Matthias Leich added a comment - 2021-01-08 14:37 - edited

The RQG testing with the replay test on bb-10.3-MDEV-23536 + the os_thread_sleep patch showed two new issues.

pluto:/data/Results/1610048719/000148.log

 mysqld: sql/sql_table.cc:2428: int mysql_rm_table_no_locks(THD*, TABLE_LIST*, bool, bool, bool, bool, bool, bool): Assertion `thd->mdl_context.is_lock_owner(MDL_key::TABLE, table->db.str, table->table_name.str, MDL_SHARED)' failed.

pluto:/data/Results/1610048719/002598.log  (there are more of such results)

  # 2021-01-07T20:52:16 [2879386] ==2926968==LeakSanitizer has encountered a fatal error.

 # 2021-01-07T20:52:16 [2879386] |in/mysqld(_ZN8I_P_ListI10MDL_ticket16I_P_List_adapterIS0_XadL_ZNS0_12next_in_lockEEEXadL_ZNS0_12prev_in_lockEEEE21I_P_List_null_counter23I_P_List_fast_push_backIS0_EE6removeEPS0_+0x140)[0x561e0c85aa4c]

 # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN8MDL_lock11Ticket_list13remove_ticketEP10MDL_ticket+0x27)[0x561e0c85144f]

# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN8MDL_lock13remove_ticketEP7LF_PINSMS_NS_11Ticket_listEP10MDL_ticket+0x54)[0x561e0c851e8a]

# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context12release_lockE17enum_mdl_durationP10MDL_ticket+0x23b)[0x561e0c856787]

# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context27release_locks_stored_beforeE17enum_mdl_durationP10MDL_ticket+0x208)[0x561e0c856bce]

# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context27release_transactional_locksEv+0x1f0)[0x561e0c857a6c]

# 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN3THD27release_transactional_locksEv+0x72)[0x561e0c274476]

# 2021-01-07T20:52:16 [2879386] | [bin/mysqld(_Z21mysql_execute_commandP3THD+0x14e8c)[0x561e0c48de3b]

# 2021-01-07T20:52:16 [2879386] | sql/sql_plist.h:129(I_P_List<MDL_ticket, I_P_List_adapter<MDL_ticket, &MDL_ticket::next_in_lock, &MDL_ticket::prev_in_lock>, I_P_List_null_counter, I_P_List_fast_push_back<MDL_ticket> >::remove(MDL_ticket*))[0x561e0c499674]

# 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:1232(MDL_lock::Ticket_list::remove_ticket(MDL_ticket*))[0x561e0c4704b8]

# 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:2632(MDL_context::release_lock(enum_mdl_duration, MDL_ticket*))[0x561e0c46cfef]

# 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:2853(MDL_context::release_transactional_locks())[0x561e0c837872]

# 2021-01-07T20:52:16 [2879386] | /Server_bin/bb-10.3-MDEV-23536_asan/bin/mysqld(handle_one_connection+0x59)[0x561e0c83712c]

...

  during DROP TABLE IF EXISTS ...

I tried the replay test on actual 10.3 + the os_thread_sleep patch too.

Issues like above did not show up. But there was a significant higher fraction of replay tests failing.

They all showed some ASAN failure during KILL CONNECTION.

Given the fact that

- a higher fraction of replay tests failed on actual 10.3

- bb-10.3-MDEV-23536 performed well in the test battery for broad range coverage of functionality

bb-10.3-MDEV-23536 seems to be better than the actual 10.3.

Maybe the ASAN failure observed on actual 10.3 just prevents to hit the two new issues observed

on bb-10.3-MDEV-23536.

RQG testing on bb-10.6-MDEV-23536 + the os_thread_sleep patch

All replay tests passed.

bb-10.6-MDEV-23536 also performed well in the test battery for broad range coverage of functionality.

I vote for merge+push.

Matthias Leich added a comment - 2021-01-08 14:37 - edited The RQG testing with the replay test on bb-10.3-MDEV-23536 + the os_thread_sleep patch showed two new issues. pluto:/data/Results/1610048719/000148.log mysqld: sql/sql_table.cc:2428: int mysql_rm_table_no_locks(THD*, TABLE_LIST*, bool, bool, bool, bool, bool, bool): Assertion `thd->mdl_context.is_lock_owner(MDL_key::TABLE, table->db.str, table->table_name.str, MDL_SHARED)' failed. pluto:/data/Results/1610048719/002598.log (there are more of such results) # 2021-01-07T20:52:16 [2879386] ==2926968==LeakSanitizer has encountered a fatal error. # 2021-01-07T20:52:16 [2879386] |in/mysqld(_ZN8I_P_ListI10MDL_ticket16I_P_List_adapterIS0_XadL_ZNS0_12next_in_lockEEEXadL_ZNS0_12prev_in_lockEEEE21I_P_List_null_counter23I_P_List_fast_push_backIS0_EE6removeEPS0_+0x140)[0x561e0c85aa4c] # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN8MDL_lock11Ticket_list13remove_ticketEP10MDL_ticket+0x27)[0x561e0c85144f] # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN8MDL_lock13remove_ticketEP7LF_PINSMS_NS_11Ticket_listEP10MDL_ticket+0x54)[0x561e0c851e8a] # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context12release_lockE17enum_mdl_durationP10MDL_ticket+0x23b)[0x561e0c856787] # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context27release_locks_stored_beforeE17enum_mdl_durationP10MDL_ticket+0x208)[0x561e0c856bce] # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN11MDL_context27release_transactional_locksEv+0x1f0)[0x561e0c857a6c] # 2021-01-07T20:52:16 [2879386] | bin/mysqld(_ZN3THD27release_transactional_locksEv+0x72)[0x561e0c274476] # 2021-01-07T20:52:16 [2879386] | [bin/mysqld(_Z21mysql_execute_commandP3THD+0x14e8c)[0x561e0c48de3b] # 2021-01-07T20:52:16 [2879386] | sql/sql_plist.h:129(I_P_List<MDL_ticket, I_P_List_adapter<MDL_ticket, &MDL_ticket::next_in_lock, &MDL_ticket::prev_in_lock>, I_P_List_null_counter, I_P_List_fast_push_back<MDL_ticket> >::remove(MDL_ticket*))[0x561e0c499674] # 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:1232(MDL_lock::Ticket_list::remove_ticket(MDL_ticket*))[0x561e0c4704b8] # 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:2632(MDL_context::release_lock(enum_mdl_duration, MDL_ticket*))[0x561e0c46cfef] # 2021-01-07T20:52:16 [2879386] | sql/mdl.cc:2853(MDL_context::release_transactional_locks())[0x561e0c837872] # 2021-01-07T20:52:16 [2879386] | /Server_bin/bb-10.3-MDEV-23536_asan/bin/mysqld(handle_one_connection+0x59)[0x561e0c83712c] ... during DROP TABLE IF EXISTS ... I tried the replay test on actual 10.3 + the os_thread_sleep patch too. Issues like above did not show up. But there was a significant higher fraction of replay tests failing. They all showed some ASAN failure during KILL CONNECTION. Given the fact that - a higher fraction of replay tests failed on actual 10.3 - bb-10.3-MDEV-23536 performed well in the test battery for broad range coverage of functionality bb-10.3-MDEV-23536 seems to be better than the actual 10.3. Maybe the ASAN failure observed on actual 10.3 just prevents to hit the two new issues observed on bb-10.3-MDEV-23536. RQG testing on bb-10.6-MDEV-23536 + the os_thread_sleep patch All replay tests passed. bb-10.6-MDEV-23536 also performed well in the test battery for broad range coverage of functionality. I vote for merge+push.

Roel Van de Paar added a comment - 2021-01-12 03:26 - edited

I also did a run against this branch. Here are the things I am seeing, let me know if anything looks related and I can reduce:

One line per issue seen. First field (| separated) can be assert message, then signal, then first 4 frames.

!table->fts|SIGABRT|dict_make_room_in_cache|srv_master_evict_from_table_cache|srv_master_do_idle_tasks|srv_master_callback

[REF_NEXT_COMMENT] SIGSEGV|my_scan_weight_utf8mb3_general_ci|my_strnncollsp_utf8mb3_general_ci|charset_info_st::strnncollsp|sortcmp_lex_string

SIGSEGV|my_strndup|Event_queue_element_for_exec::init|Event_queue::get_top_for_execution_if_time|Event_scheduler::run

[REF_NEXT_COMMENT] binlog_table_maps == 0 || locked_tables_mode == LTM_LOCK_TABLES|SIGABRT|THD::reset_for_next_command|mysql_parse|dispatch_command|do_command

[REF_NEXT_COMMENT] thd->transaction->stmt.is_empty()|SIGABRT|trans_check|trans_rollback|THD::cleanup|unlink_thd

_ZN13MYSQL_BIN_LOG7cleanupEv   (limited stack info, error log only)

I've cleaned up this list to exclude all bugs seen in normal 6.0 runs.

Roel Van de Paar added a comment - 2021-01-12 03:26 - edited I also did a run against this branch. Here are the things I am seeing, let me know if anything looks related and I can reduce: One line per issue seen. First field ( | separated) can be assert message, then signal, then first 4 frames. !table->fts|SIGABRT|dict_make_room_in_cache|srv_master_evict_from_table_cache|srv_master_do_idle_tasks|srv_master_callback [REF_NEXT_COMMENT] SIGSEGV|my_scan_weight_utf8mb3_general_ci|my_strnncollsp_utf8mb3_general_ci|charset_info_st::strnncollsp|sortcmp_lex_string SIGSEGV|my_strndup|Event_queue_element_for_exec::init|Event_queue::get_top_for_execution_if_time|Event_scheduler::run [REF_NEXT_COMMENT] binlog_table_maps == 0 || locked_tables_mode == LTM_LOCK_TABLES|SIGABRT|THD::reset_for_next_command|mysql_parse|dispatch_command|do_command [REF_NEXT_COMMENT] thd->transaction->stmt.is_empty()|SIGABRT|trans_check|trans_rollback|THD::cleanup|unlink_thd _ZN13MYSQL_BIN_LOG7cleanupEv (limited stack info, error log only) I've cleaned up this list to exclude all bugs seen in normal 6.0 runs.

Marko Mäkelä added a comment - 2021-01-18 18:40

Roel, which exact branch did you test? The last 3 assertions look potentially related to this change; the first 3 assertion failures look unrelated.

As far as I can tell, serg pushed a revised fix to 10.2. It has not been merged to later branches yet.

Marko Mäkelä added a comment - 2021-01-18 18:40 Roel , which exact branch did you test? The last 3 assertions look potentially related to this change; the first 3 assertion failures look unrelated. As far as I can tell, serg pushed a revised fix to 10.2. It has not been merged to later branches yet.

Roel Van de Paar added a comment - 2021-01-21 05:27 - edited

marko Thank you!

bb-10.6-MDEV-23536 at rev 5b25f651a961e06d377f2f8ff78ac38dc40f3fb3

Further findings:

SIGSEGV|my_scan_weight_utf8mb3_general_ci|my_strnncollsp_utf8mb3_general_ci|charset_info_st::strnncollsp|sortcmp_lex_string

binlog_table_maps == 0 || locked_tables_mode == LTM_LOCK_TABLES|SIGABRT|THD::reset_for_next_command|mysql_parse|dispatch_command|do_command

thd->transaction->stmt.is_empty()|SIGABRT|trans_check|trans_rollback|THD::cleanup|unlink_thd

> Were found in normal 6.0 also: MDEV-22915, MDEV-22915, MDEV-24706

Roel Van de Paar added a comment - 2021-01-21 05:27 - edited marko Thank you! bb-10.6-MDEV-23536 at rev 5b25f651a961e06d377f2f8ff78ac38dc40f3fb3 Further findings: SIGSEGV|my_scan_weight_utf8mb3_general_ci|my_strnncollsp_utf8mb3_general_ci|charset_info_st::strnncollsp|sortcmp_lex_string binlog_table_maps == 0 || locked_tables_mode == LTM_LOCK_TABLES|SIGABRT|THD::reset_for_next_command|mysql_parse|dispatch_command|do_command thd->transaction->stmt.is_empty()|SIGABRT|trans_check|trans_rollback|THD::cleanup|unlink_thd > Were found in normal 6.0 also: MDEV-22915 , MDEV-22915 , MDEV-24706

MariaDB Server

Race condition between KILL and transaction commit

Details

Description

Attachments

Issue Links

Activity

People

Dates

Git Integration