[MDEV-27850] MTR tests can hang due to DEBUG_SYNC race condition Created: 2022-02-15  Updated: 2024-01-26

Status: Stalled
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.2, 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.4, 10.5, 10.6, 10.11

Type: Bug
Priority: Major
Reporter: Brandon Nesterenko
Assignee: Brandon Nesterenko
Resolution: Unresolved
Votes: 0
Labels: None

Issue Links:
Relates
relates to MDEV-32651 Lost Debug_sync signal in rpl_sql_thd... Closed

 Description   

If a DEBUG_SYNC signal is overwritten before the target thread acknowledges it, the target thread never sees the signal and stays stuck waiting for it until the wait times out. The test rpl.rpl_seconds_behind_master_spike demonstrates this problem; see commit cdf19cd for an example fix.
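
The race can be illustrated with a small toy model. This is only a sketch and not MariaDB's actual DEBUG_SYNC code: the class name SingleSlotSignal, its methods, and the signal names are all hypothetical. It assumes a single shared slot that remembers only the most recently emitted signal, which is the limitation described above.

```python
import threading


class SingleSlotSignal:
    """Toy model (hypothetical): one shared slot holds the latest signal name."""

    def __init__(self):
        self._cond = threading.Condition()
        self._slot = ""  # only the most recent signal is remembered

    def signal(self, name):
        with self._cond:
            self._slot = name  # overwrites any earlier signal nobody has seen yet
            self._cond.notify_all()

    def wait_for(self, name, timeout):
        """Return True if `name` is in the slot before `timeout` seconds elapse."""
        with self._cond:
            return self._cond.wait_for(lambda: self._slot == name, timeout)


def demo_lost_signal():
    sync = SingleSlotSignal()
    sync.signal("first")
    sync.signal("second")  # emitted before anyone waited for "first"
    got_first = sync.wait_for("first", timeout=0.05)    # times out: signal was lost
    got_second = sync.wait_for("second", timeout=0.05)  # still visible in the slot
    return got_first, got_second
```

In this model the waiter for "first" can only give up at its timeout, which mirrors how an affected test appears to hang.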

Other tests which may be impacted by this issue are rpl.rpl_dump_request_retry_warning, main.query_cache_debug, and main.partition_debug_sync. A comprehensive list of affected tests should be compiled, and those tests should then be fixed.

Edit:
The following is an (ongoing) list of tests which are potentially impacted by this race condition; tests that have been fixed are marked "(fixed)". Note that part of this work extended the debug_sync mechanism to automatically detect when an unacknowledged signal is overwritten or reset, and this list contains all tests which fail under that detection:

  • innodb.innodb-table-online
  • innodb.innodb-index-online
  • binlog_encryption.rpl_parallel
  • binlog_encryption.rpl_parallel_ignored_errors
  • rpl.rpl_get_master_version_and_clock
  • rpl.rpl_parallel
  • rpl.rpl_parallel_ignored_errors
  • rpl.kill_race_condition
  • rpl.rpl_seconds_behind_master_spike (fixed)
  • rpl.rpl_dump_request_retry_warning (fixed)
  • main.query_cache_debug (fixed)
  • main.partition_debug_sync (fixed)
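
The detection of overwritten or reset signals mentioned above can be sketched as simple bookkeeping. This is a hypothetical single-threaded model, not the actual debug_sync extension: the class DetectingSignalSlot, its methods, and the `lost` list are assumptions made for illustration only.

```python
class DetectingSignalSlot:
    """Hypothetical model: record signals that are overwritten or reset
    before any waiter has acknowledged them."""

    def __init__(self):
        self._slot = None          # most recent signal name, if any
        self._acknowledged = True  # nothing pending at start
        self.lost = []             # signals dropped while unacknowledged

    def _note_if_unacknowledged(self):
        if self._slot is not None and not self._acknowledged:
            self.lost.append(self._slot)

    def signal(self, name):
        self._note_if_unacknowledged()  # overwriting a pending signal
        self._slot = name
        self._acknowledged = False

    def acknowledge(self, name):
        """A waiter consumes `name`; return True if it was the pending signal."""
        if self._slot == name:
            self._acknowledged = True
            return True
        return False

    def reset(self):
        self._note_if_unacknowledged()  # resetting a pending signal
        self._slot = None
        self._acknowledged = True
```

A test harness built on this idea could fail a test whenever `lost` is non-empty, rather than waiting for a timeout to reveal the missed signal.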


 Comments   
Comment by Brandon Nesterenko [ 2022-02-22 ]

Hey Andrei!

Can you review my patch for fixing tests main.query_cache_debug, main.partition_debug_sync, and rpl.rpl_dump_request_retry_warning?

Commit: 883fe83

Buildbot: bb-10.2-MDEV-27850

Thanks!

Comment by Andrei Elkin [ 2022-02-25 ]

The test changes look good. Let's address DEBUG_SYNC's single-signal limitation as a follow-up.
