[MDEV-16678] Use MDL for innodb background threads instead of dict_operation_lock
Created: 2018-07-03
Updated: 2021-09-30
Resolved: 2019-12-10

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Fix Version/s: 10.5.1

Type: Task
Priority: Critical
Reporter: Thirunarayanan Balathandayuthapani
Assignee: Marko Mäkelä
Resolution: Fixed
Votes: 2
Labels: None

Attachments: File MDEV-16678.ods     File MDEV-16678_1.tgz     File MDEV-16678_2.test    
Issue Links:
Problem/Incident
causes MDEV-21282 Assertion 'mariadb_table' failed in g... Closed
causes MDEV-21327 Notify tpool threadpool about MDL waits Closed
causes MDEV-21546 main.backup_stages occasionally fails... Closed
causes MDEV-22060 MSAN use-of-uninitialized-value in ma... Closed
causes MDEV-22751 Uninitialized tbl_len in dict_acquire... Closed
causes MDEV-22824 Buffer overflow in dict_table_t::pars... Closed
causes MDEV-23422 innodb_zip.restart fails in buildbot ... Closed
Relates
relates to MDEV-21175 Remove dict_table_t::n_foreign_key_ch... Closed
relates to MDEV-21281 main.backup_interaction fails intermi... Closed
relates to MDEV-21400 encryption.innochecksum fails during ... Closed
relates to MDEV-23026 Server hangs on purge or failing asse... Closed
relates to MDEV-24661 The test innodb.innodb_wl6326_big oft... Closed
relates to MDEV-18654 Failing assertion: sym_node->table !=... Closed
relates to MDEV-20874 Wrong handling of 'table was dropped'... Stalled
relates to MDEV-20876 Remove node->vcol_op_failed() method Closed
relates to MDEV-21283 InnoDB: MySQL is trying to drop datab... Closed
relates to MDEV-21344 Valgrind uninitialised value warnings... Closed
relates to MDEV-22867 Assertion `instant.n_core_fields == n... Closed
relates to MDEV-22958 innodb.instant_alter_debug fails in b... Closed

 Description   

The purge thread takes a shared lock on the InnoDB data dictionary lock (dict_operation_lock) while processing undo log records, to prevent the table from being dropped. But this also blocks all DDL in InnoDB. There are also a few issues with virtual column computation: the purge thread acquires MDL for virtual column computation and could deadlock with DDL. (fixed in 10.2+)

Allow InnoDB background threads to acquire MDL on the table. In that case, they block DDL only for that table.
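
The difference in lock granularity can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyLock`, `GlobalScheme` and `PerTableScheme` are hypothetical names invented for illustration, not MariaDB or InnoDB types, and the toy lock never waits (a real DDL statement would block on the lock instead of failing).

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy shared/exclusive lock: it only tracks holders and never blocks.
// Hypothetical illustration; not a MariaDB or InnoDB type.
struct ToyLock {
  int shared_holders = 0;
  bool exclusive_held = false;

  bool try_shared() {
    if (exclusive_held) return false;
    ++shared_holders;
    return true;
  }
  void release_shared() { --shared_holders; }

  bool try_exclusive() {
    if (exclusive_held || shared_holders > 0) return false;
    exclusive_held = true;
    return true;
  }
  void release_exclusive() { exclusive_held = false; }
};

// Scheme A: a single global lock, modelling a shared dictionary latch.
// Background work on any table blocks DDL on every table.
struct GlobalScheme {
  ToyLock global;
  bool begin_background(const std::string&) { return global.try_shared(); }
  bool try_ddl(const std::string&) {
    if (!global.try_exclusive()) return false;
    global.release_exclusive();
    return true;
  }
};

// Scheme B: one lock per table name, modelling a per-table metadata lock.
// Background work on one table blocks DDL on that table only.
struct PerTableScheme {
  std::map<std::string, ToyLock> locks;
  bool begin_background(const std::string& t) { return locks[t].try_shared(); }
  bool try_ddl(const std::string& t) {
    ToyLock& l = locks[t];
    if (!l.try_exclusive()) return false;
    l.release_exclusive();
    return true;
  }
};
```

In this model, after `begin_background("test/t1")`, `GlobalScheme::try_ddl("test/t2")` is refused, while `PerTableScheme::try_ddl("test/t2")` succeeds and only `try_ddl("test/t1")` is refused.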

For FOREIGN KEY constraint checks, we would prefer not to acquire dict_operation_lock S-latch, and rely on the correct acquisition of MDL on the SQL layer (to be covered by MDEV-21175).



 Comments   
Comment by Thirunarayanan Balathandayuthapani [ 2019-08-20 ]

In `row_update_for_mysql()`, there is no need to take the data dictionary lock to initialize fts_doc_id, because marko mentioned that the SQL layer takes MDL when a foreign key is involved.

Comment by Matthias Leich [ 2019-11-15 ]

MDEV-16678_1.tgz - Archive with files for replaying the problem
mysqld:  storage/innobase/fts/fts0fts.cc:4290: dberr_t fts_sync(fts_sync_t*, bool, bool): Assertion `sync->unlock_cache' failed.
 
How to install and run:
git clone https://github.com/mleich1/rqg --branch experimental RQG_mleich1
cd RQG_mleich1
tar xvzf <path_to>/MDEV-16678_1.tgz
./MDEV-16678_1.sh <path to MariaDB binaries>

Comment by Matthias Leich [ 2019-11-18 ]

RQG runs that end with the following are not that rare:
DEADLOCK of threads detected!
....
[ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected!
[ERROR] mysqld got signal 6 ;
 
I am working on a simplified replay test case.

Comment by Matthias Leich [ 2019-11-19 ]

MDEV-16678_2.test - MTR based test which throws
[ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected!
 
I tried the same test on 10.5: 10 attempts, but no replay.

Comment by Matthias Leich [ 2019-11-25 ]

bb-10.5-MDEV-16678-rebase2 commit 6333bd7b334b821d9688b5eee4e79066241e036b
1. mysqld:  storage/innobase/fts/fts0fts.cc:4290: dberr_t fts_sync(fts_sync_t*, bool, bool): Assertion `sync->unlock_cache' failed.
    and
    [ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected!
    were never observed again.
2. Remaining failures (need to check whether they are already in JIRA)
    The frequencies were taken from a grammar simplification campaign (4700 RQG runs) with constantly changing RQG grammars.
    - 1 * storage/innobase/trx/trx0rec.cc:238: byte* trx_undo_log_v_idx(buf_block_t*, const dict_table_t*, ulint, byte*, bool): Assertion `n_idx > 0' failed.
    - 1 * storage/innobase/include/dict0mem.h:738: void dict_v_col_t::detach(const dict_index_t&): Assertion `n == n_v_indexes' failed.
    - 11 * storage/innobase/handler/handler0alter.cc:560: bool dict_table_t::instant_column(const dict_table_t&, const ulint*): Assertion `v.v_indexes.empty()' failed.
    - 1 * storage/innobase/handler/handler0alter.cc:11077: virtual bool ha_innobase::commit_inplace_alter_table(TABLE*, Alter_inplace_info*, bool): Assertion `ctx0->old_table->get_ref_count() == 1' failed.
    - 3 * bb-10.5-MDEV-16678-rebase/storage/innobase/dict/dict0load.cc:1955: void dict_load_virtual_one_col(dict_table_t*, ulint, dict_v_col_t*, mem_heap_t*): Assertion `pos == vcol_pos' failed.
    - 2 * storage/innobase/dict/dict0load.cc:1649: const char* dict_load_column_low(dict_table_t*, mem_heap_t*, dict_col_t*, table_id_t*, const char**, const rec_t*, ulint*): Assertion `vcol->v_pos == dict_get_v_col_pos(pos)' failed.
    - frequent: storage/innobase/btr/btr0cur.cc:507: dberr_t btr_cur_instant_init_low(dict_index_t*, mtr_t*): Assertion `index->n_core_fields + n_add >= index->n_fields' failed.
      https://jira.mariadb.org/browse/MDEV-21148 The problem is in current 10.5 too.
    - 1 * storage/innobase/btr/btr0cur.cc:1476: dberr_t btr_cur_search_to_nth_level_func(dict_index_t*, ulint, const dtuple_t*, page_cur_mode_t, ulint, btr_cur_t*, rw_lock_t*, const char*, unsigned int, mtr_t*, ib_uint64_t): Assertion `rw_lock_own(dict_index_get_lock(index), RW_LOCK_S)' failed.
      https://jira.mariadb.org/browse/MDEV-20038 The problem is in 10.n too.
    - frequent: [ERROR] InnoDB: Table test/t4 contains 7 indexes inside InnoDB, which is different from the number of indexes 8 defined in the MariaDB
      This is surprising because the test does not invoke crash recovery.
    - frequent: test failures where RQG believes it has hit a server freeze/deadlock, or where the server did not shut down properly.
      There is a good probability of false alarms by RQG, and these effects are known to occur in 10.5 too.

Comment by Matthias Leich [ 2019-11-27 ]

Test round on origin/bb-10.5-MDEV-16678-rebase2 b51478b219a9b347b496b2460c8b77a83dad1aa2 2019-11-27
with main focus on replaying "Assertion `ctx0->old_table->get_ref_count() == 1' failed"
- none of the virtual column related asserts was replayed
  == Looks like the fix for MDEV-21148 did an exceptionally good job
- 7 times  mysqld: storage/innobase/rem/rem0rec.cc:507: bool rec_offs_validate(const rec_t*, const dict_index_t*, const ulint*): Assertion `ulint(rec) == offsets[2]' failed.
  AFAIK not MDEV-16678 specific
- 1 time mysqld: storage/innobase/dict/dict0dict.cc:4559: void dict_table_check_for_dup_indexes(const dict_table_t*, check_name): Assertion `index1->is_committed() != index2->is_committed() || strcmp(index1->name, index2->name) != 0' failed.
  Not found in JIRA
 
So if the last assert is not MDEV-16678 specific, then MDEV-16678 should now be OK.

Comment by Marko Mäkelä [ 2019-12-05 ]

I pushed some cleanup, mainly to the MDL acquisition code.
There was a race condition where the dict_table_t::name was being freed and renamed while we were converting the table name to an MDL ticket name. This race could affect 10.2‥10.4 as well.
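
The actual fix is not shown in this thread. As a general illustration of this kind of race, one common approach is to copy the name while holding the lock that protects it, and to parse only the copy. The sketch below assumes hypothetical names (`ToyTable`, `name_snapshot`, `split_name`) that are not InnoDB APIs, and assumes the "db/table" name form seen elsewhere in this report (for example test/t4).

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical table object whose name can be replaced by a concurrent rename.
struct ToyTable {
  std::mutex name_mutex;  // protects `name`
  std::string name;       // "db/table"

  void rename(const std::string& new_name) {
    std::lock_guard<std::mutex> g(name_mutex);
    name = new_name;      // the old buffer may be released here
  }

  // Copy the name while holding the mutex, so that later parsing never
  // reads a buffer that a concurrent rename has released.
  std::string name_snapshot() {
    std::lock_guard<std::mutex> g(name_mutex);
    return name;
  }
};

// Split a "db/table" snapshot into its two parts.
// Returns false if there is no '/' separator.
inline bool split_name(const std::string& full, std::string& db,
                       std::string& tbl) {
  const std::string::size_type slash = full.find('/');
  if (slash == std::string::npos) return false;
  db = full.substr(0, slash);
  tbl = full.substr(slash + 1);
  return true;
}
```

The design choice is that the caller works on a private copy: a rename that happens after `name_snapshot()` returns cannot invalidate the string being split, at the cost of one allocation per lookup.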

axel, please run the normal R/W benchmarks on the branch and compare to the latest 10.5. We would like to ensure that there is no performance degradation for DML workloads.

Comment by Axel Schwenke [ 2019-12-10 ]

I ran standard OLTP workloads. Numbers and diagrams attached in MDEV-16678.ods

Observations:

  • the builds from 10.5 master (baseline) and bb-10.5-MDEV-16678-rebase2 (new) behave very similarly; the new code tends to scale better at high thread counts but is a little slower at 16 or 32 threads
  • disabling backup logs has a positive effect on performance for write-heavy workloads