MDEV-16678

Use MDL for innodb background threads instead of dict_operation_lock

Details

    Description

      The purge thread takes a shared lock on the InnoDB data dictionary lock (dict_operation_lock) while processing an undo log record, to prevent the table from being dropped. But this also blocks DDL for
      all of InnoDB. There are also a few issues with virtual column computation,
      because the purge thread acquires an MDL for virtual column computation and could
      deadlock with DDL. (fixed in 10.2+)

      Allow InnoDB background threads to take an MDL on the table. In that case, they block DDL only for that table.
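
The difference in blocking scope can be illustrated with a toy, single-threaded model of shared/exclusive lock compatibility. This is not InnoDB or MDL code: `LockModel`, the two helper functions, the key `<dict_operation_lock>` and the table names `test/t1` and `test/t2` are all hypothetical and exist only for this sketch. It assumes that purge needs a shared lock and DDL needs an exclusive one.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model of shared/exclusive lock compatibility, keyed by lock name.
// Hypothetical illustration only; it does not use any MariaDB API.
class LockModel {
public:
    // A shared request succeeds unless the lock is held exclusively.
    bool try_shared(const std::string& key) {
        State& s = locks_[key];
        if (s.exclusive) return false;
        ++s.shared;
        return true;
    }

    // An exclusive request succeeds only if nobody holds the lock.
    bool try_exclusive(const std::string& key) {
        State& s = locks_[key];
        if (s.exclusive || s.shared > 0) return false;
        s.exclusive = true;
        return true;
    }

private:
    struct State {
        int shared = 0;
        bool exclusive = false;
    };
    std::map<std::string, State> locks_;
};

// One global lock: purge on any table blocks DDL on every table,
// so the table names do not matter here.
bool ddl_allowed_with_global_lock(const std::string&, const std::string&) {
    LockModel m;
    m.try_shared("<dict_operation_lock>");           // purge
    return m.try_exclusive("<dict_operation_lock>"); // DDL
}

// Per-table locks: purge on one table blocks DDL on that table only.
bool ddl_allowed_with_table_locks(const std::string& purge_table,
                                  const std::string& ddl_table) {
    LockModel m;
    m.try_shared(purge_table);        // purge holds a shared lock on its table
    return m.try_exclusive(ddl_table); // DDL needs an exclusive lock on its table
}
```

In this model, DDL on `test/t2` is refused under the global lock while purge works on `test/t1`, but it is granted with per-table locks; DDL on `test/t1` itself is refused in both designs.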

      For FOREIGN KEY constraint checks, we would prefer not to acquire the dict_operation_lock S-latch, and to rely instead on the correct acquisition of MDL in the SQL layer (to be covered by MDEV-21175).

      Attachments

        1. MDEV-16678_1.tgz
          17 kB
        2. MDEV-16678_2.test
          5 kB
        3. MDEV-16678.ods
          71 kB

          Activity

            thiru Thirunarayanan Balathandayuthapani added a comment -

            In `row_update_for_mysql()`, there is no need to take the data dictionary lock to initialize fts_doc_id, because marko mentioned that the SQL layer takes an MDL when a foreign key
            is involved.

            mleich Matthias Leich added a comment -

            MDEV-16678_1.tgz - Archive with files for replaying the problem
            mysqld:  storage/innobase/fts/fts0fts.cc:4290: dberr_t fts_sync(fts_sync_t*, bool, bool): Assertion `sync->unlock_cache' failed.

            How to install and run:
            git clone https://github.com/mleich1/rqg --branch experimental RQG_mleich1
            cd RQG_mleich1
            tar xvzf <path_to>/MDEV-16678_1.tgz
            ./MDEV-16678_1.sh <path to MariaDB binaries>

            mleich Matthias Leich added a comment -

            RQG runs that end with
            DEADLOCK of threads detected!
            ....
            [ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected!
            [ERROR] mysqld got signal 6 ;
            are not that rare.

            I am working on a simplified replay test case.

            mleich Matthias Leich added a comment - edited

            MDEV-16678_2.test - MTR-based test which throws
            [ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected!

            I tried the same test on 10.5: 10 attempts, but no replay.

            mleich Matthias Leich added a comment - edited

            bb-10.5-MDEV-16678-rebase2 commit 6333bd7b334b821d9688b5eee4e79066241e036b
            1. mysqld:  storage/innobase/fts/fts0fts.cc:4290: dberr_t fts_sync(fts_sync_t*, bool, bool): Assertion `sync->unlock_cache' failed.
               and
               [ERROR] [FATAL] InnoDB: ######################################## Deadlock Detected!
               were never observed again.
            2. Remaining failures (need to check whether they are already in JIRA).
               The frequencies were taken from a grammar simplification campaign (4700 RQG runs) with constantly changing RQG grammars.
               - 1 * storage/innobase/trx/trx0rec.cc:238: byte* trx_undo_log_v_idx(buf_block_t*, const dict_table_t*, ulint, byte*, bool): Assertion `n_idx > 0' failed.
               - 1 * storage/innobase/include/dict0mem.h:738: void dict_v_col_t::detach(const dict_index_t&): Assertion `n == n_v_indexes' failed.
               - 11 * storage/innobase/handler/handler0alter.cc:560: bool dict_table_t::instant_column(const dict_table_t&, const ulint*): Assertion `v.v_indexes.empty()' failed.
               - 1 * storage/innobase/handler/handler0alter.cc:11077: virtual bool ha_innobase::commit_inplace_alter_table(TABLE*, Alter_inplace_info*, bool): Assertion `ctx0->old_table->get_ref_count() == 1' failed.
               - 3 * bb-10.5-MDEV-16678-rebase/storage/innobase/dict/dict0load.cc:1955: void dict_load_virtual_one_col(dict_table_t*, ulint, dict_v_col_t*, mem_heap_t*): Assertion `pos == vcol_pos' failed.
               - 2 * storage/innobase/dict/dict0load.cc:1649: const char* dict_load_column_low(dict_table_t*, mem_heap_t*, dict_col_t*, table_id_t*, const char**, const rec_t*, ulint*): Assertion `vcol->v_pos == dict_get_v_col_pos(pos)' failed.
               - frequent: storage/innobase/btr/btr0cur.cc:507: dberr_t btr_cur_instant_init_low(dict_index_t*, mtr_t*): Assertion `index->n_core_fields + n_add >= index->n_fields' failed.
                 https://jira.mariadb.org/browse/MDEV-21148   The problem is in the current 10.5 too.
               - 1 * storage/innobase/btr/btr0cur.cc:1476: dberr_t btr_cur_search_to_nth_level_func(dict_index_t*, ulint, const dtuple_t*, page_cur_mode_t, ulint, btr_cur_t*, rw_lock_t*, const char*, unsigned int, mtr_t*, ib_uint64_t): Assertion `rw_lock_own(dict_index_get_lock(index), RW_LOCK_S)' failed.
                 https://jira.mariadb.org/browse/MDEV-20038   The problem is in 10.n too.
               - frequent: [ERROR] InnoDB: Table test/t4 contains 7 indexes inside InnoDB, which is different from the number of indexes 8 defined in the MariaDB
                 This is surprising, because the test does not invoke crash recovery.
               - frequent: test failures where RQG believes it has met a server freeze/deadlock, or where the server
                 did not shut down properly.
                 There is a good probability of false alarms by RQG, and these effects are known to occur in 10.5 too.

            mleich Matthias Leich added a comment -

            Test round on origin/bb-10.5-MDEV-16678-rebase2 b51478b219a9b347b496b2460c8b77a83dad1aa2 2019-11-27
            with the main focus on replaying "Assertion `ctx0->old_table->get_ref_count() == 1' failed"
            - none of the virtual column related asserts was replayed
              == Looks like the fix for MDEV-21148 did an exceptionally good job
            - 7 times  mysqld: storage/innobase/rem/rem0rec.cc:507: bool rec_offs_validate(const rec_t*, const dict_index_t*, const ulint*): Assertion `ulint(rec) == offsets[2]' failed.
              AFAIK not MDEV-16678 specific
            - 1 time mysqld: storage/innobase/dict/dict0dict.cc:4559: void dict_table_check_for_dup_indexes(const dict_table_t*, check_name): Assertion `index1->is_committed() != index2->is_committed() || strcmp(index1->name, index2->name) != 0' failed.
              Not found in JIRA

            So if the last assert is not MDEV-16678 specific, then MDEV-16678 should now be OK.

            marko Marko Mäkelä added a comment -

            I pushed some cleanup, mainly to the MDL acquisition code.
            There was a race condition where the dict_table_t::name was being freed and renamed while we were converting the table name to an MDL ticket name. This race could affect 10.2‥10.4 as well.

            axel, please run the normal R/W benchmarks on the branch and compare to the latest 10.5. We would like to ensure that there is no performance degradation for DML workloads.
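
A race of the kind described in the cleanup comment above (a table name being freed or renamed while it is read to build an MDL name) is commonly avoided by copying the name while holding the lock that protects renames, and then working only on the private copy. The sketch below is a generic illustration under that assumption, not the pushed fix: `Table`, `name_mutex`, `snapshot_name`, `mdl_key_parts` and `rename_table` are hypothetical names, and the only thing taken from this report is the `db/table` name form seen in messages such as `test/t4`.

```cpp
#include <mutex>
#include <string>
#include <utility>

// Hypothetical table descriptor; this is not the real dict_table_t.
struct Table {
    std::mutex name_mutex;  // assumed to protect `name` against rename
    std::string name;       // "db/table"
};

// Copy the name while holding the mutex, so that a concurrent rename
// cannot free or change the buffer while it is being read.
std::string snapshot_name(Table& t) {
    std::lock_guard<std::mutex> guard(t.name_mutex);
    return t.name;
}

// Split "db/table" into (db, table), working only on the private copy.
std::pair<std::string, std::string> mdl_key_parts(Table& t) {
    const std::string copy = snapshot_name(t);
    const std::string::size_type slash = copy.find('/');
    if (slash == std::string::npos) {
        return std::make_pair(std::string(), copy);
    }
    return std::make_pair(copy.substr(0, slash), copy.substr(slash + 1));
}

// A rename takes the same mutex, so it never overlaps with a snapshot.
void rename_table(Table& t, const std::string& new_name) {
    std::lock_guard<std::mutex> guard(t.name_mutex);
    t.name = new_name;
}
```

The cost of this design is one string copy per lookup; in exchange, the reader never holds a pointer into memory that a rename could free.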
            axel Axel Schwenke added a comment -

            I ran standard OLTP workloads. Numbers and diagrams attached in MDEV-16678.ods

            Observations:

            • the builds from 10.5 master (baseline) and bb-10.5-MDEV-16678-rebase2 (new) behave very similarly; the new code tends to scale better at high thread counts but is a little slower at 16 or 32 threads
            • disabling backup logs has a positive effect on performance for write-heavy workloads

            People

              marko Marko Mäkelä
              thiru Thirunarayanan Balathandayuthapani