MariaDB Server
MDEV-19749

MDL scalability regression after backup locks

Details

    Description

      In commit 7a9dfdd, the GLOBAL and COMMIT namespaces were combined into BACKUP, which doubled the load on the BACKUP lock mutex.

      This can be fixed by implementing something similar to MySQL WL#7306, "Improve MDL performance and scalability by implementing lock-free lock acquisition for DML".

      Activity

            The following patch has been used during some performance testing to remove the scalability bottleneck due to backup locks, so that we can better highlight scalability bottlenecks inside storage engines.

            diff --git a/sql/handler.cc b/sql/handler.cc
            index eb7b5b8..012ef20 100644
            --- a/sql/handler.cc
            +++ b/sql/handler.cc
            @@ -1567,7 +1567,7 @@ int ha_commit_trans(THD *thd, bool all)
               DBUG_PRINT("info", ("is_real_trans: %d  rw_trans:  %d  rw_ha_count: %d",
                                   is_real_trans, rw_trans, rw_ha_count));
             
            -  if (rw_trans)
            +  if (0 && rw_trans)
               {
                 /*
                   Acquire a metadata lock which will ensure that COMMIT is blocked
            diff --git a/sql/sql_base.cc b/sql/sql_base.cc
            index c41e08e..f9e3f34 100644
            --- a/sql/sql_base.cc
            +++ b/sql/sql_base.cc
            @@ -2100,7 +2100,7 @@ bool open_table(THD *thd, TABLE_LIST *table_list, Open_table_context *ot_ctx)
               }
             
               if (!(flags & MYSQL_OPEN_HAS_MDL_LOCK) &&
            -      table->s->table_category < TABLE_CATEGORY_INFORMATION)
            +      table->s->table_category < TABLE_CATEGORY_INFORMATION && 0)
               {
                 /*
                   We are not under LOCK TABLES and going to acquire write-lock/
            
            

            marko Marko Mäkelä added a comment

            Possibly, this locking can be removed as part of implementing backup in the server process (MDEV-14992).

            marko Marko Mäkelä added a comment
            monty Michael Widenius added a comment - - edited

            I do not think it is possible to remove the backup lock in ha_commit_trans().
            This lock is essential to ensure that we can get a consistent backup in any scenario (as far as I know).
            As long as we support external backup tools, we need to have it, even if we have in-server backup.
            The lock is also needed to be able to create a consistent snapshot of the server.

            The lock is also critically needed for binlogs.

            It would be nice to be able to 'know in advance' that a backup will take place, so that we could skip all backup locks when there will not be any lock for existing transactions.
            What could possibly be done is to have a flag that disables all the backup locks. When we do BACKUP STAGE START, we would then enable the lock and wait for all transactions that were started outside of the lock to complete before continuing.

            monty Michael Widenius added a comment - - edited

            Disabling backup locks at rw_trans would make backups, both internal and external, impossible.
            They are required to get a consistent backup point for all engines and replication.
            It would also break FLUSH TABLES WITH READ LOCK.

            One possible solution is to not take any backup locks before 'backup start' and have 'backup start' inform all threads that a backup is about to start.
            Then there would be a wait until all current transactions have ended and all transactions are 'backup aware' and have started to take backup locks.

            A rough estimate for getting this done is about one week.

            Another option is to see if we can speed up the backup lock in ha_commit_trans() so that, if there are no conflicting locks, there would never be a thread context switch.
            Maybe this could be done faster with two atomic increments when backup is not running.


            I’m not familiar with this code, so it could take me significantly more than a week to understand the logic deeply enough to fix this.

            I have a rough idea how we could minimize atomic writes or read-modify-writes of a shared memory location in the fast path. It would go something like this:

            1. The BACKUP statement sets a global flag or switches function pointers to note that a critical phase of backup is about to start.
            2. As a result of this, ha_commit_trans() and open_table() will start to register the THD as "backup aware" and acquire the locks as they would if the work-around patch from 2021-07-21 were not present.
            3. The BACKUP statement will poll until all active THD objects are registered "backup aware".
            4. Once the backup lock is released, the global atomic flag will be reset. As a result, open_table() and ha_commit_trans() will resume fast operation.
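            As a sketch of how the four steps above could fit together, here is a minimal self-contained model; every name in it (backup_phase_active, ThdState, enter_commit, and so on) is a hypothetical stand-in, not an identifier from the MariaDB source tree:

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Step 1: the BACKUP statement raises this flag (a real implementation
// might switch function pointers instead).
std::atomic<bool> backup_phase_active{false};

struct ThdState
{
  bool backup_aware= false;     // step 2: this THD has observed the flag
  bool holds_backup_mdl= false; // stands in for holding MDL_key::BACKUP
};

// Fast path in ha_commit_trans()/open_table(): while no backup is
// running there is no write to shared memory at all.
void enter_commit(ThdState &thd)
{
  if (backup_phase_active.load(std::memory_order_acquire))
  {
    thd.backup_aware= true;
    thd.holds_backup_mdl= true;
  }
}

// Step 3: after raising the flag, poll until every active THD has
// registered. A server would sleep and retry; this single pass just
// reports whether the wait is over.
bool all_thds_backup_aware(const std::vector<ThdState*> &active_thds)
{
  backup_phase_active.store(true, std::memory_order_release);
  for (const ThdState *thd : active_thds)
    if (!thd->backup_aware)
      return false;
  return true;
}

// Step 4: once the backup lock is released, resume fast operation.
void backup_stage_end()
{
  backup_phase_active.store(false, std::memory_order_release);
}
```

            The polling in step 3 is what makes the fast path safe: a THD that raced past the flag check before step 1 keeps the BACKUP statement waiting until it has re-entered and registered.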
            marko Marko Mäkelä added a comment

            I am working on the following solution (which is very much like Marko's suggestion) to remove some of the issues with BACKUP STAGE BLOCK_COMMIT;

            Adding before MDL_REQUEST_INIT() in ha_commit_trans():

                if (!(thd->backup_state= backup_is_running))
                {
                  my_atomic_add32(&thd_in_commit, 1, MY_MEMORY_ORDER_RELAXED);
                }
                else
                {
                  MDL_REQUEST_INIT(...);
                }

            and when releasing:

                if (mdl_backup.ticket)
                {
                  ....
                }
                else if (thd->backup_state)
                {
                  my_atomic_add32(&thd_in_commit, -1, MY_MEMORY_ORDER_RELAXED);
                  thd->backup_state= 0;
                }

            If we have the above, we could in BACKUP STAGE START do something like:

                backup_running= 1;
                do
                {
                  while (thd_in_commit)
                    sleep(1);
                }
                while (some_thd_has_backup_state_set());

            The effect would be to replace the MDL_backup lock in ha_commit_trans() with two atomic increments and two thread-local memory assignments when backup is not running. The cost is a slightly slower start of the backup.
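            Read one way (with thd->backup_state meaning "this commit took the fast path"), the pseudo-code above can be turned into a compilable sketch; std::atomic stands in for my_atomic_add32, and all declarations here are illustrative rather than the server's actual ones:

```cpp
#include <atomic>
#include <cassert>

std::atomic<int>  thd_in_commit{0};      // commits currently on the fast path
std::atomic<bool> backup_is_running{false};

// ha_commit_trans() entry: one relaxed atomic increment instead of the
// backup MDL while no backup is running.
bool begin_commit(bool &fast_path)
{
  if ((fast_path= !backup_is_running.load(std::memory_order_relaxed)))
    thd_in_commit.fetch_add(1, std::memory_order_relaxed);
  return fast_path;             // false: caller must take the backup MDL
}

// Release: undo the increment only if the fast path was taken.
void end_commit(bool &fast_path)
{
  if (fast_path)
  {
    thd_in_commit.fetch_sub(1, std::memory_order_relaxed);
    fast_path= false;
  }
}

// BACKUP STAGE START: stop new fast-path commits, then drain the
// in-flight ones (the proposal sleeps between polls; the outer loop
// over per-THD backup_state would catch racing threads).
void backup_stage_start()
{
  backup_is_running.store(true, std::memory_order_relaxed);
  while (thd_in_commit.load(std::memory_order_relaxed) != 0)
    ; // sleep(1) in the real proposal
}
```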

            monty Michael Widenius added a comment

            monty, my idea was that the "fast" code path (no backup is running) would not run any atomic read-modify-write operations. This idea would work if the logic that handles the BACKUP statement could somehow count all active THD objects and wait until the atomic counter has reached that value, instead of the counter being 0.

            marko Marko Mäkelä added a comment

            Pushed to bb-10.11-monty for testing

            monty Michael Widenius added a comment

            monty, I revised the logic a little and created https://github.com/MariaDB/server/pull/3966 for this, currently as a draft. It seems to me that the global atomic counter may be redundant. I am still reviewing the logic for potential race conditions.

            marko Marko Mäkelä added a comment

            As far as I understand from the preliminary performance test results, monty replaced the MDL_key::BACKUP acquisition in ha_commit_trans() but not in open_tables(), where the relevant section starts as follows:

                enum enum_mdl_type mdl_type= MDL_BACKUP_DML;
             
                if (table->s->table_category != TABLE_CATEGORY_USER)
                  mdl_type= MDL_BACKUP_SYS_DML;
                else if (table->s->online_backup)
                  mdl_type= MDL_BACKUP_TRANS_DML;
             
                if (table_list->mdl_request.is_write_lock_request() &&
                    ! (flags & (MYSQL_OPEN_IGNORE_GLOBAL_READ_LOCK |
                                MYSQL_OPEN_FORCE_SHARED_MDL |
                                MYSQL_OPEN_FORCE_SHARED_HIGH_PRIO_MDL |
                                MYSQL_OPEN_SKIP_SCOPED_MDL_LOCK)) &&
                    ! ot_ctx->has_protection_against_grl(mdl_type))
                {
                  MDL_request protection_request;
                  MDL_deadlock_handler mdl_deadlock_handler(ot_ctx);
             
                  if (thd->has_read_only_protection())
                  {
                    MYSQL_UNBIND_TABLE(table->file);
                    table->vcol_cleanup_expr(thd);
                    tc_release_table(table);
                    DBUG_RETURN(TRUE);
                  }
             
                  MDL_REQUEST_INIT(&protection_request, MDL_key::BACKUP, "", "", mdl_type,
                                   MDL_STATEMENT);
            

            It seems to me that it could be useful to check flags and possibly the presence of any pending BACKUP STAGE or FLUSH TABLES WITH READ LOCK request before assigning mdl_type or executing the rest of the above logic.
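            As a sketch of that idea, the protection request could be gated on a single pending-backup flag; backup_or_ftwrl_pending and the function below are hypothetical, not existing server identifiers:

```cpp
#include <atomic>
#include <cassert>

// Hypothetical flag, raised by BACKUP STAGE / FLUSH TABLES WITH READ LOCK.
std::atomic<bool> backup_or_ftwrl_pending{false};

// Mirror of the quoted condition: keep the existing early-outs on the
// open flags, and additionally skip the MDL_BACKUP_* protection request
// while no backup or FTWRL is pending.
bool need_backup_protection(bool is_write_lock_request,
                            unsigned open_flags, unsigned skip_mask)
{
  if (!is_write_lock_request || (open_flags & skip_mask))
    return false;
  return backup_or_ftwrl_pending.load(std::memory_order_acquire);
}
```

            The hard part, as in ha_commit_trans(), would be the handover: requests already past the check when the flag is raised must still be waited for.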

            marko Marko Mäkelä added a comment
            monty Michael Widenius added a comment -

            Eugine, please do not use std::array() as this allocates memory that is not tracked.
            Instead use:
            uint32 m_type_counter[MDL_BACKUP_END];

            We should also use uint32 instead of uint32_t in MariaDB code.

            marko Marko Mäkelä added a comment -

            To my understanding, std::array is a fixed-size container similar to the basic array data type in C or C++. The following existing declaration in sql/mdl.cc, which had been added in an earlier attempt to do something about this performance regression,

            std::array<uint32_t, MDL_BACKUP_END> m_type_counters;
            

            should be equivalent to the ‘traditional’ array that you are proposing. As far as I understand, a benefit of std::array over ‘traditional’ arrays is compatibility with the algorithms in #include <algorithm>. Any allocation would be managed outside the std::array. Unlike std::vector, a std::array cannot be grown afterwards, and it does not keep track of the number of elements actually in use.
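            The "no hidden allocation" point can be checked directly: std::array wraps a plain C array with no heap allocation or size bookkeeping, so on mainstream compilers the two declarations occupy the same storage (MDL_BACKUP_END is hard-coded here for the sketch):

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <type_traits>

constexpr std::size_t MDL_BACKUP_END= 14;  // stand-in for the server enum

std::array<std::uint32_t, MDL_BACKUP_END> counters_std{};
std::uint32_t counters_c[MDL_BACKUP_END]= {};

// No vtable, no pointer to heap storage, no stored element count:
static_assert(sizeof(counters_std) == sizeof(counters_c),
              "same storage as a C array");
static_assert(std::is_trivially_copyable<decltype(counters_std)>::value,
              "plain bytes, memcpy-able like a C array");
```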

            marko Marko Mäkelä added a comment -

            The numeric value of MDL_BACKUP_END is 14. I was wondering whether the array m_type_counters and m_bitmap could be combined somehow. As far as I understand, m_bitmap is a ‘lookup table’ of m_type_counters, which in turn is a ‘lookup table’ of m_list. We do seem to need at least uint16_t for each element of m_type_counters, because it is imaginable to have 2¹⁶ concurrent requests (from concurrently executing threads) on a particular metadata lock.

            I was also thinking about how hard it would be to remove m_bitmap. This would involve replacing hog_lock_types_bitmap(), incompatible_waiting_types_bitmap() and incompatible_granted_types_bitmap(), and refactoring their callers MDL_lock::reschedule_waiters() and MDL_lock::can_grant_lock(). The idea would be to consult m_type_counters directly. This looks rather complex. We might upgrade our build requirements from generic x86-64 to x86-64-v2 (which includes SSE4.2), or create target-specific SIMD optimizations, such as making use of the SSE4.1 pmaxuw instruction to determine whether any pending requests of the lock types of interest exist. This would require m_type_counters to store uint16_t instead of uint32_t, and the elements would have to be ordered in a particular way, because only AVX-512 introduced mask registers.
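            To make "consult m_type_counters directly" concrete, here is a scalar sketch that recomputes a bitmap from the counters on demand (hypothetical helper names; a SIMD variant would test all 14 counters in one pmaxuw-style comparison):

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned MDL_BACKUP_END= 14;   // stand-in for the server enum

// What dropping m_bitmap would amount to: derive the bitmap from the
// per-type request counters whenever it is needed.
std::uint16_t bitmap_from_counters(const std::uint16_t (&counters)[MDL_BACKUP_END])
{
  std::uint16_t bitmap= 0;
  for (unsigned type= 0; type < MDL_BACKUP_END; type++)
    if (counters[type])
      bitmap= std::uint16_t(bitmap | (1u << type));
  return bitmap;
}

// can_grant_lock()-style test: no request of an incompatible type exists.
bool compatible(std::uint16_t incompatible_types,
                const std::uint16_t (&counters)[MDL_BACKUP_END])
{
  return (bitmap_from_counters(counters) & incompatible_types) == 0;
}
```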

            svoj Sergey Vojtovich added a comment -

            marko, what you suggest makes sense, but it is unrelated to this issue.

            marko Marko Mäkelä added a comment -

            svoj, right, my comments from today are only related to an earlier change that had been tagged with MDEV-19749 and that monty was apparently commenting on.

            monty Michael Widenius added a comment -

            Proposed fix for improving the BACKUP BLOCK_COMMIT lock pushed to bb-10.11-monty

            monty Michael Widenius added a comment -

            @marko, there is no point in using std::array for something that does not need it!
            We should avoid using std:: constructs, as these are harder to debug than what we already have and can cause more code to be generated because of template usage.


            People

              monty Michael Widenius
              svoj Sergey Vojtovich
              Votes: 1
              Watchers: 9

