MariaDB Server · MDEV-35000

dict_table_close() breaks innodb_stats_auto_recalc

Details

    Description

      In an email exchange with maxk, the following performance bottleneck was mentioned:

      /** Decrement the count of open handles */
      void dict_table_close(dict_table_t *table)
      {
        if (table->get_ref_count() == 1 &&
            dict_stats_is_persistent_enabled(table) &&
            strchr(table->name.m_name, '/'))
        {
          /* It looks like we are closing the last handle. The user could
          have executed FLUSH TABLES in order to have the statistics reloaded
          from the InnoDB persistent statistics tables. We must acquire
          exclusive dict_sys.latch to prevent a race condition with another
          thread concurrently acquiring a handle on the table. */
          dict_sys.lock(SRW_LOCK_CALL);
          if (table->release())
          {
            table->stats_mutex_lock();
            if (table->get_ref_count() == 0)
              dict_stats_deinit(table);
            table->stats_mutex_unlock();
          }
          dict_sys.unlock();
        }
        else
          table->release();
      }
      

      The intention of the exclusive dict_sys.latch is to prevent concurrent ha_innobase::open() from accessing a table whose statistics have been deinitialized when the last handle is being closed.

      I am thinking that we should replace this obscure logic, which is intended to react to FLUSH TABLES, with an additional function that would be executed only as part of FLUSH TABLES, right after all table handles have been closed. That function could then acquire exclusive dict_sys.latch and invalidate the cached InnoDB persistent statistics for all tables whose reference count is 0. The intention is to allow users to force a re-read of mysql.innodb_index_stats and mysql.innodb_table_stats after they have been modified by the DBA.
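
      A minimal sketch of what such a hook could look like (the function name dict_stats_flush_invalidate() is hypothetical; it reuses the primitives from the snippet above and assumes that walking dict_sys.table_LRU and dict_sys.table_non_LRU covers the cached tables):

      /** Hypothetical FLUSH TABLES hook: discard the cached persistent
      statistics of every table that has no open handles, so that the
      next open will reload them from mysql.innodb_table_stats and
      mysql.innodb_index_stats. */
      static void dict_stats_flush_invalidate()
      {
        dict_sys.lock(SRW_LOCK_CALL);
        for (dict_table_t *table= UT_LIST_GET_FIRST(dict_sys.table_LRU);
             table; table= UT_LIST_GET_NEXT(table_LRU, table))
          if (table->get_ref_count() == 0 &&
              dict_stats_is_persistent_enabled(table))
          {
            table->stats_mutex_lock();
            dict_stats_deinit(table);
            table->stats_mutex_unlock();
          }
        /* A complete implementation would walk dict_sys.table_non_LRU
        in the same way. */
        dict_sys.unlock();
      }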

          Activity

            As part of this, I also plan to clean up some data types related to statistics. Because the page number is stored in 32 bits in data files, the size of an index or table in pages must fit in uint32_t.
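
            A sketch of the kind of narrowing this implies (stat_clustered_index_size and stat_sum_of_other_index_sizes are the existing dict_table_t statistics members; the change shown is only illustrative):

            /* in dict_table_t: page counts must fit in 32 bits,
            because page numbers are stored in 32 bits in data files */
            uint32_t stat_clustered_index_size;      /* was: ulint */
            uint32_t stat_sum_of_other_index_sizes;  /* was: ulint */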

            marko Marko Mäkelä added a comment

            svoj’s idea of overriding Handler_share::~Handler_share() works, but we will need a way to invalidate the table object in the derived class InnoDB_share when it is being evicted from the cache. How to do that, I do not currently know. Perhaps have a pointer in dict_table_t, and a dict_table_t* inside InnoDB_share that we would either set to nullptr or update to another table object that will replace the old one on DDL?
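
            A minimal sketch of that idea, assuming the derived class and pointer member described above (the destructor body is illustrative; the invalidation protocol is exactly the open question):

            class InnoDB_share : public Handler_share
            {
            public:
              /* Cached dictionary object; DDL would either update this
              to the replacement table object or set it to nullptr. */
              dict_table_t *table= nullptr;

              ~InnoDB_share() override
              {
                /* Invoked when the share is evicted, e.g. on FLUSH
                TABLES; releasing the reference makes the statistics
                eligible for invalidation. */
                if (table)
                  dict_table_close(table);
              }
            };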

            In the current state of the code, we can observe a massive amount of heap-use-after-free failures in certain tests, such as innodb.alter_not_null.

            marko Marko Mäkelä added a comment

            The interface handler::get_ha_share_ptr() seems inadequate for partitioned tables, because Handler_share objects are attached to handler::table_share. There is ha_partition::set_ha_share_ref(), and Partition_share::partitions_share_refs seems to keep track of any underlying Handler_share that could be assigned by invoking handler::set_ha_share_ptr(). We would seem to need a refined interface so that InnoDB can access its derived Handler_share object inside operations that rebuild a partition, such as ha_innobase::truncate() in ALTER TABLE…TRUNCATE PARTITION. In that case, handler::get_ha_share_ptr() would return a Partition_share.
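
            For reference, this is how an engine would normally attach its Handler_share through the generic interface (lock_shared_ha_data(), get_ha_share_ptr() and set_ha_share_ptr() are the existing handler members; the method name get_innodb_share() is hypothetical). For a partitioned table, get_ha_share_ptr() would hand back the Partition_share instead, which is exactly the problem described above:

            InnoDB_share *ha_innobase::get_innodb_share()
            {
              lock_shared_ha_data();
              auto *share= static_cast<InnoDB_share*>(get_ha_share_ptr());
              if (!share)
              {
                /* First open of the table: create and publish the share. */
                share= new InnoDB_share;
                set_ha_share_ptr(share);
              }
              unlock_shared_ha_data();
              return share;
            }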

            marko Marko Mäkelä added a comment

            Another problem that I am hitting seems to share a root cause with MDEV-17805: the error handling of some DDL operations would invoke ha_innobase::delete_table() on a different object than the one that ha_innobase::create() was invoked with. That different object would not have any associated table_share, so we would miss a call to Handler_share::~Handler_share() via handler::ha_share. In my Handler_share derived object, I would acquire a dict_table_t reference in order to prevent a premature eviction of the table from the InnoDB dict_sys cache. The dangling table reference would cause some DDL operations to fail, because tables would not be properly dropped. I tried to work around this with the following kind of condition:

            @@ -13523,7 +13511,8 @@ ha_innobase::create(const char *name, TABLE *form, HA_CREATE_INFO *create_info,
                     if (!info.table()->is_temporary())
                     {
                       log_write_up_to(trx->commit_lsn, true);
            -          if (info.table()->stats_is_persistent())
            +          if (info.table()->stats_is_persistent() &&
            +              table_share->tmp_table == NO_TMP_TABLE)
                         innodb_share_register(*info.table());
                     }
                     info.table()->release();
            

            Unfortunately, during a plain CREATE TABLE this function turns out to be invoked with an initial table_share object that was created by init_tmp_table_share(), with table_share->tmp_table == INTERNAL_TMP_TABLE. Because of that, we’d fail to invalidate the InnoDB table statistics on a subsequent FLUSH TABLES, and the test innodb.innodb_stats_auto_recalc_on_nonexistent would return a wrong result.

            marko Marko Mäkelä added a comment

            It seems that the above code in ha_innobase::create() only affects the test innodb.innodb_stats_auto_recalc_on_nonexistent and nothing else. The main open question is how to fix this for partitioned tables. We currently do not have any test that would cover reloading the statistics for partitioned tables.

            marko Marko Mäkelä added a comment

            A support customer reported that for an InnoDB table, statistics had not been recalculated in more than 2 years. A plausible explanation is the logic in dict_table_close() that tries to ensure that InnoDB statistics will be reloaded after the statistics tables have been modified using statements like INSERT, UPDATE, or DELETE. The idea was that when the table is reopened after FLUSH TABLES, the statistics would be reloaded.

            It is very unlikely that anyone would actually need such logic. A more likely course of action would be to execute ANALYZE TABLE.

            The logic in dict_table_close() is harmful in that it will effectively cause frequent resets of the counter dict_table_t::stat_modified_counter and therefore prevent the automatic recalculation of InnoDB persistent statistics, which is enabled by the configuration parameter innodb_stats_auto_recalc or the table attribute STATS_AUTO_RECALC=1.
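
            For reference, a simplified sketch of how that counter drives the recalculation (modelled on dict_stats_update_if_needed(); the 10% threshold reflects the documented innodb_stats_auto_recalc behaviour):

            void dict_stats_update_if_needed(dict_table_t *table)
            {
              const uint64_t counter= table->stat_modified_counter++;
              const uint64_t n_rows= table->stat_n_rows;

              /* Recalculate once roughly 10% of the rows have been
              modified. If closing the last handle keeps resetting
              stat_modified_counter, this threshold is never reached. */
              if (counter > n_rows / 10 &&
                  dict_stats_auto_recalc_is_enabled(table))
                dict_stats_recalc_pool_add(table);
            }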

            marko Marko Mäkelä added a comment

            serg pointed out that before this change, executing the following would cause InnoDB persistent statistics to be reloaded:

            RENAME TABLE t TO tmp, tmp TO t;
            

            I revised the change so that this continues to work. Only after FLUSH TABLES will InnoDB cease to reload the statistics.

            I removed 10.6 from the target versions and rebased the fix to 10.11 in order to conform to a recent policy change.

            marko Marko Mäkelä added a comment

            People

              Assignee: marko Marko Mäkelä
              Reporter: marko Marko Mäkelä
