[MDEV-24303] information_schema.INNODB_SYS_TABLESTATS stat_modified_counter thread-safety issue Created: 2020-11-27  Updated: 2022-11-06

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.4.17, 10.5.8
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Major
Reporter: Jukka Santala Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: performance
Environment:

CentOS 7



 Description   

I noticed some weird behavior with the InnoDB automatic persistent stats recalculation counter, ran MariaDB 10.5.8 under TSAN, and discovered that stat_modified_counter is accessed in a thread-unsafe manner. The worst case is dict_stats_update_if_needed(_func) in dict0stats_bg.cc, where the counter is incremented and then zeroed if an automatic stats update is triggered, so one thread could zero the counter while another thread is setting it one higher. In row0mysql.cc there are also branches which increase the counter without calling dict_stats_update_if_needed(_func).

Putting all of those inside the dict_sys mutex solves the thread conflicts, but probably has an unacceptable performance impact. Since dict_stats_update_if_needed(_func) compares stat_modified_counter to dict_table_get_n_rows(table), which could also be being updated concurrently, the mutex may be the best recourse here. In dict0stats.cc the updates are already wrapped in the dict_sys mutex.
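To make the race concrete, here is a minimal sketch of the pattern in dict_stats_update_if_needed(): the struct and function names below are illustrative stand-ins, not the actual InnoDB declarations, but the read-modify-write shape matches the description above.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the relevant dict_table_t fields.
struct table_stats {
    uint64_t stat_modified_counter = 0;  // plain field: races under concurrency
    uint64_t n_rows = 0;                 // also updated live by other threads
};

// Pattern from dict_stats_update_if_needed(): unsynchronized read-modify-write.
// Between (1) and (3), another thread's increment can be lost, and the reset
// at (3) can clobber a concurrent increment entirely.
bool update_if_needed(table_stats& t) {
    uint64_t counter = t.stat_modified_counter++;  // (1) unsynchronized RMW
    uint64_t n_rows = t.n_rows;                    // (2) read of a live value
    if (counter > n_rows / 10 /* 10% */) {
        t.stat_modified_counter = 0;               // (3) may erase a concurrent ++
        return true;                               // table queued for recalc
    }
    return false;
}
```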

As a note, row0umod.cc also increases stat_modified_counter, but according to the comments that branch is only called while already holding the dict_sys mutex. This comment appears to be wrong, at least in current versions, because dict_stats_update_if_needed(_func) indeed doesn't acquire the mutex. Since that code undoes a modification, it seems to me that it should decrease the modify counter (if greater than zero), not increase it. Because dict_stats_update_if_needed(_func) already increases it, I'm not sure what the best course of action is there.

A more significant change would be to have the modified counter count downwards. This would be "fairer" than the up-counter, because INSERTs and DELETEs change the threshold being compared against, significantly affecting how often the background analysis is run for smaller tables. It would not solve the counter atomicity by itself, but it should remove the need for a comparison against dict_table_get_n_rows(table) and allow using std::atomic, or we could perhaps just accept that some decrements may be lost.
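A sketch of the down-counting idea, with invented names: the budget is fixed when it is armed, so the hot path never reads the live row count, and (per the "accept that some decrements could be lost" option) a plain load/store pair is used instead of an atomic read-modify-write, tolerating an occasional lost decrement.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative down-counting variant, not the actual InnoDB code.
struct down_counter {
    std::atomic<uint64_t> remaining{0};

    // Re-arm after a stats recalculation: allow ~10% of rows to change.
    void arm(uint64_t n_rows) {
        remaining.store(n_rows / 10 + 1, std::memory_order_relaxed);
    }

    // Called once per modified row; true when a recalc should be queued.
    bool on_modification() {
        uint64_t cur = remaining.load(std::memory_order_relaxed);
        if (cur == 0)
            return false;  // already triggered; waiting for re-arm
        // Plain store rather than fetch_sub: a concurrent decrement can be
        // lost here, which merely delays the recalc slightly.
        remaining.store(cur - 1, std::memory_order_relaxed);
        return cur == 1;   // this call consumed the last unit
    }
};
```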

Due to the unpredictability of the current counter, I assume nobody is using it for anything useful that down-counting might break, although it would certainly change the intended behavior.

There's some undocumented behavior in that the UPD_NODE_NO_ORD_CHANGE transaction flag and the innodb_stats_include_delete_marked server variable alter the automatic recalculation. Transactions which do not change any index (or, by default, deletes) do not cause dict_stats_update_if_needed(_func) to be called and statistics to be recalculated, but the modification counter is increased, so the first subsequent index write triggers a statistics recalculation. This is quite odd behavior, but it explains why some tables aren't re-analyzed no matter the number of updates.



 Comments   
Comment by Marko Mäkelä [ 2020-11-28 ]

Thank you for the observations, which look valid to me. Would you like to contribute some changes in the form of a pull request?

I am assigning this to kevg, because he recently made some improvements in MDEV-23991.

I feel that the dict_sys.mutex is being overused. I do not want contention on it to increase.

Perhaps we could use dict_index_t::lock or std::atomic for additional protection? For relaxed loads and stores, according to https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html using std::atomic there should be no additional penalty. Read-modify-write operations are a different matter.
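An illustration of the cost model referred to above (a sketch, with invented function names): on x86-64, relaxed atomic loads and stores compile to the same plain MOV instructions as ordinary accesses, so merely declaring the field std::atomic costs nothing there; the penalty appears only for read-modify-write operations.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

uint64_t read_counter() {
    return counter.load(std::memory_order_relaxed);   // plain MOV on x86-64
}

void reset_counter() {
    counter.store(0, std::memory_order_relaxed);      // plain MOV on x86-64
}

// The expensive case: even with relaxed ordering, fetch_add emits
// LOCK XADD, which serializes on the cache line.
uint64_t bump_counter() {
    return counter.fetch_add(1, std::memory_order_relaxed);
}
```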

Comment by Jukka Santala [ 2020-11-29 ]

The code is littered with comments like:

                                /* Not protected by dict_sys.mutex for
                                performance reasons, we would rather
                                get garbage in stat_n_rows (which is
                                just an estimate anyway) than
                                protecting the following code with a
                                latch. */

So I get the idea that we would not want additional mutexes on every data-modifying query (it is also good practice to put a big comment wherever thread-unsafe access is intentional). Although std::atomic by default forces sequentially consistent access to the variable, preventing some optimizations, this is generally what we want: with optimizations it's hard to tell how weird things can get when another thread accesses the variable mid-operation. dict_table_get_n_rows(table) may be the biggest problem in the current code, because it's only protected by dict_sys.mutex.

If we can turn stat_modified_counter into a countdown instead, then stat_n_rows will only have to be read when dict_sys.mutex is already acquired for updating InnoDB persistent stats anyway (some refactoring is needed, and the timing of setting stat_modified_counter needs careful thought, but it's probably doable). I can try to write a patch based on that, although this doesn't look hugely critical now that I've found that stats recalculation doesn't get triggered by deletes/non-key-modifying queries, which explains why some of the tables weren't getting re-analyzed. I'm not sure if there's a way to test how often the automatic analysis is mis-triggered due to the data race alone; it can depend on the environment and compiler, of course, so if it can be fixed without performance impact it's worth doing.

Comment by Jukka Santala [ 2020-12-02 ]

After reviewing comments elsewhere in the code, the data races are indeed intentional, though comparing two live counters in a non-thread-safe manner is certainly asking for trouble. In addition to obvious instruction-ordering issues, modern CPU architectures of course don't guarantee that changes become visible to all processor threads in the same order. For good measure I tested std::atomic on CentOS 7 / gcc 9.3.1.
Current code:

mov    0x1d0(%rdi),%rcx
lea    0x1(%rcx),%rdx
mov    %rdx,0x1d0(%rdi)
...
movq   $0x0,0x1d0(%r12)

With std::atomic:

lea    0x1d0(%rdi),%rdx
mov    $0x1,%ecx
lock xadd %rcx,(%rdx)
...
movq   $0x0,0x1d0(%r12)
mfence

If there is low contention on the data address, the overhead of std::atomic may not be too bad: https://travisdowns.github.io/blog/2020/07/06/concurrency-costs.html However, n_rows and the persistent stats flags would have to be protected likewise, and indeed the comparison between them all, and the performance impact would start to be felt. But I believe we can still do better without synchronizing data access explicitly, by storing the intended change threshold up front and comparing the live counter against that. Hopefully we lose at most some counter changes, and the critical region stays small.
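A sketch of that "store the threshold up front" idea, with invented names: the threshold is frozen while the stats mutex is already held for the recalculation, so the hot path compares one live counter against a stable value instead of comparing two live values.

```cpp
#include <atomic>
#include <cstdint>

// Illustrative only; not the actual pull-request code.
struct stats_state {
    std::atomic<uint64_t> modified_counter{0};
    std::atomic<uint64_t> recalc_threshold{0};  // frozen at last recalc

    // Called with the stats mutex held, right after a recalculation.
    void after_recalc(uint64_t n_rows) {
        modified_counter.store(0, std::memory_order_relaxed);
        recalc_threshold.store(n_rows / 10, std::memory_order_relaxed);
    }

    // Hot path: one atomic increment plus one relaxed load; the live
    // row count is never consulted here.
    bool note_modification() {
        uint64_t c = modified_counter.fetch_add(1, std::memory_order_relaxed);
        return c >= recalc_threshold.load(std::memory_order_relaxed);
    }
};
```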

In the pull request I also addressed the initial issues I had with the persistent stats recalc either never getting run or constantly getting run unnecessarily, and with MODIFIED_COUNTER being non-obvious. In fact this would resolve at least MDEV-8297 and MDEV-9963, as well as https://bugs.mysql.com/bug.php?id=71469 and https://bugs.mysql.com/bug.php?id=71470 upstream, i.e. MODIFIED_COUNTER being reset unpredictably, contrary to documentation. This changes the MODIFIED_COUNTER semantics, but it's an improvement because many people seem to expect it to keep increasing.

Because the reanalysis counter is not changed for DML that doesn't change indexes (taking delete-marked records into account), tables will in general get re-analyzed less frequently, and the 10%/6.25% thresholds might need adjustment.

A fix for MDEV-24275 is included in the pull request, because without it MariaDB will crash & burn when testing re-analysis of larger tables. I can make a pull request with just the desired changes if preferred.

Comment by Marko Mäkelä [ 2020-12-02 ]

jsantala, I think that the comment that you found must be originally from 2012. I think that we can and should try to do better than that. Starting with MariaDB 10.4, C++11 and std::atomic are available to all storage engines.

One thing that I have been thinking about recently is: How to avoid extra overhead if multiple adjacent atomic data fields are being modified? I have not been curious enough to check the generated code. I would be mildly surprised if the compiler were clever enough to omit a memory fence instruction in the middle of multiple atomic operations. I wonder if we could help the situation by using posix_memalign() (via our aligned_malloc() wrapper) and by providing hints with my_assume_aligned.

If there are multiple data members that are to be updated in sync with each other, it is easiest to use an explicit mutex or lock around the operations. What I suggest above could at most be a potential performance optimization, or a technique to avoid some of the overhead of doing things correctly.

Sometimes, multiple fields can be folded into a single one, like I recently did to some fil_space_t members in order to reduce fil_system.mutex contention in MDEV-23855.
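A sketch of that folding technique (the layout and names below are invented for illustration; fil_space_t uses its own encoding): two logically related 32-bit values packed into one 64-bit atomic word can be read and written together in a single atomic operation, with no mutex and no torn reads.

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> packed{0};

constexpr uint64_t pack(uint32_t hi, uint32_t lo) {
    return (uint64_t{hi} << 32) | lo;
}
constexpr uint32_t hi_part(uint64_t v) { return uint32_t(v >> 32); }
constexpr uint32_t lo_part(uint64_t v) { return uint32_t(v); }

// Incrementing the low half is a single fetch_add, as long as the low
// half cannot overflow into the high half.
uint32_t bump_lo() {
    return lo_part(packed.fetch_add(1, std::memory_order_relaxed)) + 1;
}
```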

If the operations require adjustment (say, ‘subtract one if the number was not 0’), then I think that a proper implementation with atomic operations would involve a loop with std::atomic::compare_exchange_strong(). I do not know, but it might be useful to ‘seed’ the initial iteration of the loop with a std::atomic::load(). Such tweaks should ideally be validated with (micro)benchmarks.
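The ‘subtract one if the number was not 0’ operation described above could look like the following sketch (not the MariaDB implementation): a compare_exchange_strong() loop seeded with a relaxed load.

```cpp
#include <atomic>
#include <cstdint>

bool decrement_if_positive(std::atomic<uint64_t>& n) {
    uint64_t cur = n.load(std::memory_order_relaxed);  // seed the first try
    while (cur != 0) {
        // On failure, compare_exchange_strong refreshes 'cur' with the
        // current value, so the loop retries with up-to-date state.
        if (n.compare_exchange_strong(cur, cur - 1,
                                      std::memory_order_relaxed))
            return true;   // we performed the decrement
    }
    return false;          // counter was already zero
}
```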

Comment by Marko Mäkelä [ 2022-02-21 ]

jsantala, did you have a chance to test 10.6 with ThreadSanitizer? In MDEV-25029, a per-table mutex was introduced for protecting the statistics. When I reviewed those changes, I made some notes regarding possible future improvements:

Could use dict_table_t::lock_mutex instead of dict_sys.latch

  • dict_set_corrupted_index_cache_only(), index->type, table->corrupted, table->file_unreadable
  • fts_get_table_name() and other users of table->name (it could be better to extend the MDL protection)
  • innobase_build_v_templ()
  • dict_table_t::indexes and anything that accesses them potentially without holding MDL, e.g., dict_table_check_for_dup_indexes()

Already protected by metadata lock (MDL, table name lock)

  • row_import::set_root_by_heuristic() should be protected by MDL_EXCLUSIVE acquired for ALTER TABLE xxx IMPORT TABLESPACE as well as exclusive InnoDB table lock
  • row_quiesce_table_has_fts_index() should be protected by the MDL acquired for FLUSH TABLES xxx FOR EXPORT

Should be made protected by dict_table_t::lock_mutex:

  • dict_table_n_rows_inc(), dict_table_n_rows_dec(), table->stat_n_rows (currently not protected and thus inaccurate)

This ticket could be about the last category.

Comment by Jukka Santala [ 2022-02-28 ]

I noticed there were some changes to the area, but the behavior with regards to this is still the same. Most of the problem is with stat_modified_counter:

        ulonglong       counter = table->stat_modified_counter++;
        ulonglong       n_rows = dict_table_get_n_rows(table);
 
...
 
                if (counter > n_rows / 10 /* 10% */
                    && dict_stats_auto_recalc_is_enabled(table)) {
 
...
 
                        dict_stats_recalc_pool_add(table->id);
                        table->stat_modified_counter = 0;
                }

There are additional related issues:

  • people are using stat_modified_counter as a change counter, without realizing it will be reset (e.g. MDEV-9963, MDEV-8297)
  • the counter isn't persistent, so larger tables may never have their stats recalculated
  • really small tables are unnecessarily re-analyzed all the time
  • since the counter is compared against the current row count, DELETEs cause a recalc about 20% more often than INSERTs
  • dict_stats_update_if_needed isn't called on every change, so the recalc is unpredictable and modified_counter can overshoot the target

Most of those are minor issues, though it's hard to tell without testing. I had originally used a countdown to avoid comparing two non-thread-safe variables and a few of the corner cases, and added std::atomic, but I'm not sure how to best do the queue+reset and make the counter persist. The mutex would certainly make that easier, at some performance tradeoff.

Generated at Thu Feb 08 09:28:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.