Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-33046

race condition in InnoDB dict_stats_schedule()

Details

    Description

      dict_stats_schedule() can be invoked with ms=0. In that case dict_stats_func is
      invoked immediately, it calls dict_stats_process_entry_from_recalc_pool()
      which at the end might try to call dict_stats_schedule() again to
      queue another recalc. And it can happen that the first
      dict_stats_schedule(0) call didn't release dict_stats_mutex yet,
      so the second dict_stats_schedule() won't queue a recalc. And as a result the table
      won't have its stats recalculated at all, not now, not later.

      This causes innodb.innodb_stats_auto_recalc to fail sporadically.

      The fix could be to revert commit 7c7f9bef28a, use mutex locks, not trylocks, and resolve the deadlock somehow differently.

      Attachments

        Activity

          The test innodb.innodb_stats_auto_recalc is failing very often on mandatory builders of 10.6 and later branches despite the workaround of changing the parameter to ms=10.

          marko Marko Mäkelä added a comment - The test innodb.innodb_stats_auto_recalc is failing very often on mandatory builders of 10.6 and later branches despite the workaround of changing the parameter to ms=10 .
          wlad Vladislav Vaintroub added a comment - - edited

          I'm not sure the initial analysis was 100% correct. if timer is scheduled by 2 threads the very same time, and this schedule is protected by try_lock, one try_lock succeeds, another one fails. timer will schedule either now or in 10 seconds from now, and at the given time, it would fire. So "neither now, not later" seems incorrect, it is more "either now, or later".

          Having said that, the protection does not seem to make much sense anyway, timer::set_time implementations both have internal synchronization.

          Why is this filed against 10.6? 10.5 has that logic, too. It has the same test innodb.innodb_stats_auto_recalc, which , according to buildbot cross-reference never failed in 10.5

          wlad Vladislav Vaintroub added a comment - - edited I'm not sure the initial analysis was 100% correct. if timer is scheduled by 2 threads the very same time, and this schedule is protected by try_lock, one try_lock succeeds, another one fails. timer will schedule either now or in 10 seconds from now, and at the given time, it would fire. So "neither now, not later" seems incorrect, it is more "either now, or later". Having said that, the protection does not seem to make much sense anyway, timer::set_time implementations both have internal synchronization. Why is this filed against 10.6? 10.5 has that logic, too. It has the same test innodb.innodb_stats_auto_recalc, which , according to buildbot cross-reference never failed in 10.5

          it is "either now or later", and sometimes it happens to be "now". That is dict_stats_process_entry_from_recalc_pool() is called now. But inside it has this condition

            const bool update_now=
              difftime(time(nullptr), table->stats_last_recalc) >= MIN_RECALC_INTERVAL;
          

          and in the test the time(NULL) is equal to table->stats_last_recalc on the first invocation, so update does not happen, it's queued to happen later. That is, it's not queued, because trylock fails. So "now" happens, indeed, but doesn't do anything, and "later" is not queued.

          serg Sergei Golubchik added a comment - it is "either now or later", and sometimes it happens to be "now". That is dict_stats_process_entry_from_recalc_pool() is called now. But inside it has this condition const bool update_now= difftime ( time (nullptr), table->stats_last_recalc) >= MIN_RECALC_INTERVAL; and in the test the time(NULL) is equal to table->stats_last_recalc on the first invocation, so update does not happen, it's queued to happen later. That is, it's not queued, because trylock fails. So "now" happens, indeed, but doesn't do anything, and "later" is not queued.
          wlad Vladislav Vaintroub added a comment - - edited

          the thing is this. There was no race condition involving timers. It was like this in 10.5, and it worked in 10.5. The important part in the is :
          if recalc_pool still has tables, dict_stats_func or functions it calls need to reschedule. This was true in 10.5, but 10.6 , much of the logic has changed due to dictionary using THDs and MDL acquisition, and this is no longer the case. I'll fix that.

          wlad Vladislav Vaintroub added a comment - - edited the thing is this. There was no race condition involving timers. It was like this in 10.5, and it worked in 10.5. The important part in the is : if recalc_pool still has tables, dict_stats_func or functions it calls need to reschedule. This was true in 10.5, but 10.6 , much of the logic has changed due to dictionary using THDs and MDL acquisition, and this is no longer the case. I'll fix that.

          People

            wlad Vladislav Vaintroub
            serg Sergei Golubchik
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.