MariaDB Server / MDEV-20414

"a long semaphore wait" After Upgrading to 10.1.41

Details

    Description

      Ever since we upgraded to 10.1.41, we have been having non-stop issues on several of our large servers: the logs start filling up with semaphore errors, and working with any InnoDB table seems to lock up.
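
      (For reference, the semaphore waits InnoDB complains about can also be watched live while the problem is happening. A minimal sketch, assuming shell access on the database host:)

      # Rough sketch: extract the SEMAPHORES section from the InnoDB status output.
      # The 25-line window after the header is an assumption; the section length
      # varies, so adjust as needed.
      mysql -e "SHOW ENGINE INNODB STATUS\G" | grep -A 25 "^SEMAPHORES"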

      Doing research, I found people talking about disabling `innodb_adaptive_hash_index` by setting it to 0 to fix the issue, and I have tried this without success: with it disabled, the issue still crops back up.
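
      (For reference, a minimal sketch of turning the setting off both at runtime and across restarts; the config file path is an assumption and depends on the distribution:)

      # Disable the adaptive hash index on the running server (dynamic variable,
      # no restart needed).
      mysql -e "SET GLOBAL innodb_adaptive_hash_index = OFF;"

      # To keep it disabled after a restart, add it to the server config as well,
      # e.g. /etc/my.cnf.d/server.cnf on RPM-based installs (path is an assumption):
      #   [mysqld]
      #   innodb_adaptive_hash_index = 0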

      Doing some more research, I found MDEV-16467 (https://jira.mariadb.org/browse/MDEV-16467), which was listed in the release notes for MariaDB 10.1.41, so I am wondering whether it is related to our issues.

      Any help will be awesome. I have attached the log output we see.

          Activity

            james.coleman-ah James Coleman added a comment -

            We are still examining this issue to this day. As an update, it seems to only pop up when MariaDB is under heavy load, whether that is lots of cron jobs running heavy queries at the same time or lots of traffic on a site.


            marko Marko Mäkelä added a comment -

            james.coleman-ah, could you attach gdb to the mysqld process while it is hung, and provide a stack of all threads, and also information about the dictionary latches?

            gdb -ex "set pagination 0" -ex "thread apply all bt" -ex "print *dict_operation_lock" -ex "print *dict_sys->mutex" --batch -p $(pgrep -x mysqld)
            

            We want to be able to detect the chain of blocked mutexes or rw-latches and threads.
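
            (A small sketch of one way to capture that output so it can be attached here; the dump.txt filename is just an example:)

            # Save the batch gdb output to a file for attaching to the issue.
            gdb -ex "set pagination 0" -ex "thread apply all bt" \
                -ex "print *dict_operation_lock" -ex "print *dict_sys->mutex" \
                --batch -p $(pgrep -x mysqld) > dump.txt 2>&1

            # Quick sanity check on the saved dump: roughly how many threads are
            # blocked inside InnoDB sync primitives (each such backtrace contains
            # sync_array_wait_event).
            grep -c "sync_array_wait_event" dump.txt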

            james.coleman-ah James Coleman added a comment -

            I was finally able to get a dump for you; it is attached.


            marko Marko Mäkelä added a comment -

            james.coleman-ah, thank you. Unfortunately, the two print commands were rejected because no debugging symbols were available. Often they are part of a separate package, such as MariaDB-server-debuginfo.

            My guess that it could be one of the InnoDB data dictionary latches seems to be right, but because those print commands were rejected, we cannot know for sure. In dump.txt we see several threads waiting for the dict_operation_lock S-latch, mostly on FOREIGN KEY checks, but also at least once on rollback (thread 130).

            Among the 170 threads, it is not trivial to see which thread is actually holding (or waiting for) the dict_operation_lock X-latch and thus blocking all the S-latch requests. One X-latch request is pending (not already granted). It could be blocked by a thread that is already holding the X-latch or an S-latch. Subsequent X-latch requests should be blocked by this pending X-latch request (if they were not already blocked by an X-latch held by some thread).

            Thread 154 (Thread 0x7fd5823fd700 (LWP 7368)):
            #0  0x00007fdc309309f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
            #1  0x0000560a16fb5ebb in os_event_wait_low(os_event*, long) ()
            #2  0x0000560a17030263 in sync_array_wait_event(sync_array_t*, unsigned long) ()
            #3  0x0000560a17032e27 in rw_lock_x_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long, bool, bool) ()
            #4  0x0000560a170ed396 in dict_stats_save(dict_table_t*, unsigned long const*) [clone .part.73] ()
            #5  0x0000560a170edb98 in dict_stats_update(dict_table_t*, dict_stats_upd_option_t) ()
            #6  0x0000560a170effa0 in dict_stats_thread ()
            #7  0x00007fdc3092ce65 in start_thread () from /lib64/libpthread.so.0
            #8  0x00007fdc2ea9988d in clone () from /lib64/libc.so.6
            

            I think that valerii has mentioned that setting STATS_AUTO_RECALC=1 STATS_PERSISTENT=1 for InnoDB tables can cause problems. I failed to find an open bug report for that one, though.
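
            (If you want to rule this out, a rough sketch for checking which tables opt into those options and for turning off automatic recalculation globally; the LIKE filter in the query is only an example:)

            # List InnoDB tables whose CREATE_OPTIONS explicitly mention the
            # persistent-statistics settings.
            mysql -e "SELECT TABLE_SCHEMA, TABLE_NAME, CREATE_OPTIONS
                      FROM information_schema.TABLES
                      WHERE ENGINE = 'InnoDB'
                        AND CREATE_OPTIONS LIKE '%stats%';"

            # Stop automatic persistent-statistics recalculation server-wide.
            # Tables that set STATS_AUTO_RECALC=1 explicitly still override this.
            mysql -e "SET GLOBAL innodb_stats_auto_recalc = OFF;"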

            Given that you seem to have lots of concurrent operations that involve FOREIGN KEY constraints, MDEV-16467 might indeed have something to do with this.

            Without having more information available, it is very difficult to diagnose this. Next time the hang occurs, could you please generate a core dump? Then, with the appropriate debugging symbols installed, we should be able to extract more data from it.
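
            (One way to do that, sketched under the assumption of an RPM-based install where the symbols come from the MariaDB-server-debuginfo package mentioned above; note that writing the core pauses the server while it runs and the file can be as large as the process's memory:)

            # Install the debugging symbols (exact repository setup depends on how
            # MariaDB was installed).
            sudo yum install MariaDB-server-debuginfo

            # While the server is hung, write a core file of the running mysqld
            # without killing it; gcore appends the PID to the output name.
            sudo gcore -o /tmp/mysqld-hang $(pgrep -x mysqld)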


            marko Marko Mäkelä added a comment -

            Some statistics-related hangs were fixed in MDEV-15020, MDEV-29883, and MDEV-30638.


            People

              marko Marko Mäkelä
              james.coleman-ah James Coleman