[MDEV-20414] "a long semaphore wait" After Upgrading to 10.1.41 Created: 2019-08-23  Updated: 2023-02-17

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.1.41
Fix Version/s: 10.1

Type: Bug Priority: Major
Reporter: James Coleman Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 0
Labels: None
Environment:

CentOS Linux release 7.6.1810


Attachments: InnoDB Monitor Output Log.txt (text file), dump.txt (text file)
Issue Links:
Relates
relates to MDEV-21253 10.1.43 crash Closed
relates to MDEV-20243 semaphore hang and crash in 10.3.16 w... Closed

 Description   

Ever since we upgraded to 10.1.41, we have been having non-stop issues on several of our large servers: the logs start filling up with semaphore errors, and working with any InnoDB table seems to lock up.

While researching, I found reports that disabling `innodb_adaptive_hash_index` (setting it to 0) fixes the issue, but I have tried this without success: with it disabled, the issue still crops back up.

Doing some more research, I found MDEV-16467 (https://jira.mariadb.org/browse/MDEV-16467), which was in the changelog for MariaDB 10.1.41, so I am wondering whether it is related to our issues.

Any help will be awesome. I have attached the log output we see.



 Comments   
Comment by James Coleman [ 2019-10-23 ]

We are still examining this issue to this day. As an update, it only seems to pop up when MariaDB is under heavy load, whether from many cron jobs running heavy queries at the same time or from lots of traffic on a site.

Comment by Marko Mäkelä [ 2019-10-23 ]

james.coleman-ah, could you attach gdb to the mysqld process while it is hung, and provide stack traces of all threads, along with information about the dictionary latches?

gdb -ex "set pagination 0" -ex "thread apply all bt" -ex "print *dict_operation_lock" -ex "print *dict_sys->mutex" --batch -p $(pgrep -x mysqld)

We want to be able to detect the chain of blocked mutexes or rw-latches and threads.

Comment by James Coleman [ 2019-11-08 ]

I was finally able to get a dump for you; it is attached.

Comment by Marko Mäkelä [ 2019-11-11 ]

james.coleman-ah, thank you. Unfortunately, the two print commands were rejected, because no debugging symbols were available. Often they are part of a separate package, such as MariaDB-server-debuginfo.
My guess that it could be one of the InnoDB data dictionary latches seems to be right, but because those print commands were rejected, we cannot know for sure. In dump.txt we see several threads waiting for dict_operation_lock S-latch, mostly on FOREIGN KEY check, but also at least once on rollback (thread 130).
Among the 170 threads, it is not trivial to see which thread is actually holding (or waiting for) the dict_operation_lock X-latch and thus blocking all the S-latch requests. One X-latch request is pending (not already granted). It could be blocked by a thread that is already holding the X-latch or an S-latch. Subsequent X-latch requests should be blocked by this pending X-latch request (if they were not already blocked by an X-latch held by some thread).

Thread 154 (Thread 0x7fd5823fd700 (LWP 7368)):
#0  0x00007fdc309309f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000560a16fb5ebb in os_event_wait_low(os_event*, long) ()
#2  0x0000560a17030263 in sync_array_wait_event(sync_array_t*, unsigned long) ()
#3  0x0000560a17032e27 in rw_lock_x_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long, bool, bool) ()
#4  0x0000560a170ed396 in dict_stats_save(dict_table_t*, unsigned long const*) [clone .part.73] ()
#5  0x0000560a170edb98 in dict_stats_update(dict_table_t*, dict_stats_upd_option_t) ()
#6  0x0000560a170effa0 in dict_stats_thread ()
#7  0x00007fdc3092ce65 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fdc2ea9988d in clone () from /lib64/libc.so.6
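The blocking chain described above can be sketched with a minimal writer-preferring read-write lock. This is a hypothetical Python analogue for illustration only, not InnoDB's actual rw_lock_t implementation; the key behaviour is that once one X (writer) request is queued, subsequent S (reader) requests block behind it:

```python
import threading
import time

class WriterPreferringRWLock:
    """Toy writer-preferring rw-lock: a pending writer (X) request
    blocks new readers (S) until the writer has run."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0           # threads currently holding S
        self._writer = False        # a thread currently holds X
        self._waiting_writers = 0   # queued X requests

    def acquire_read(self):
        with self._cond:
            # A pending X request blocks new S requests.
            while self._writer or self._waiting_writers:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            self._waiting_writers += 1
            while self._writer or self._readers:
                self._cond.wait()
            self._waiting_writers -= 1
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()

# Demonstration: reader holds S, writer queues X, second reader stalls.
lock = WriterPreferringRWLock()
lock.acquire_read()                               # thread A holds S

writer = threading.Thread(target=lock.acquire_write)
writer.start()                                    # X request becomes pending
while True:                                       # wait until it is queued
    with lock._cond:
        if lock._waiting_writers:
            break
    time.sleep(0.01)

got_second_read = threading.Event()
reader = threading.Thread(
    target=lambda: (lock.acquire_read(), got_second_read.set()))
reader.start()

assert not got_second_read.wait(0.2)  # S request blocked by the pending X
lock.release_read()                   # the writer acquires X...
writer.join()
lock.release_write()                  # ...and only now does the reader get S
reader.join()
assert got_second_read.is_set()
```

This is why a single slow X-latch holder (or even a single pending X-latch request) can stall dozens of threads that only want the S-latch, matching the pile-up of FOREIGN KEY checks seen in dump.txt.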

I think that valerii has mentioned that setting STATS_AUTO_RECALC=1, STATS_PERSISTENT=1 for InnoDB tables can cause problems. I failed to find an open bug report for that one, though.

Given that you seem to have lots of concurrent operations that involve FOREIGN KEY constraints, MDEV-16467 might indeed have something to do with this.

Without having more information available, it is very difficult to diagnose this. Next time the hang occurs, could you please generate a core dump? Then, with the appropriate debugging symbols installed, we should be able to extract more data from it.
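In the meantime, a large thread dump like dump.txt can be triaged mechanically. Here is a rough sketch (my own, assuming gdb's `thread apply all bt` output format) that tallies which synchronization function each thread is blocked in, skipping the low-level pthread/os_event wrapper frames:

```python
import re
from collections import Counter

# Low-level wrappers to skip so the tally shows the latch being waited on.
SKIP = ("pthread_", "os_event", "sync_array", "start_thread", "clone")

def top_wait_frames(gdb_output):
    """Count threads by the first 'interesting' frame of each backtrace."""
    counts = Counter()
    for block in re.split(r"\n(?=Thread \d+ )", gdb_output):
        frames = re.findall(r"#\d+\s+(?:0x[0-9a-f]+ in )?([\w:~]+)", block)
        interesting = [f for f in frames
                       if not f.startswith(SKIP) and not f.startswith("0x")]
        if interesting:
            counts[interesting[0]] += 1
    return counts

# Example with the single backtrace quoted above:
sample = """Thread 154 (Thread 0x7fd5823fd700 (LWP 7368)):
#0  0x00007fdc309309f5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000560a16fb5ebb in os_event_wait_low(os_event*, long) ()
#2  0x0000560a17030263 in sync_array_wait_event(sync_array_t*, unsigned long) ()
#3  0x0000560a17032e27 in rw_lock_x_lock_func(rw_lock_t*, unsigned long, char const*, unsigned long, bool, bool) ()
#4  0x0000560a170ed396 in dict_stats_save(dict_table_t*, unsigned long const*) [clone .part.73] ()
#5  0x0000560a170edb98 in dict_stats_update(dict_table_t*, dict_stats_upd_option_t) ()
#6  0x0000560a170effa0 in dict_stats_thread ()
#7  0x00007fdc3092ce65 in start_thread () from /lib64/libpthread.so.0
#8  0x00007fdc2ea9988d in clone () from /lib64/libc.so.6
"""
print(top_wait_frames(sample).most_common(5))
# → [('rw_lock_x_lock_func', 1)]
```

Run against the full 170-thread dump, the most_common() output gives a quick picture of which latch acquisition functions dominate, which is a good starting point for finding the blocking chain.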

Comment by Marko Mäkelä [ 2023-02-17 ]

Some statistics related hangs were fixed in MDEV-15020, MDEV-29883, and MDEV-30638.

Generated at Thu Feb 08 08:59:18 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.