[MDEV-15506] Semaphore Wait locking database queries Created: 2018-03-07  Updated: 2020-02-13

Status: Open
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Storage Engine - InnoDB
Affects Version/s: 10.2.13
Fix Version/s: 10.2

Type: Bug Priority: Major
Reporter: Steve Norwood Assignee: Marko Mäkelä
Resolution: Unresolved Votes: 4
Labels: innodb, recovery, semaphore
Environment:

Windows Server 2012


Attachments: Text File MariaDB_report_MDEV_15506.txt     Text File error_file.txt     File my.ini    

 Description   

Following on from an initial issue discussed in this Stack Overflow post (https://stackoverflow.com/questions/49131991/can-you-rollback-a-query-in-state-committing-alter-table-to-storage-engine), where an ALTER TABLE statement appeared to be stuck, we killed the query and have had server issues since.

We restarted the server, as queries were locking up the whole server. While normally the queries would time out and die, these were all timing out and then becoming stuck in a 'Killed' state. Restarting the server removed these hanging queries, and everything seemed to be back up and running. However, the queries built back up again, and it seemed that the rollback was still happening, despite the fact that the table that was initially being altered appeared to be intact. We found that there was a temporary table still in the database, and thought that the issue was that this temporary table had not been removed after the ALTER TABLE statement was rolled back.

We have since used innodb_force_recovery=3 to remove this temporary table, but the issue has persisted. Looking deeper into the error logs, there are a lot of semaphore waits in the server, which may have stemmed from users still trying to query the server while an ALTER TABLE statement was locking it. The wait that stuck out the most was this one:

2018-03-07 12:27:43 5968 [Note] InnoDB: A semaphore wait:
--Thread 4304 has waited at row0mysql.cc line 3383 for 219.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 00000103B9E9D870 created in file dict0dict.cc line 1107
a writer (thread id 4304) has reserved it in mode wait exclusive
number of readers 4, waiters flag 0, lock_word: fffffffffffffffc
Last time read locked in file row0purge.cc line 853
Last time write locked in file fts0fts.cc line 7190

It appears here that the thread is waiting on itself, which could be the reason the whole server is locking up. However, we are not certain about this. It has been discussed a little in this MySQL bug report (https://bugs.mysql.com/bug.php?id=73890), but not a lot has been provided in terms of a solution.
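The self-wait pattern described above can be checked mechanically against the log lines; the following is a minimal sketch (the sed patterns are illustrative assumptions, not an official diagnostic tool):

```shell
# Sketch: detect the self-wait pattern in a saved error-log excerpt.
# The excerpt is inlined from the report so the example is self-contained;
# in practice you would read the server's error log instead.
log='--Thread 4304 has waited at row0mysql.cc line 3383 for 219.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 00000103B9E9D870 created in file dict0dict.cc line 1107
a writer (thread id 4304) has reserved it in mode wait exclusive'

# Extract the waiting thread id and the id of the writer holding the latch.
waiter=$(printf '%s\n' "$log" | sed -n 's/^--Thread \([0-9]*\) has waited.*/\1/p')
holder=$(printf '%s\n' "$log" | sed -n 's/.*writer (thread id \([0-9]*\)).*/\1/p')

# If both ids match, the thread is blocked on a latch it reserved itself.
if [ "$waiter" = "$holder" ]; then
  echo "self-wait: thread $waiter is waiting on a latch it already reserved"
fi
```

If the two ids differ, the wait is an ordinary contention case rather than the self-wait suspected here.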

After attempting various solutions, we have had to grab a backup of the server and run on older data while we update what we can, as this server is currently unusable outside of running in recovery mode. I've attached our my.ini (from before the server went into recovery mode) and an extract of our error logs from when the errors started occurring. I've had to remove database names and the like, replacing them with placeholders where necessary, to protect our data as much as possible.

For a bit of context, this happened on our production server, as we believed that with the latest updates to MariaDB this sort of change would be possible. The ALTER was run on a table whose .ibd file is 56.8 GB.



 Comments   
Comment by Marko Mäkelä [ 2019-09-30 ]

I am terribly sorry for missing this report for more than a year.

Usually, for hangs, I would like to see the stack traces of all threads during the hang:

gdb -ex "set pagination 0" -ex "thread apply all bt" -ex "print dict_operation_lock" -ex "print dict_sys.mutex" --batch -p $(pgrep mysqld)

This command should also display the holders of the InnoDB data dictionary latches. Those often seem to cause hangs or degraded performance.

If the ALTER TABLE involved any FULLTEXT INDEX, then there are some known hangs, which we are working on, such as MDEV-19529.

Comment by Bernardo Perez [ 2019-11-18 ]

Hello Marko,

I know what I am going to report is from MySQL mainline, but it is exactly the same behavior as described here, and this is the only report I've found. I believe it is the same issue.

This is my thread in the SEMAPHORES section, latched on itself:

----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 9
--Thread 47674168452864 has waited at srv0srv.cc line 1968 for 834.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x2b5bddbf6998 created in file dict0dict.cc line 1184
a writer (thread id 47674168452864) has reserved it in mode wait exclusive
number of readers 1, waiters flag 0, lock_word: ffffffffffffffff
Last time read locked in file row0purge.cc line 861
Last time write locked in file /local/mysql-5.7.16.R2/storage/innobase/fts/fts0fts.cc line 7574
OS WAIT ARRAY INFO: signal count 6
RW-shared spins 0, rounds 7, OS waits 3
RW-excl spins 0, rounds 0, OS waits 0
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 7.00 RW-shared, 0.00 RW-excl, 0.00 RW-sx

This is the thread itself:

Thread 11 (Thread 0x2b4ff7a02700 (LWP 8773)):
#0 0x00002b4ce2e186d5 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x0000000000f6bb61 in wait (this=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/os/os0event.cc:165
#2 os_event::wait_low (this=0x2b4fd37f6b78, reset_sig_count=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/os/os0event.cc:335
#3 0x0000000000f6beaa in os_event_wait_low (event=<optimized out>, reset_sig_count=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/os/os0event.cc:534
#4 0x0000000001024a98 in sync_array_wait_event (arr=arr@entry=0x2b4ce53664d8, cell=@0x2b4ff7a01d50: 0x2b4ce66f1058) at /local/mysql-5.7.16.R2/storage/innobase/sync/sync0arr.cc:475
#5 0x0000000001026cc6 in rw_lock_x_lock_wait_func (line=1968, file_name=0x15d74f0 "/local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc", threshold=0, lock=0x2b4fd37f6998) at /local/mysql-5.7.16.R2/storage/innobase/sync/sync0rw.cc:510
#6 rw_lock_x_lock_low (line=1968, file_name=0x15d74f0 "/local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc", pass=0, lock=0x2b4fd37f6998) at /local/mysql-5.7.16.R2/storage/innobase/sync/sync0rw.cc:566
#7 rw_lock_x_lock_func (lock=lock@entry=0x2b4fd37f6998, pass=pass@entry=0, file_name=file_name@entry=0x15d74f0 "/local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc", line=line@entry=1968) at /local/mysql-5.7.16.R2/storage/innobase/sync/sync0rw.cc:729
#8 0x000000000100e606 in pfs_rw_lock_x_lock_func (pass=0, file_name=0x15d74f0 "/local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc", line=1968, lock=0x2b4fd37f6998) at /local/mysql-5.7.16.R2/storage/innobase/include/sync0rw.ic:711
#9 srv_master_evict_from_table_cache (pct_check=pct_check@entry=100) at /local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc:1968
#10 0x0000000001012db4 in srv_master_do_idle_tasks () at /local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc:2216
#11 srv_master_thread (arg=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc:2371
#12 0x00002b4ce2e14dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00002b4ce4389c9d in clone () from /lib64/libc.so.6

I've run the command you requested and attached the file with the state of all the threads at the same time.

To give some context on the issue: this was a restore from a disk snapshot which, after startup, would always enter a self-deadlock scenario resulting in the process dying. At the same time, the database was also exhibiting https://jira.mariadb.org/browse/MDEV-14637 / https://bugs.mysql.com/bug.php?id=80919; in particular, the 3 purge threads were blocked in GIS purge operations.

I reduced innodb_purge_threads to 1 (as can be seen in the attached file, which shows only 1 purge thread), and the DB still crashed due to this single thread latched on itself. We have seen this in our systems a few times, as we run a very large deployment, which just increases the chances of this happening.
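For reference, the purge-thread reduction described above corresponds to a server setting along these lines (a sketch of the relevant my.cnf/my.ini fragment; the value shown is the one used in this comment):

```ini
[mysqld]
# Run a single purge thread (the default in MySQL 5.7 / MariaDB 10.2 is 4);
# used here to narrow down the purge-related blocking, not as general tuning.
innodb_purge_threads = 1
```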

The system is gone, so I don't have more info to provide, but I am sure we will eventually get another occurrence. If there is more info you would like us to gather next time, please let me know; we will be happy to collect it.

MariaDB_report_MDEV_15506.txt

Comment by Bernardo Perez [ 2020-02-13 ]

Regarding my previous comment: that crash is certainly related to https://jira.mariadb.org/browse/MDEV-21512, which in turn is based on this MySQL bug and fix: https://github.com/mysql/mysql-server/commit/5cdbb22b51cf2b35dbdf5666a251ffbec2f84dec

Generated at Thu Feb 08 08:21:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.