[MDEV-25111] Long semaphore wait (> 800 secs), server stops responding Created: 2021-03-10  Updated: 2021-04-29  Resolved: 2021-04-29

Status: Closed
Project: MariaDB Server
Component/s: Galera, Storage Engine - InnoDB
Affects Version/s: 10.3.28
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Ere Maijala Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 1
Labels: need_feedback
Environment:

CentOS 7.9.2009

Defaults: mysqld would have been started with the following arguments:
--datadir=/data/mariadb --socket=/var/lib/mysql/mysql.sock --user=mysql --symbolic-links=0 --max_allowed_packet=16M --thread_cache_size=8 --max_connections=1550 --slow_query_log=1 --innodb_buffer_pool_size=49152M --innodb_log_file_size=2048M --innodb_log_buffer_size=16M --innodb_print_all_deadlocks=on --log-warnings=2 --plugin_load_add=auth_socket --default_storage_engine=InnoDB --innodb_file_per_table=1 --innodb_flush_log_at_trx_commit=0 --innodb_doublewrite=1 --log_slave_updates=1 --log_bin=bin-log --server_id=5555 --binlog_format=ROW --innodb_autoinc_lock_mode=2 --expire_logs_days=1 --wsrep_provider=/usr/lib64/galera/libgalera_smm.so --wsrep_cluster_address=gcomm://[redacted] --wsrep_node_address=[redacted] --wsrep_sst_method=mariabackup --wsrep_sst_auth=[redacted] --wsrep_provider_options=evs.keepalive_period = PT1S; evs.inactive_check_period = PT1S; evs.suspect_timeout = PT5S; evs.inactive_timeout = PT15S; evs.install_timeout = PT15S; gcache.size=2G --wsrep_on=ON


Attachments: Text File mariadb-semaphore-wait.log     Text File mysql-galera.log    
Issue Links:
Duplicate
is duplicated by MDEV-25190 Semaphore wait has lasted > 600 secon... Closed
Relates
relates to MDEV-24275 InnoDB persistent stats analyze force... Closed
relates to MDEV-24606 InnoDB: Semaphore wait has lasted > 6... Closed
relates to MDEV-24831 Galera test failure because of safe_m... Closed
relates to MDEV-25114 Crash: WSREP: invalid state ROLLED_BA... Closed

 Description   

We had been running MariaDB 10.3.27 with Galera cluster in production without any issues.

Less than 26 hours after updating to 10.3.28 one of three servers stopped responding.

The following errors were logged for more than 800 seconds until I restarted MariaDB:

2021-03-10 15:10:45 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140231647176448 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
2021-03-10 15:10:45 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140286587229952 has waited at lock0lock.cc line 3882 for 241.00 seconds the semaphore:
Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
[clip]
2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
--Thread 140232309577472 has waited at row0row.cc line 1133 for 239.00 seconds the semaphore:
X-lock on RW-latch at 0x7f8f0086a270 created in file buf0buf.cc line 1563
a writer (thread id 140232091498240) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time write locked in file row0row.cc line 1133
[clip]
2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
--Thread 140231088736000 has waited at lock0lock.cc line 3882 for 238.00 seconds the semaphore:
Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
--Thread 140232326362880 has waited at srv0srv.cc line 2026 for 238.00 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x55627310f3e0 created in file dict0dict.cc line 920
a writer (thread id 140232326362880) has reserved it in mode  wait exclusive
number of readers 4, waiters flag 1, lock_word: fffffffc
Last time write locked in file srv0srv.cc line 2026
2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
--Thread 140231088121600 has waited at lock0lock.cc line 3882 for 238.00 seconds the semaphore:
Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2
2021-03-10 15:10:45 0 [Note] InnoDB: A semaphore wait:
--Thread 140231087814400 has waited at lock0lock.cc line 3882 for 238.00 seconds the semaphore:
Mutex at 0x5562730bb380, Mutex LOCK_SYS created lock0lock.cc:461, lock var 2

I'll attach the full log from when the problem started. The log also contains InnoDB Monitor output.

This sounds similar to MDEV-24375, but seems to be a different issue since MDEV-24275 should have fixed that in 10.3.28 if I got things right.



 Comments   
Comment by Marko Mäkelä [ 2021-03-10 ]

emaijala, yes, this must be different from MDEV-24275 and MDEV-24188, which I suspect MDEV-24375 to be a duplicate report of. I would tend to believe that this has something to do with Galera, while the above tickets affect plain InnoDB and Galera alike.

The InnoDB diagnostic output is virtually useless for diagnosing hangs or deadlocks. (In fact, it will be removed in 10.6: MDEV-21452.) We should find InnoDB hangs in our internal stress testing, and I am not currently aware of any hangs in InnoDB code. (But, I cannot speak about Galera.)

For anything nontrivial, we really need stack traces of all threads during the hang or performance degradation. You can use some tool like http://poormansprofiler.org/ as well. But, without some stack trace output, I am afraid that we cannot proceed anywhere.

Two notable Galera-related changes in the most recent releases that might have something to do with this were MDEV-23328 and MDEV-23536.

Comment by Ere Maijala [ 2021-03-11 ]

@Marko Mäkelä, thanks for the information. I've now prepared to get stack traces in case we don't decide to downgrade to 10.3.27.

Comment by JDT [ 2021-03-19 ]

We are also experiencing this issue since upgrading to 10.3.28. We're also using Galera and CentOS 7.9.2009. I've attached a log below.
mysql-galera.log

I will also try to get some more debug information from our Galera instance. I'll try to dump the running queries to a file every minute and try to narrow down the query that triggers the problem if possible.

Comment by Marko Mäkelä [ 2021-03-24 ]

MDEV-24831 might be related to this. Those warnings can only appear in 10.6 debug builds, thanks to MDEV-21452.

Generated at Thu Feb 08 09:35:13 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.