Details
Description
When running mariadbd with --wsrep-recover, it is expected to exit. But with --plugin-load-add=ha_spider.so, it does not exit.
Given mariadbd is called with --wsrep-recover indirectly as part of the ExecStartPre in the systemd service, this can cause a hang on systemd [re]start.
Attachments
Issue Links
- is blocked by
-
MDEV-27912 --source include/restart_mysqld.inc fails in Spider test suite
-
- Closed
-
- relates to
-
MDEV-20179 Server hangs on shutdown during installation of Spider
-
- Closed
-
-
MDEV-22979 "mysqld --bootstrap" / mysql_install_db hangs when Spider is installed
-
- Closed
-
-
MDEV-27233 Server hangs when using --init-file which loads Spider and creates a Spider table
-
- Closed
-
Activity
Result from git-bisect, which is also the first bad commit for MDEV-27233 and MDEV-22979 (10.4):
b428b09997d172f29fc201b9ab05c160ef4cbc39 is the first bad commit
|
commit b428b09997d172f29fc201b9ab05c160ef4cbc39
|
Author: Kentoku SHIBA <kentokushiba@gmail.com>
|
Date: Thu Aug 1 08:59:53 2019 +0900
|
|
MDEV-20179 Server hangs on shutdown during installation of Spider (#1369)
|
|
|
Use LOCK_server_started, COND_server_started and mysqld_server_started for waiting starting the server
|
|
sql/mysqld.cc | 2 +-
|
storage/spider/spd_include.h | 3 +-
|
storage/spider/spd_table.cc | 119 +++++++++++++++++++------------------------
|
3 files changed, 54 insertions(+), 70 deletions(-)
|
The cause is similar to that of MDEV-27233[1], where the main thread calls spider_free_sts_threads which tries to lock spider_thread->mutex:
10.4 6cb8434 |
void spider_free_sts_threads( |
SPIDER_THREAD *spider_thread
|
) {
|
bool thread_killed; |
DBUG_ENTER("spider_free_sts_threads"); |
pthread_mutex_lock(&spider_thread->mutex);
|
thread_killed = spider_thread->killed;
|
The lock is held by another thread (let's call it Thread 3), in spider_table_bg_sts_action. As the main thread tries to lock the mutex, Thread 3 is pthread_cond_wait ing for spd_COND_server_started to be broadcasted or signaled, which would happen later in the main thread. Hence the deadlock
10.4 6cb8434 |
void *spider_table_bg_sts_action( |
void *arg |
) {
|
...
|
/* init start */ |
pthread_mutex_lock(&thread->mutex); // taking the lock |
...
|
if (!(*spd_mysqld_server_started) && !thd->killed) |
{
|
pthread_cond_wait(spd_COND_server_started, spd_LOCK_server_started); // waiting for spd_COND_server_started |
}
|
The trace of the main thread at hanging looks almost identical to that of MDEV-22979, at least for 10.4.
Looking more into it, the sequence of events is as follows (using commit fdcfc25)
1. The main thread inits the spider plugin (sql/mysqld.cc:5775)
- a number of (default 10) threads are created during the spider plugin init (spider_db_init()) to run spider_table_bg_sts_action(), later a number of (default 10) threads are created to run spider_table_bg_crd_action(). The two functions are almost identical, with the main difference being the sts function having an extra bit of init_command if statment:
if (thread->init_command)
{
...
if (!(*spd_mysqld_server_started) && !thd->killed)
{
...
if (!(*spd_mysqld_server_started) && !thd->killed)
{
pthread_cond_wait(spd_COND_server_started, spd_LOCK_server_started); // [1]
}
...
}
while (spider_init_queries[i].length && !thd->killed)
{
dispatch_command(COM_QUERY, thd, spider_init_queries[i].str,
(uint) spider_init_queries[i].length, FALSE, FALSE);
...
++i;
}
...
thread->init_command = FALSE;
pthread_cond_broadcast(&thread->sync_cond);
}
- the only thread with init_command set to true is the first sts thread, and it is only set false in the snippet above, after the server starts. If the server has not started, it continues to wait (see [1] above, where spd_COND_server_started is a pointer to cond_server_started)
2. The main thread tries to abort by calling unireg_abort() (with exit code 0) after finding wsrep_recover (line 5797)
- during the aborting, spider decides to free all the sts thread in spider_free_sts_threads(). All but one can be freed. No prize for guessing which one cannot be freed, as it holds the lock while waiting for the server to start
- judging from the backtrace at [2],
MDEV-22979is tripped at the same place, as it calls unireg_abort() too before the server starts (line 5868) - judging from the comment at [3][4],
MDEV-27233is slightly different, as it is spider_create_handler (instead of spider_free_sts_threads()) trying to lock a mutex held by the first sts thread. The problem seems more bizarre due to the unlikely and the same check (thread->init_command), making it look like an inevitable deadlock ([5] below).handler* spider_create_handler(
handlerton *hton,
TABLE_SHARE *table,
MEM_ROOT *mem_root
) {
...
SPIDER_THREAD *thread = &spider_table_sts_threads[0];
if (unlikely(thread->init_command))
{
...
/* wait for finishing init_command */
pthread_mutex_lock(&thread->mutex); // [5]
}
..
3. The server would start had none of the unireg_abort() been called (line 5935):
mysqld_server_started= 1;
|
mysql_cond_broadcast(&COND_server_started);
|
|
[2] https://jira.mariadb.org/browse/MDEV-22979?focusedCommentId=158235&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-158235
[3] https://jira.mariadb.org/browse/MDEV-27233?focusedCommentId=209160&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-209160
[4] https://jira.mariadb.org/browse/MDEV-27233?focusedCommentId=209175&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-209175
Based on this analysis, with --plugin-load-add=ha_spider.so, any call to unireg_abort() between the init of plugins (sql/mysqld.cc:5775) and the start of the server (sql/mysqld.cc:5935) will cause a hang. In this sense the present ticket is almost a duplicate of MDEV-22979, and a fix should also fix that one. It would be ideal if a fix for this ticket can also fix MDEV-27233, but it is not clear whether a fix naturally works for all three tickets, or naturally splits into two that fixes this+22979 and 27233 separately.
I have verified that the fix for mdev-27233 (c160a115b8b6dcd54bb3daf1a751ee9c68b7ee47) does still cause mdev-29904 when applied to the current 10.7 (1e04cafcba8), so the fix for this ticket should be tested against the testcase in mdev-29904:
--let restart_parameters=--plugin-load-add=ha_spider
|
--source include/restart_mysqld.inc
|
Curiously, the above testcase would fail with mysqltest failed but provided no output in ANY commit if placed under storage/spider/mysql-test/bugfix/t, but passes/fails as expected if placed under mysql-test/main.
The test failure in the previous comment is MDEV-27912.
With the patch at [1], and testing using restart_mysqld.inc, we get a better test:
--echo #
|
--echo # MDEV-30370 mariadbd hangs when running with --wsrep-recover and --plugin-load-add=ha_spider.so
|
--echo #
|
|
--let $restart_parameters=--wsrep-recover --plugin-load-add=ha_spider.so
|
--let $expect_file_name= $MYSQLTEST_VARDIR/tmp/mysqld.1.1.expect
|
--source include/restart_mysqld.inc
|
However, even though the test hangs at a commit (e.g. fdcfc25) containing the bad commit in its history, it outputs the same error as in MDEV-29904 at the parent commit (798080f4b3a) to the bad commit, instead of passing:
2023-01-18 13:15:59 0 [Warning] Changed limits: max_open_files: 1024 max_connections: 151 (was 151) table_cache: 421 (was 2000)
|
[ERROR] Unknown storage engine 'Aria'
|
2023-01-18 13:15:59 0 [ERROR] Plugin 'SPIDER' init function returned error.
|
2023-01-18 13:15:59 0 [ERROR] Plugin 'SPIDER' registration as a STORAGE ENGINE failed.
|
Update: this test does not work, see below[1].
This patch[1] extracts the running of spider init queries to its own thread (let's call it the iq thread), thus unblocking the main thread. However, the iq thread still gets stuck waiting for server to start.
[1] https://github.com/MariaDB/server/commit/d94be629ce3
To fix this, I can think of two possibilities
1. Add a condition variable to indicate that the server has exited, which is signaled / broadcasted in mysqld_exit.
2. Should the init queries really need to wait till the server starts to run? How do other storage engines run such queries? Perhaps learn from them.
I will start by looking into 2 first.
Thank you for putting quality thought into this. For #2, I am not sure if there are any such queries, though perhaps there are.
After reading and further research (thanks to nayuta-yanagisawa for the documentation in the form of chat logs , I believe these queries do need to wait till the server starts. I also noticed that Kentoku's fix for
MDEV-22979 works like Option 1 in the previous comment, except that it uses init_command as a flag instead of a condition variable, combined with a timedwait in case the server never starts. As it turns out it does fix this issue too.
I think a cleaner fix would still be separating out the init queries from the first sts thread, and applying the same sort of timedwait for server start, but I could not get a working patch based on this idea yet. Considering how long this issue has been hanging around, perhaps it is better to apply Kentoku's patch and close this issue as well as MDEV-22979 for now, and refactor later.
The patch also does not fix MDEV-27233, which as mentioned is a related but different scenario.
I can confirm the patch also fixes MDEV-22979, as expected.
holyfoot Can you review the patch at https://github.com/MariaDB/server/commit/ef1161e5d4f? Thanks.
The test using `restart_mysqld.inc` mentioned in a previous comment[1] would not pass even with the patch. And it will fail with "mysqltest failed but provided no output". It would pass if we remove the `--wsrep-recover`. Based on this I think it is caused by mtr expecting the server to be alive after the restart, even though `--wsrep-recover` means the server should exit immediately. This is the same problem as if one writes a test case by using a `.opt` file only:
mdev_30370_wont_work.opt |
--wsrep-recover --plugin-load-add=ha_spider.so
|
mdev_30370_wont_work.test |
--echo #
|
--echo # MDEV-30370 mariadbd hangs when running with --wsrep-recover and --plugin-load-add=ha_spider.so
|
--echo #
|
Running this test will fail with something like
worker[1] mysql-test-run: WARNING: Process [mysqld.1.1 - pid: 3316427, winpid: 3316427] died after mysql-test-run waited 2.3 seconds for /home/ycp/source/mariadb-server/mdev-30370/build/mysql-test/var/run/mysqld.1.1.pid to be created.
|
Therefore the only working test we have is still the one calling `$MYSQLD_BOOTSTRAP_CMD` as in the first comment[2]. Though that test may hang (e.g. like in this ticket), taking 15 minutes to die if mtr is run with the default timeout.
[1] https://jira.mariadb.org/browse/MDEV-30370?focusedCommentId=247747&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-247747
[2] https://jira.mariadb.org/browse/MDEV-30370?focusedCommentId=246822&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-246822
I tested this patch for MDEV-22979 in 10.5 (eeb8ebb1520), and it still hangs there. However, it works for the present issue in 10.5. I will test this patch for a few more versions.
Thanks holyfoot for the review.
I have done tests and can confirm the patch fixes the present issue for 10.6-11.0.
Pushing this patch is currently blocked by MDEV-27912, as the test for MDEV-29904 in the patch fails with "[ERROR] Server shutdown in progress" when the patch is applied to 10.8 a5b30158d4d, and I believe an MDEV-27912 patch for restart_mysqld.inc would fix that.
Pushed 284810b3e89. I removed the test for MDEV-29904 in the commit so that it does not block the push. I also tested MDEV-29904 in a testcase placed in the main suite:
mysql-test/main/mdev_29904.test |
--echo #
|
--echo # MDEV-29904 SPIDER plugin initialization fails upon startup
|
--echo #
|
|
--let $restart_parameters=--plugin-load-add=ha_spider
|
--source include/restart_mysqld.inc
|
Pushed a trivial fix to the test to 10.4 https://github.com/MariaDB/server/commit/428c7964a23
mtr testcase:
--echo #
--echo # MDEV-30370 mariadbd hangs when running with --wsrep-recover and --plugin-load-add=ha_spider.so
--echo #
--exec $MYSQLD_BOOTSTRAP_CMD --wsrep-recover --plugin-load-add=ha_spider.so
Note that to debug the hang itself, one should put the flags in an .opt file instead, without execing the bootstrap cmd.