[MDEV-30370] mariadbd hangs when running with --wsrep-recover and --plugin-load-add=ha_spider.so Created: 2023-01-09 Updated: 2023-12-07 Resolved: 2023-01-25 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - Spider |
| Affects Version/s: | 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11, 11.0 |
| Fix Version/s: | 10.11.2, 11.0.1, 10.4.28, 10.5.19, 10.6.12, 10.7.8, 10.8.7, 10.9.5, 10.10.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Yuchen Pei | Assignee: | Yuchen Pei |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | not-10.3 | ||
| Issue Links: |
|
| Description |
|
When mariadbd is run with --wsrep-recover, it is expected to exit. With --plugin-load-add=ha_spider.so added, it does not exit. Because mariadbd is called with --wsrep-recover indirectly as part of ExecStartPre in the systemd service, this can cause a hang on systemd start or restart. |
| Comments |
| Comment by Yuchen Pei [ 2023-01-10 ] | ||||||||||||||||||||||||||||||||||||||||||
|
mtr testcase:
Note that to debug the hang itself, one should instead put the flags in an .opt file, without exec-ing the bootstrap command. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-10 ] | ||||||||||||||||||||||||||||||||||||||||||
|
The problem occurs in 10.4+, but not in 10.3. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-10 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Result from git-bisect, which is also the first bad commit for
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-11 ] | ||||||||||||||||||||||||||||||||||||||||||
|
The cause is similar to that of
The lock is held by another thread (let's call it Thread 3), in spider_table_bg_sts_action. While the main thread tries to lock the mutex, Thread 3 is in pthread_cond_wait, waiting for spd_COND_server_started to be broadcast or signaled, which would only happen later in the main thread. Hence the deadlock.
| ||||||||||||||||||||||||||||||||||||||||||
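The deadlock described above can be reproduced in miniature. The following is only a Python model of the pattern, not Spider code: `table_mutex` is a hypothetical stand-in for "the lock", a `threading.Event` stands in for `spd_COND_server_started`, and the main thread uses a timed acquire so that the demonstration terminates (the real server blocks without a timeout).

```python
import threading


def simulate_abort_during_startup(wait_seconds=0.2):
    """Return True if the main thread got the mutex, False if it would hang."""
    table_mutex = threading.Lock()       # hypothetical stand-in for "the lock"
    server_started = threading.Event()   # stand-in for spd_COND_server_started
    holding = threading.Event()          # tells main that the mutex is taken

    def background_sts():
        # Models Thread 3: takes the mutex, then waits for server start
        # while still holding it.
        with table_mutex:
            holding.set()
            server_started.wait()

    worker = threading.Thread(target=background_sts, daemon=True)
    worker.start()
    holding.wait()

    # Models the main thread on the abort path: it needs the mutex before it
    # ever reaches the point where it would signal server start.
    acquired = table_mutex.acquire(timeout=wait_seconds)

    # Clean-up for the demo only: release the worker so the script can exit.
    server_started.set()
    worker.join()
    if acquired:
        table_mutex.release()
    return acquired


if __name__ == "__main__":
    print("main thread acquired the mutex:", simulate_abort_during_startup())
```

In this model the timed acquire always fails, because the only event that would release the worker is set after the acquire attempt, mirroring the ordering in the main thread.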
| Comment by Yuchen Pei [ 2023-01-11 ] | ||||||||||||||||||||||||||||||||||||||||||
|
The trace of the main thread at hanging looks almost identical to that of | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-13 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Looking more into it, the sequence of events is as follows (using commit fdcfc25):
1. The main thread inits the spider plugin (sql/mysqld.cc:5775)
2. The main thread tries to abort by calling unireg_abort() (with exit code 0) after finding wsrep_recover (line 5797)
3. The server would start had none of the unireg_abort() been called (line 5935):
[2] https://jira.mariadb.org/browse/MDEV-22979?focusedCommentId=158235&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-158235
Based on this analysis, with --plugin-load-add=ha_spider.so, any call to unireg_abort() between the init of plugins (sql/mysqld.cc:5775) and the start of the server (sql/mysqld.cc:5935) will cause a hang. In this sense the present ticket is almost a duplicate of | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-17 ] | ||||||||||||||||||||||||||||||||||||||||||
|
I have verified that the fix for MDEV-27233 (c160a115b8b6dcd54bb3daf1a751ee9c68b7ee47) still causes MDEV-29904 when applied to the current 10.7 (1e04cafcba8), so the fix for this ticket should be tested against the testcase in MDEV-29904:
Curiously, the above testcase would fail with "mysqltest failed but provided no output" in ANY commit if placed under storage/spider/mysql-test/bugfix/t, but passes/fails as expected if placed under mysql-test/main. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-18 ] | ||||||||||||||||||||||||||||||||||||||||||
|
The test failure in the previous comment is With the patch at [1], and testing using restart_mysqld.inc, we get a
However, even though the test hangs at a commit (e.g. fdcfc25) containing the bad commit in its history, it outputs the same error as in
Update: this test does not work, see below[1]. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-18 ] | ||||||||||||||||||||||||||||||||||||||||||
|
This patch[1] extracts the running of spider init queries to its own thread (let's call it the iq thread), thus unblocking the main thread. However, the iq thread still gets stuck waiting for the server to start.
[1] https://github.com/MariaDB/server/commit/d94be629ce3
To fix this, I can think of two possibilities:
1. Add a condition variable to indicate that the server has exited, which is signaled or broadcast in mysqld_exit.
I will start by looking into 2 first. | ||||||||||||||||||||||||||||||||||||||||||
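Possibility 1 can be illustrated with a small Python model. This is a sketch under assumptions, not the patch: `ServerState` and its method names are hypothetical, and it folds "server started" and "server exiting" into one condition with two flags, which is a simplification of adding a separate condition variable.

```python
import threading


class ServerState:
    """Hypothetical model: one condition guarding two flags."""

    def __init__(self):
        self._cond = threading.Condition()
        self._started = False
        self._exiting = False

    def mark_started(self):
        with self._cond:
            self._started = True
            self._cond.notify_all()

    def mark_exiting(self):
        # Models the broadcast that would be done on the exit path.
        with self._cond:
            self._exiting = True
            self._cond.notify_all()

    def wait_until_started_or_exiting(self):
        with self._cond:
            self._cond.wait_for(lambda: self._started or self._exiting)
            return "started" if self._started else "exiting"


def init_queries_thread(state, log):
    # Models the iq thread: only run the init queries if the server started.
    if state.wait_until_started_or_exiting() == "started":
        log.append("ran init queries")
    else:
        log.append("skipped init queries")


def demo(abort):
    state, log = ServerState(), []
    worker = threading.Thread(target=init_queries_thread, args=(state, log))
    worker.start()
    if abort:
        state.mark_exiting()
    else:
        state.mark_started()
    worker.join(timeout=5)
    return log


if __name__ == "__main__":
    print(demo(abort=True))
    print(demo(abort=False))
```

The point of the design is that the waiter is woken on both outcomes, so an early abort no longer leaves it blocked forever.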
| Comment by Roel Van de Paar [ 2023-01-18 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Thank you for putting quality thought into this. For #2, I am not sure if there are any such queries, though perhaps there are. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-19 ] | ||||||||||||||||||||||||||||||||||||||||||
|
After reading and further research (thanks to nayuta-yanagisawa for the documentation in the form of chat logs), I think a cleaner fix would still be to separate out the init queries from the first sts thread and apply the same sort of timedwait for server start, but I have not been able to get a working patch based on this idea yet. Considering how long this issue has been hanging around, perhaps it is better to apply Kentoku's patch and close this issue as well as The patch also does not fix | ||||||||||||||||||||||||||||||||||||||||||
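The timed-wait idea can be sketched as follows. This is a Python model only: the function name, the `abort_requested` flag and the polling interval are assumptions for illustration, and `threading.Event.wait(timeout)` stands in for a `pthread_cond_timedwait` loop. It does not reproduce Kentoku's patch.

```python
import threading


def wait_for_server_start(started, abort_requested, poll_interval=0.05):
    """Return True once the server has started, False if an abort came first."""
    while True:
        # Timed wait: wake up periodically even if nobody signals us.
        if started.wait(timeout=poll_interval):
            return True
        # Not started yet; give up if the server is shutting down instead.
        if abort_requested.is_set():
            return False


def demo_timed_wait(abort):
    started = threading.Event()
    abort_requested = threading.Event()
    result = []
    worker = threading.Thread(
        target=lambda: result.append(
            wait_for_server_start(started, abort_requested)
        )
    )
    worker.start()
    if abort:
        abort_requested.set()
    else:
        started.set()
    worker.join(timeout=5)
    return result[0]


if __name__ == "__main__":
    print("abort path returned:", demo_timed_wait(abort=True))
    print("start path returned:", demo_timed_wait(abort=False))
```

Unlike an explicit exit signal, this variant needs no extra broadcast on the exit path, at the cost of the waiter noticing the abort only after the current polling interval elapses.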
| Comment by Yuchen Pei [ 2023-01-19 ] | ||||||||||||||||||||||||||||||||||||||||||
|
I can confirm the patch also fixes
holyfoot, can you review the patch at https://github.com/MariaDB/server/commit/ef1161e5d4f? Thanks. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-20 ] | ||||||||||||||||||||||||||||||||||||||||||
|
The test using `restart_mysqld.inc` mentioned in a previous comment[1] would not pass even with the patch; it fails with "mysqltest failed but provided no output". It passes if we remove `--wsrep-recover`. Based on this, I think the failure is caused by mtr expecting the server to be alive after the restart, even though `--wsrep-recover` means the server should exit immediately. This is the same problem as when one writes a test case using only a `.opt` file:
Running this test will fail with something like
Therefore the only working test we have is still the one calling `$MYSQLD_BOOTSTRAP_CMD` as in the first comment[2]. Though that test may hang (e.g. like in this ticket), taking 15 minutes to die if mtr is run with the default timeout.
[1] https://jira.mariadb.org/browse/MDEV-30370?focusedCommentId=247747&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-247747 | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-20 ] | ||||||||||||||||||||||||||||||||||||||||||
|
I tested this patch for | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Alexey Botchkov [ 2023-01-23 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Kentoku's patch mentioned above looks OK to push. | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-24 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Thanks holyfoot for the review. I have done tests and can confirm the patch fixes the present issue for 10.6-11.0. Pushing this patch is currently blocked by | ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-01-25 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Pushed 284810b3e89. I removed the test for
| ||||||||||||||||||||||||||||||||||||||||||
| Comment by Yuchen Pei [ 2023-06-29 ] | ||||||||||||||||||||||||||||||||||||||||||
|
Pushed a trivial fix for the test to 10.4: https://github.com/MariaDB/server/commit/428c7964a23 |