[MDEV-11947] InnoDB purge workers fail to shut down Created: 2017-01-31 Updated: 2017-02-09 Resolved: 2017-02-06 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server, Storage Engine - InnoDB, Tests |
| Affects Version/s: | 10.0, 10.2 |
| Fix Version/s: | 10.0.30, 10.1.22, 10.2.4 |
| Type: | Bug | Priority: | Blocker |
| Reporter: | Elena Stepanova | Assignee: | Marko Mäkelä |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | 10.2-rc | ||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Description |
|
Update: The failure can appear in buildbot like this:
The test restarts the slave server as part of the test flow.
The problem is reproducible on the corresponding VMs (e.g. vm-centos5-amd64-install), not reliably, but easily enough by executing the test with --repeat. I ran it with --repeat=20 --force-restart and hit the problem nearly every time. Increasing the timeout does not help. I could not reproduce it without InnoDB enabled on the slave. A stack trace of all threads from the hanging server is attached. The stack trace does not change within a reasonable time.
This is most likely the same problem in innodb_fts.innodb_fts_plugin, now without a slave:
innodb_zip.innochecksum_3 has a different presentation, but probably the same root cause:
It also restarts the server, and on one of the restarts the server hangs on shutdown. |
| Comments |
| Comment by Marko Mäkelä [ 2017-01-31 ] | |||||||||||||
|
The innodb_zip.innochecksum_3 failure should not be a direct consequence of any hang. The test is shutting down the server, and after that it is invoking innochecksum --write=none to rewrite the dummy page checksums. It does not matter if the server was forcibly killed (well, except if the forced kill happened to cause a corrupted page, because innochecksum does not use any doublewrite buffer). Curiously, the innochecksum_3.test sometimes assigns restart_options, sometimes $restart_parameters. Does restart_options have any effect? The page numbers (space 0, page 190 and 191) are the very last 2 pages of the InnoDB doublewrite buffer in the system tablespace when using the default innodb_page_size=16k.
So, it seems that we should either suppress the warning or simply zero out all doublewrite buffer pages and disable the doublewrite buffer while running this test. | |||||||||||||
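The page numbers in the comment above can be checked with a little arithmetic. This is only a sketch: the extent size, the number of doublewrite blocks, and the starting page are assumptions about the usual layout of a freshly created system tablespace at innodb_page_size=16k, not values taken from this report, and doublewrite_page_range is a hypothetical helper.

```python
# Sketch of the doublewrite buffer page range at the default 16k page size.
# All constants except PAGE_SIZE are assumptions about the usual layout.
PAGE_SIZE = 16 * 1024        # default innodb_page_size, as stated in the comment
EXTENT_BYTES = 1024 * 1024   # assumption: one extent is 1 MiB at 16k pages
BLOCKS = 2                   # assumption: the buffer consists of two extent-sized blocks
FIRST_PAGE = 64              # assumption: the first block starts at page 64 of space 0


def doublewrite_page_range():
    """Return (first, last) page numbers of the assumed doublewrite area."""
    pages_per_extent = EXTENT_BYTES // PAGE_SIZE   # 64 pages per extent
    last = FIRST_PAGE + BLOCKS * pages_per_extent - 1
    return FIRST_PAGE, last


first, last = doublewrite_page_range()
print(last - 1, last)   # prints "190 191"
```

Under these assumptions the area spans pages 64 to 191, so pages 190 and 191 are its last two pages, which matches the pages named in the comment.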
| Comment by Elena Stepanova [ 2017-01-31 ] | |||||||||||||
|
marko, | |||||||||||||
| Comment by Marko Mäkelä [ 2017-01-31 ] | |||||||||||||
|
The doublewrite buffer is not emptied on a shutdown. Instead, it is emptied on startup after any doublewrite recovery has been done. I think that a reasonable fix for the innochecksum tests would be to simply disable the InnoDB doublewrite buffer. In that case, those pages should be all zero, and there would be no problem (well, except if the forced kill happened to occur during a write request). | |||||||||||||
| Comment by Marko Mäkelä [ 2017-02-02 ] | |||||||||||||
|
Let us look at the shutdown hang (forced kill+restart after 61 seconds) mentioned in the Description:
Normally InnoDB should display a message early in the shutdown process:
But there is no "InnoDB: Starting shutdown..." in the log, and the global variables also suggest that shutdown was never initiated:
| |||||||||||||
| Comment by Marko Mäkelä [ 2017-02-02 ] | |||||||||||||
|
The problem is that srv_purge_coordinator_thread() exited without ensuring that all worker threads actually exited. In the core dump that I examined, one worker thread is waiting for an event, which apparently was lost, and the protocol is failing to account for lost signals. Ever since We must make the purge subsystem more robust, so that signals will never be lost. This problem could explain the intermittent failures of the MySQL 5.7 test innodb.index_merge_threshold. | |||||||||||||
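The lost-signal failure described above can be illustrated with a small sketch. It is not the InnoDB code and not the committed fix; PurgeQueue, worker, and coordinator are hypothetical names. The idea is that a wakeup should be recorded in state that the waiter re-checks under a lock, so that a signal sent before the waiter starts waiting is not lost, and that the coordinator should join every worker before it exits.

```python
import threading


class PurgeQueue:
    """Hypothetical stand-in for a purge work queue (not InnoDB code)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0        # tasks not yet claimed by a worker
        self._shutdown = False   # persistent flag, survives a missed notify

    def submit(self, n=1):
        with self._cond:
            self._pending += n
            self._cond.notify_all()

    def request_shutdown(self):
        with self._cond:
            self._shutdown = True
            self._cond.notify_all()

    def wait_for_work(self):
        """Claim one task and return True, or return False on shutdown."""
        with self._cond:
            # wait_for() checks the predicate before sleeping, so a signal
            # sent before this call is still observed through the state.
            self._cond.wait_for(lambda: self._pending > 0 or self._shutdown)
            if self._pending > 0:
                self._pending -= 1
                return True
            return False


def worker(queue, done, lock):
    while queue.wait_for_work():
        with lock:
            done[0] += 1


def coordinator(n_workers, n_tasks):
    """Run the workers and return (tasks completed, workers still alive)."""
    queue = PurgeQueue()
    done, lock = [0], threading.Lock()
    # Signal before any worker exists: the worst case for a lost wakeup.
    queue.submit(n_tasks)
    queue.request_shutdown()
    workers = [threading.Thread(target=worker, args=(queue, done, lock))
               for _ in range(n_workers)]
    for t in workers:
        t.start()
    # Do not exit until every worker has exited.
    for t in workers:
        t.join(timeout=5)
    return done[0], sum(t.is_alive() for t in workers)


print(coordinator(4, 10))   # prints "(10, 0)"
```

Because the shutdown request is kept as a flag rather than as a one-shot event, the workers still drain the queue and exit even though both signals were sent before any of them started waiting.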
| Comment by Marko Mäkelä [ 2017-02-02 ] | |||||||||||||
|
https://github.com/MariaDB/server/commit/49de5997ef6dd9f3b97d2c2ea81dc50170f929c2 | |||||||||||||
| Comment by Jan Lindström (Inactive) [ 2017-02-03 ] | |||||||||||||
|
ok to push. | |||||||||||||
| Comment by Marko Mäkelä [ 2017-02-03 ] | |||||||||||||
|
I will try to backport this to 10.0, because I think that purge is sometimes ‘stuck’ for no apparent reason there. A natural explanation would be a lost signal. | |||||||||||||
| Comment by Marko Mäkelä [ 2017-02-03 ] | |||||||||||||
|
The list of unstable tests in
| |||||||||||||
| Comment by Marko Mäkelä [ 2017-02-03 ] | |||||||||||||
|
Note: innodb.innodb-get-fk in 10.0 never actually started the server as --innodb-read-only. The test could be a better telltale sign of a shutdown hang on 10.1, where the test could fail to restart the server because --innodb-read-only would refuse to restart the server in certain cases (after | |||||||||||||
| Comment by Marko Mäkelä [ 2017-02-03 ] | |||||||||||||
|
10.0 patch: https://github.com/MariaDB/server/commit/e1ad0e5a3f1c2bdd11499166ff8f14f763878cfb | |||||||||||||
| Comment by Jan Lindström (Inactive) [ 2017-02-06 ] | |||||||||||||
|
ok to push. |