[MDEV-8540] Crash on server shutdown since 10.0.16 Created: 2015-07-24 Updated: 2018-07-02 Resolved: 2018-06-27 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server |
| Affects Version/s: | 10.0 |
| Fix Version/s: | 5.5.61, 10.0.36, 10.1.35, 10.2.17, 10.3.8 |
| Type: | Bug | Priority: | Minor |
| Reporter: | Hoang Xuan Phu | Assignee: | Sergey Vojtovich |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Description |
|
This problem was first found in 10.0.20, and I bisected it back to 10.0.16. Every time the server is shut down, it crashes. I am able to reproduce this with an empty data directory.
I attached the config file in use. MariaDB is built from source. The OS is NixOS. |
| Comments |
| Comment by Elena Stepanova [ 2015-07-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Please specify your environment and the packages you use. Thanks. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-07-25 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst, I updated with the information. Thanks for looking. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-07-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, If you are building from source anyway, could you maybe build a debug server and try to reproduce the crash there? (And attach the error log with the debug stack trace). I don't have NixOS handy, and I'm not getting the crash on Wheezy. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-07-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst, I intended to do that but got carried away. Will try to get that. Is this the right way to do it? https://mariadb.com/kb/en/mariadb/compiling-mariadb-for-debugging/ | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-07-30 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
phunehehe, yes, it's fine. Of course, you need to look at the "MariaDB starting with 5.5" section, the one with cmake. Thanks. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-08-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst: the situation is a bit more complicated than I first thought. The crash doesn't occur when mariadb is started in a standalone process and stopped with kill. We have a systemd service that launches an angel process that in turn launches mariadb (yes, we use a supervisor to supervise another supervisor Here is the crash log when running a debug build.
I'm not sure if the "bytes lost" makes any sense. Strangely, the trace doesn't look much different from previously. I'm pretty sure it's a debug build though.
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-08-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
phunehehe, is there any specifics in how the angel process starts mysqld? Can you quote the command line as it's seen in the ps -ef | grep mysqld output? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-08-07 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst, here is the process as shown by ps -ef:
I did run mariadb with the exact same command, but that way it's not crashing. I'm not really sure how else to replicate. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-08-07 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
phunehehe, did you run it as mysql user? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-08-07 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst: yes I did, using sudo. To be specific
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-09-05 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I don't have a NixOS machine with development environment, and even if I did, it's very unlikely everything will be installed and configured exactly the same way as on yours, which might be important in order to reproduce the problem. Is it your production system? If it's a test server, would you be able to provide us a temporary access to it, so we could investigate it at the spot? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-09-08 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst I can replicate this in a development system. It should be possible to deploy to a temporary system where you can investigate. Can I have your public SSH key? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-09-08 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is my public key: https://launchpad.net/~elenst/+sshkeys | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-09-10 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
(I'm preparing the temporary system, but it's taking some time because I want to place it in an isolated network) | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-09-17 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst, I have sent an email to the address in your profile with details of the system. Thanks. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2015-09-21 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
phunehehe, thanks a lot for providing the environment, I got what I needed, you can remove my keys now. The problem is only remotely related to the angel process or other nested watchdogs/wrappers. One is important though. Among other things, it does
That is, it closes the input. Note: In fact, there are at least three problems caused by this stdin redirection: if binary log is not enabled, either InnoDB or Aria throws errors about bad file descriptors. Aria does it on shutdown, while InnoDB might do it on startup and fail to start. Because of that, I am not sure how important it is to fix the issue with binary logging only, although it would be nice if it did something more elegant than crash. To reproduce, start the server as
wait until it starts and then shut it down.
Could only reproduce the crash on 10.0. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2015-09-21 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
elenst that was fantastic! | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2018-06-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
There're two problems here. Crash itself happens because we're trying to access current_thd->variables.lc_messages->errmsgs->errmsgs, whereas current_thd is NULL, when we attempt to report an error. This affects 5.5+. But the fact that we come across this error is another bug. Since 10.0 we do fclose(stdin) after binlog index file has been opened. If we were started with stdin closed, binlog index file gets 0 fd and then it gets closed. On shutdown we attempt to close binlog index file again and second close fails. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2018-06-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
In 10.1 things are got even more complicated with the following patch:
Looks like an incorrect attempt to workaround this problem. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Sergey Vojtovich [ 2018-06-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
sql_print_error() bug fixed in 5.5+. fclose() bug fixed in 10.0+. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Hoang Xuan Phu [ 2018-06-28 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
<3 |