[MDEV-24199] MariaDB Server fails to write a core x out of y times Created: 2020-11-12 Updated: 2023-11-11 |
|
| Status: | Stalled |
| Project: | MariaDB Server |
| Component/s: | Debug |
| Affects Version/s: | 10.5, 10.6 |
| Fix Version/s: | 10.5 |
| Type: | Bug | Priority: | Major |
| Reporter: | Roel Van de Paar | Assignee: | Daniel Black |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | affects-tests | ||
| Issue Links: |
|
| Description |
|
Loosely defining this bug, as there is a
Combining all current thoughts (Marko/Roel):
Numerous attempts at clarifying the issue further have failed. Current summary (Marko+Roel): there is enough information to log a bug, but not enough to find a fix. This issue is logged in the hope that others have experiences other than what is already mentioned above, so that this can get fixed.
Further thoughts
|
| Comments |
| Comment by Marko Mäkelä [ 2020-11-12 ] |
|
For the record, the problematic SIGKILLs are identified by this patch. However, applying this patch will cause many tests to fail; I have not investigated the causes. My main motivation for using this patch has been to get complete rr record traces. Apparently, in some of these cases mtr kills not only the mysqld or mariadbd process but also the rr process.
The first hunk may be unnecessary. That code has been present since 10.3; it makes the shutdown_server command send SIGABRT on the initial timeout and SIGKILL 5 seconds after that. I did not check whether that 5-second timeout is actually insufficient. I think that we should add a timeout mechanism to each of these SIGKILLs, so that each is preceded by SIGABRT. |
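The escalation described above (SIGABRT first, SIGKILL only after a grace period) can be sketched as follows. This is a minimal illustration in Python, not the mtr code (mtr itself is written in Perl); the function name `stop_with_core_attempt` and the `grace_seconds` parameter are hypothetical, and the 5-second default simply mirrors the timeout mentioned in the comment. It assumes a POSIX system.

```python
import signal
import subprocess
import sys


def stop_with_core_attempt(proc, grace_seconds=5.0):
    """Send SIGABRT, then SIGKILL only if the process outlives the grace period.

    Returns the name of the last signal sent, or None if the process
    had already exited. (Hypothetical helper, for illustration only.)
    """
    if proc.poll() is not None:
        return None  # already exited, nothing to signal
    # Default action for SIGABRT is to terminate and, if permitted, dump core.
    proc.send_signal(signal.SIGABRT)
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGABRT"
    except subprocess.TimeoutExpired:
        # SIGKILL cannot be caught; a process killed this way writes no core.
        proc.kill()
        proc.wait()
        return "SIGKILL"


if __name__ == "__main__":
    # Stand-in for a server process: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_with_core_attempt(child), child.returncode)
```

If the grace period is too short for a large process to finish dumping under heavy load, the SIGKILL branch would cut the core dump off, which is the concern raised above about whether 5 seconds is sufficient.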
| Comment by Roel Van de Paar [ 2020-11-12 ] |
|
With thanks to wlad: --skip-stack-trace (which negates the default-on --stack-trace) will skip writing a stack trace to stderr. It may or may not provide a workaround for the issue. To be tested. |
| Comment by Roel Van de Paar [ 2021-04-02 ] |
|
A heavily overloaded server (load average 500+) may cause the core not to be written. |
| Comment by Roel Van de Paar [ 2021-04-02 ] |
|
In both cases (a core file is written, or a core file is not written), the output in the error log is the same:
Yet one run will produce a core and another will not (currently on a machine with a high load average, > 500), even though plenty of space is available. |
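Since the error log looks identical in both cases, it can help to first rule out configuration causes before suspecting load. The following is a minimal sketch, assuming Linux, that reports the core size limit and the kernel core pattern for the calling process; resource limits are inherited by child processes, so it is meant to be run from the script or shell that starts the server. The function name `core_dump_settings` is hypothetical, and this check does not by itself explain the intermittent behaviour described above.

```python
import resource
from pathlib import Path


def core_dump_settings():
    """Collect settings that decide whether and where a core can be written."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        # Linux-specific: a filename template, or "|program" for a helper.
        pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
    except OSError:
        pattern = None  # not Linux, or /proc is not readable
    return {
        "soft_limit": soft,
        "hard_limit": hard,
        "unlimited": soft == resource.RLIM_INFINITY,
        "core_pattern": pattern,
        # A leading "|" means the kernel pipes the core to a helper program.
        "piped_to_helper": bool(pattern) and pattern.startswith("|"),
    }


if __name__ == "__main__":
    for key, value in core_dump_settings().items():
        print(f"{key}: {value}")
```

A soft limit of 0 means no core is written at all, and a piped core pattern means the core is handed to a helper program rather than written directly by the kernel.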
| Comment by Alexey Bychko (Inactive) [ 2021-04-02 ] |
|
I think a load average of 500+ is not normal. The higher the load average, the longer the timeout that is needed. The timeout is not a configurable parameter, so we only have the compiled-in value, and it cannot be infinite. |
| Comment by Roel Van de Paar [ 2021-04-02 ] |
|
It has thus far not been shown that there actually is such a hardcoded timeout, though it seems there is one. The one mentioned by Marko above relates to MTR only. |
| Comment by Roel Van de Paar [ 2021-04-14 ] |
|
With thanks to danblack, added this to the build script for temporary testing (i.e. to disable in-code signal handling):
|
| Comment by Roel Van de Paar [ 2021-04-17 ] |
|
At first look, implementing this change has at least not made things worse, and has possibly improved things. |
| Comment by Roel Van de Paar [ 2021-04-27 ] |
|
UniqueIDs have proven stable post-patch. |
| Comment by Roel Van de Paar [ 2021-05-04 ] |
|
Lowered the priority now, as the situation seems better, if not 99% OK, with the post-patch approach. Not fully OK, as a missing core was caught at least once. Also made framework improvements to cater for both patched and non-patched builds. It would be great if, in time, MD could be compared with MS/PS to see whether their core dump solutions could be ported or ours could be adjusted to match theirs. |
| Comment by Roel Van de Paar [ 2023-11-08 ] |
|
Though we are patching (using Daniel's patch) for testing, which resolves the issue for testing, the underlying bug still exists. danblack, do you have further thoughts on how the code in sql/mysqld.cc could be improved to ensure cores are written correctly more often (i.e. for users, without applying this patch)? |
| Comment by Marko Mäkelä [ 2023-11-09 ] |
|
I think that something similar to the second hunk of danblack’s patch was implemented in this commit in August 2022. |
| Comment by Roel Van de Paar [ 2023-11-11 ] |