Status: Stalled
Loosely defining this bug, as there is a
- We take as a baseline any machine which produces cores correctly most of the time, but fails to produce one for x out of y crashes
- "x out of y": for approximately 1 out of every 7 to 15 crashes, depending on the scenario, no core is written. The rate is higher on high-load machines.
- I have clearly observed this issue since I started working on MariaDB. It is not new/recent
- The problem does not exist in MySQL Server nor in Percona Server
- Marko has clearly observed the issue in his work as well
- The issue impacts testing and test reporting stability
- Issue likely present in all versions
Combining all current thoughts (Marko/Roel):
- For clarity: the issue happens both without MTR (see Roel's CLI-based notes below) and with MTR (see Marko's MTR-based notes below)
- Marko+Roel: any tuning (like ulimit -c unlimited and all other server tuning, and any core dump config in /etc/sysctl.conf) makes zero difference. There may have been a very small improvement (if any) from setting a correct core pattern in /etc/sysctl.conf, but the issue remains.
- Roel: I have seen the issue happen plenty of times when only the CLI and the crashing SQL were involved, with no mysqladmin shutdown nor any KILLs present. Execute the crashing SQL at the CLI prompt, exit the client quickly, check for a core (with mysqld clearly crashed as per the error log), and no core file is present. Try one or more repeats and the core dump will be there
- Roel: the issue (i.e. no core file generated) seems to be more pronounced when exiting the CLI quickly after executing some crashing SQL, which seems odd given that core dumps would be mysqld bound, not mysql bound. Perhaps some "client hold/lock/trigger/status update" exists and affects core dump writing
- Roel: core dump writing either works or doesn't work, in this way: if the core is generated, it is generated correctly as a whole; if the core is not generated, the file simply doesn't exist. No half-written files exist, which seems to somewhat negate my last point above, unless some "client based trigger" needs to hit mysqld before a given timeout/situation (likely, based on what I have seen), i.e. it is a "status" which mysqld needs from the client rather than a "lock" which requires a constant client connection
- Roel: IOW, there seems to be some sort of "delay" before a core is written, as described above. Perhaps it is best described not as a real delay, but as a "trigger", or "client hold/lock/trigger/status update" as described above.
- Marko: often seen in combination with MTR aborting execution (without presenting any summary) after too many test failures. (But, that ought to be fixed in 10.2 this week.)
- Marko: there are also 3 SIGKILLs in MTR that I suspect can ruin not only the core dump writes but also rr record runs (by killing the rr process)
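To make the "tuning makes zero difference" observation above easier to verify on a given machine, a minimal Python sketch along these lines could record the core dump settings in effect before a run. It is not part of any MariaDB tooling; the function name core_dump_settings and the proc_root parameter are made up for illustration. Note that it reports the limits of the calling process, which mysqld only inherits if it is started from the same shell or session.

```python
import resource
from pathlib import Path


def core_dump_settings(proc_root="/proc"):
    """Return the core-dump-related settings visible to this process.

    proc_root is a hypothetical parameter, present only so the procfs
    location can be overridden; on Linux the default is correct.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)

    def fmt(value):
        # RLIM_INFINITY corresponds to "ulimit -c unlimited"
        return "unlimited" if value == resource.RLIM_INFINITY else value

    pattern_file = Path(proc_root) / "sys" / "kernel" / "core_pattern"
    try:
        pattern = pattern_file.read_text().strip()
    except OSError:
        pattern = None  # not Linux, or procfs not readable

    return {
        "soft_limit": fmt(soft),
        "hard_limit": fmt(hard),
        "core_pattern": pattern,
        # A leading "|" means the kernel pipes the core to a helper program
        "piped_to_helper": pattern is not None and pattern.startswith("|"),
    }


if __name__ == "__main__":
    for key, value in core_dump_settings().items():
        print(f"{key}: {value}")
```

Logging this output next to each missing-core occurrence would at least confirm that the limits and core pattern were identical between runs that did and did not produce a core.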
Numerous attempts at clarifying the issue further have failed.
Current summary (Marko+Roel): good enough info to now log a bug, but not good enough info to find a fix.
This issue is logged in the hope that others have experiences other than what is already mentioned above, and in the hope of getting this fixed.
- Perhaps a script which quickly brings up the server, crashes it with SQL at the CLI, exits immediately from the CLI, then counts the number of cores written and loops, may be able to provide a better x out of y ratio, but it may not help with finding the real cause
- relates to
MDEV-24217 Add --invoke-on-crash option to mysqld allowing better and non-failing debugging traces
MDEV-25330 fflush(stderr) call improvement in signal_handler.cc
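The counting script suggested above (bring up the server, crash it, exit the client, check for a core, loop) could be sketched in Python roughly as follows. This is only a skeleton under stated assumptions: start_server, crash_server, core_exists and cleanup are hypothetical hooks that the tester must supply (for example as subprocess calls to a local mysqld and mysql client), and the wait and poll values are arbitrary defaults, not measured ones.

```python
import time


def count_missing_cores(start_server, crash_server, core_exists, cleanup,
                        runs=100, wait_seconds=5.0, poll_interval=0.5):
    """Crash the server `runs` times; return (missing_cores, runs).

    All four callables are hypothetical hooks supplied by the caller:
      start_server() - bring up a fresh server instance
      crash_server() - run the crashing SQL and exit the client immediately
      core_exists()  - return True if a core file is present for this run
      cleanup()      - remove the core and data directory before the next run
    """
    missing = 0
    for _ in range(runs):
        start_server()
        crash_server()
        # Poll for a while, since the core may not appear instantly
        deadline = time.monotonic() + wait_seconds
        found = core_exists()
        while not found and time.monotonic() < deadline:
            time.sleep(poll_interval)
            found = core_exists()
        if not found:
            missing += 1
        cleanup()
    return missing, runs


if __name__ == "__main__":
    # Dry run with fake hooks: every 10th crash "loses" its core
    counter = {"n": 0}

    def fake_start():
        counter["n"] += 1

    missing, runs = count_missing_cores(
        fake_start, lambda: None, lambda: counter["n"] % 10 != 0,
        lambda: None, runs=50, wait_seconds=0, poll_interval=0)
    print(f"{missing} out of {runs} crashes produced no core")
```

Keeping the server handling behind hooks means the same loop can be run in both the CLI-based and the MTR-based scenarios, which would make the resulting x out of y ratios comparable.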