[MDEV-24199] MariaDB Server fails to write a core x out of y times - Jira

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Duplicate
Affects Version/s: 10.5(EOL), 10.6
Fix Version/s: N/A
Component/s: Debug
Labels:
- affects-tests

Description

Loosely defining this bug, as there is a

We take as a baseline any machine which produces cores correctly most of the time, but fails to produce one in x out of y crashes
"x out of y": Approx 1 out of every 7 to 15 crashes, depending on the scenario, no core is written. Higher on high-load machines.
I have clearly observed this issue since I started working on MariaDB. It is not new/recent
The problem does not exist in MySQL Server nor in Percona Server
Marko has clearly observed the issue in his work as well
The issue impacts testing and test reporting stability
Issue likely present in all versions

Combining all current thoughts (Marko/Roel):

For clarity: the issue happens both without MTR (see Roel's notes below, CLI based) and with MTR (see Marko's notes below, MTR based)
Marko+Roel: any tuning (like ulimit -c unlimited and all other server tuning, and any core dump config in /etc/sysctl.conf) makes zero difference. There may have been a very small improvement, if any, from setting a correct core pattern in /etc/sysctl.conf, but the issue remains.
Roel: I have seen the issue happen plenty of times when only the CLI and the crashing SQL were involved, with no mysqladmin shutdown and no KILLs present. Execute the crashing SQL at the CLI prompt, exit the client quickly, and check for a core: mysqld has clearly crashed as per the error log, yet no core file is present. Repeat once or more and the core dump will be there.
Roel: the issue (i.e. no core file generated) seems more pronounced when exiting the CLI quickly after executing some crashing SQL, which seems odd given that core dumps would be mysqld bound, not mysql bound. Perhaps some "client hold/lock/trigger/status update" exists and affects core dump writing.
Roel: core dump writing either works or it doesn't: if the core is generated, it is generated correctly as a whole; if it is not generated, the file simply doesn't exist. No half-written files exist, which seems to somewhat negate my last point above, unless some "client based trigger" needs to hit mysqld before a given timeout/situation (likely, based on what I have seen), i.e. it is a "status" which mysqld needs from the client rather than a "lock" which requires a constant client connection.
Roel: IOW, there seems to be some sort of "delay" before a core is written, as described above. Perhaps it is best described not as a real delay, but as a "trigger", or "client hold/lock/trigger/status update" as described above.
Marko: often seen in combination with MTR aborting execution (without presenting any summary) after too many test failures. (But, that ought to be fixed in 10.2 this week.)
Marko: there are also 3 SIGKILLs in MTR that I suspect can ruin not only the core dump writes but also rr record runs (by killing the rr process)
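
When reproducing, it may help to record the settings mentioned above alongside each run, so that configuration can be ruled out. The following is a minimal sketch, assuming a Linux host; the function name `core_dump_settings` and the keys of the returned dictionary are illustrative and not part of MariaDB or MTR. It reports the core size limits of the calling process (child processes inherit them) and the kernel core pattern.

```python
import resource
from pathlib import Path


def core_dump_settings() -> dict:
    """Report this process's core size limits and the kernel core pattern."""
    # RLIMIT_CORE is the limit that `ulimit -c` controls in a shell.
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    try:
        # Assumption: Linux, where the pattern is exposed under /proc.
        pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
    except OSError:
        pattern = None  # not Linux, or /proc is unavailable
    return {
        "soft_unlimited": soft == resource.RLIM_INFINITY,
        "hard_unlimited": hard == resource.RLIM_INFINITY,
        "core_pattern": pattern,
        # A pattern starting with "|" hands the core to a helper program
        # instead of writing a file directly.
        "piped_to_helper": bool(pattern) and pattern.startswith("|"),
    }


if __name__ == "__main__":
    for key, value in core_dump_settings().items():
        print(f"{key}: {value}")
```

These are the limits of the process running the check. A mysqld started by other means (for example by a service manager) may have different limits, which on Linux can be read from /proc/<pid>/limits.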

Numerous attempts at clarifying the issue further have failed.

Current summary (Marko+Roel): good enough info to now log a bug, but not good enough info to find a fix.

This issue is logged in the hope that others have experiences other than those already mentioned above, and in the hope of getting this fixed.

Further thoughts

Perhaps a script which quickly brings up a server, crashes it with SQL at the CLI, exits the CLI immediately, counts the number of cores written, and loops may be able to provide a better x out of y ratio, but it may not help with finding the real cause
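
The counting loop described above could be sketched roughly as follows. This is an illustration only, not an existing tool. The names `count_missing_cores`, `run_crash`, `core_glob` and `wait_seconds` are hypothetical. `run_crash` stands for a user-supplied callable that starts the server, sends the crashing SQL, and exits the client. The `core*` glob is an assumption, since the actual file name depends on kernel.core_pattern.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Tuple


def _snapshot(core_dir: Path, core_glob: str) -> Dict[Path, int]:
    """Map each matching core file to its modification time in nanoseconds."""
    return {p: p.stat().st_mtime_ns for p in core_dir.glob(core_glob)}


def count_missing_cores(
    run_crash: Callable[[], None],
    core_dir: Path,
    iterations: int,
    core_glob: str = "core*",  # assumption: depends on kernel.core_pattern
    wait_seconds: float = 0.0,
) -> Tuple[int, int]:
    """Return (x, y): x crashes out of y left no new or rewritten core file."""
    missing = 0
    for _ in range(iterations):
        before = _snapshot(core_dir, core_glob)
        run_crash()  # hypothetical: start server, crash it via SQL, exit client
        deadline = time.monotonic() + wait_seconds
        while True:
            after = _snapshot(core_dir, core_glob)
            # A core counts if it is new or was overwritten since the snapshot.
            fresh = [p for p, mtime in after.items() if before.get(p) != mtime]
            if fresh or time.monotonic() >= deadline:
                break
            time.sleep(0.1)  # give the kernel time to finish writing the core
        if not fresh:
            missing += 1
    return missing, iterations
```

A caller would pass its own `run_crash` and the directory where cores land, then report the two returned numbers as the x out of y ratio. Comparing modification times rather than file names alone is a design choice so that a core pattern which reuses one file name is not miscounted as a missing core.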

Attachments

Issue Links

duplicates

MDEV-21010 Mariadb hangs (during a backtrace), stops responding to new connections

Stalled

relates to

MDEV-21010 Mariadb hangs (during a backtrace), stops responding to new connections

Stalled

MDEV-29568 libelf (specificly libdw) based stack resolver

Open

MDEV-24217 Add --invoke-on-crash option to mysqld allowing better and non-failing debugging traces

Needs Feedback

MDEV-25330 fflush(stderr) call improvement in signal_handler.cc

Closed

Activity

People

Assignee: Daniel Black

Reporter: Roel Van de Paar

Votes: 1

Watchers: 8

Dates

Created: 2020-11-12 06:57

Updated: 2024-12-10 05:49

Resolved: 2024-12-10 05:49

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 1d 1h 17m
