[MDEV-32686] crash information to include Distro information Created: 2023-11-06  Updated: 2023-11-10

Status: In Review
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.4.31
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Daniel Black Assignee: Sergei Golubchik
Resolution: Unresolved Votes: 0
Labels: None


 Description   

When resolving bug reports, sometimes is not immediately obvious which distro is in use. Sometimes core dumps are provided, and with the right disto, this is resolvable without further user questioning.

Linux distros have converged on /etc/os-release to communicate the release information in a standard form.

There's sometimes up to 22 lines of information in this file, which is a bit verbose.

Looking at the samples in the following urls, 3 lines appears sufficient to communicate distro and version across a significant set of distros.

So lets include these in the crash handler information.



 Comments   
Comment by Sergei Golubchik [ 2023-11-06 ]

please, let's not add even more code into the crash handler and even more output to the already long crash message.
it should only print information from the process memory, about this particular crash.

OS information can be easily added to the log after the crash by some external tool. E.g. mariadbd-safe can do, like

if [[ $? -gt 128 ]]; then
  cat /proc/self/limits # admittedly, the server can have different limits, but not the core file size
  cat /proc/sys/kernel/core_pattern
  cat /proc/version
  mount -a
  pwd
  cat /etc/*[-_]{release,version}
fi

Comment by Daniel Black [ 2023-11-06 ]

mariadbd-safe, like most external tools, an obsolete script that no-one uses any more.

The purpose was with a few syscalls get the most information as quickly as possible, and since os-release is pretty much all of Linux, FreeBSD and Solaris compatible, was an easy one shot to get all of them.

Comment by Sergei Golubchik [ 2023-11-06 ]

Correct, I only used mariadbd-safe as an example, I didn't dare to write an example of how it could've been done in systemd service, you know it much better than I.

The point is still the same, it's not something that needs to be read from the memory of the crashed process.

Comment by Daniel Black [ 2023-11-07 ]

So containers execute mariadbd directly and no other process is left running. systemd services run mariadbd directly.

So handlers we can count on are kernel.core_pattern, whatever the user has it set to. This is also kernel wide and still isn't settable in a namespace.

On kernel.core_pattern handers, systemd-coredump would record enough.

If users reported with coredump like below, there's enough information to determine OS,

coredumpctl info

$ coredumpctl info
           PID: 100932 (mariadbd)
           UID: 1000 (dan)
           GID: 1000 (dan)
        Signal: 11 (SEGV)
     Timestamp: Tue 2023-11-07 12:01:09 AEDT (1h 4min ago)
  Command Line: sql/mariadbd --no-defaults --skip-networking --datadir=/tmp/build-mariadb-server-10.4-datadir --socket=/tmp/build-mariadb-server-10.4.sock --plugin-dir=/home/dan/repos>
    Executable: /home/dan/repos/build-mariadb-server-10.4/sql/mysqld
 Control Group: /user.slice/user-1000.slice/user@1000.service/app.slice/app-org.gnome.Terminal.slice/vte-spawn-2eb19cf1-f861-49ee-b491-a9a77e592174.scope
          Unit: user@1000.service
     User Unit: vte-spawn-2eb19cf1-f861-49ee-b491-a9a77e592174.scope
         Slice: user-1000.slice
     Owner UID: 1000 (dan)
       Boot ID: 5af5d0a737fc48f1811ebc61af430e63
    Machine ID: f92b0c4cdca64582a4d2b3171560f8ae
      Hostname: bark
       Storage: /var/lib/systemd/coredump/core.mariadbd.1000.5af5d0a737fc48f1811ebc61af430e63.100932.1699318869000000.zst (present)
  Size on Disk: 427.6K
       Message: Process 100932 (mariadbd) of user 1000 dumped core.
                
                Module libzstd.so.1 from rpm zstd-1.5.5-4.fc39.x86_64
                Module libcap.so.2 from rpm libcap-2.48-7.fc39.x86_64
                Module libz.so.1 from rpm zlib-1.2.13-4.fc39.x86_64
                Module libsystemd.so.0 from rpm systemd-254.5-2.fc39.x86_64
                Module libaio.so.1 from rpm libaio-0.3.111-16.fc39.x86_64
                Module libsnappy.so.1 from rpm snappy-1.1.10-2.fc39.x86_64
                Module libbz2.so.1 from rpm bzip2-1.0.8-16.fc39.x86_64
                Module liblzma.so.5 from rpm xz-5.4.4-1.fc39.x86_64
                Module liblz4.so.1 from rpm lz4-1.9.4-4.fc39.x86_64
                Module libcrypt.so.2 from rpm libxcrypt-4.4.36-2.fc39.x86_64
                Stack trace of thread 100932:
                #0  0x00007f940745cbab kill (libc.so.6 + 0x3ebab)
                #1  0x0000000000864ee5 n/a (/home/dan/repos/build-mariadb-server-10.4/sql/mysqld + 0x464ee5)
                ELF object binary architecture: AMD x86-64

On Ubuntu system abrt would have enough info, but not enabled by default.

But as we've seen, it seldom happens.

Some recent work (3 weeks ago) added delation from the systemd-coredump to a container systemd instigates. Then we'd be packing a core hander in the container, which we can do. Not the most common of use case.

On container, at least the Docker Official Image, we might (conditional on Docker Inc accepting it) be able to have a handler that happens to match common kernel.core_pattern paths.

On systemd level there is ExecStopPost which has access to the error code and some reporting information could be made here. It seems a little beyond intended purpose however.

In containers the filesystem namespace is a temporary construct tied to the process, so its mariadbd that has visibility of this and the namespace disappears with the process.

So I'm still thinking the server is the right place for this.

On output, yes its a bit long. Can we remove the following?

key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=0
max_threads=153
thread_count=5
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 467724 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Seems dated, thread_count might have some use, but a full backtrace will show this anyway.

And remove/replace this?

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

Could be shorter "Please include the information from the server start to the end information below".

Comment by Sergei Golubchik [ 2023-11-09 ]

Yes, please, by all means feel free to remove old outdated text from the crash message.

But it won't fix the problem that crash handler does too much (when the memory is possibly corrupted so anything it's doing increases a chance of something going very wrong). And prints too much. For example, limits there aren't helpful at all. Only the coredump size limit is (and only when core dumping failed, but I'm not sure you can always detect it).

I suppose if it has to be printed, this information must be collected before the crash into some fixed buffer. And then printed in case of a crash.

Comment by Daniel Black [ 2023-11-10 ]

> Yes, please, by all means feel free to remove old outdated text from the crash message.

Ok removed in update PR.

> But it won't fix the problem that crash handler does too much (when the memory is possibly corrupted so anything it's doing increases a chance of something going very wrong).

Removing outdated text now access less THD data (query, connection, optimizer_switch - had doubts about removing this).

> And prints too much. For example, limits there aren't helpful at all. Only the coredump size limit is (and only when core dumping failed, but I'm not sure you can always detect it).

I added a patch to exclude resource limits that where softlimit to unlimited which reduces the size of it by ~1/2.

> I suppose if it has to be printed, this information must be collected before the crash into some fixed buffer. And then printed in case of a crash.

Yes, reusing the same buffers that were already there.

Crash sample with change in PR:

231110 12:03:20 [ERROR] sql/mariadbd got signal 11 ;
Sorry, we probably made a mistake, and this is a bug.
 
Your assistance in bug reporting will enable us to fix this for the next release.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs about how to report
a bug on https://jira.mariadb.org/.
 
Please include the information from the server start above, to the end of the
information below.
 
Server version: 10.4.32-MariaDB-debug source revision: e83a954019f5628d594c3abf2d05bb24602772d1
 
The information page at https://mariadb.com/kb/en/how-to-produce-a-full-stack-trace-for-mariadbd/
contains instructions to obtain a better version of the backtrace below.
Following these instructions will help MariaDB developers provide a fix quicker.
 
Attempting backtrace. Include this in the bug report.
(note: Retrieving this information may fail)
 
Thread pointer: 0x7f6500000da0
stack_bottom = 0x7f65802fec70 thread_stack 0x49000
addr2line: 'sql/mariadbd': No such file
sql/mariadbd(my_print_stacktrace+0x40)[0x55faa7332574]
sql/mariadbd(handle_fatal_signal+0x252)[0x55faa69d3948]
/lib64/libc.so.6(+0x3e9a0)[0x7f658157b9a0]
addr2line: 'sql/mariadbd': No such file
sql/mariadbd(_ZNK8Item_ref12type_handlerEv+0x17)[0x55faa65de96b]
sql/mariadbd(_ZNK8Item_ref12type_handlerEv+0x3a)[0x55faa65de98e]
sql/mariadbd(_ZNK4Item11result_typeEv+0x25)[0x55faa65516ff]
sql/mariadbd(_ZN8Item_ref10check_colsEj+0x2b)[0x55faa65df0b7]
sql/mariadbd(_ZN4Item31fix_fields_if_needed_for_scalarEP3THDPPS_+0x4d)[0x55faa65516c7]
sql/mariadbd(_ZN15Item_direct_ref10fix_fieldsEP3THDPP4Item+0x3c)[0x55faa6a1b972]
sql/mariadbd(_ZN4Item20fix_fields_if_neededEP3THDPPS_+0x59)[0x55faa655166f]
sql/mariadbd(_ZN9Item_func10fix_fieldsEP3THDPP4Item+0x296)[0x55faa6a68fe4]
sql/mariadbd(_ZN4Item20fix_fields_if_neededEP3THDPPS_+0x59)[0x55faa655166f]
sql/mariadbd(_ZN4Item31fix_fields_if_needed_for_scalarEP3THDPPS_+0x2b)[0x55faa65516a5]
sql/mariadbd(_ZN4Item29fix_fields_if_needed_for_boolEP3THDPPS_+0x2b)[0x55faa65de061]
sql/mariadbd(_ZN9Item_cond10fix_fieldsEP3THDPP4Item+0x364)[0x55faa6a34c02]
sql/mariadbd(_ZN4JOIN14optimize_innerEv+0xd38)[0x55faa66c5684]
sql/mariadbd(_ZN4JOIN8optimizeEv+0xc6)[0x55faa66c401a]
sql/mariadbd(_ZN17Item_in_subselect8optimizeEPdS0_+0x120)[0x55faa6ac40ba]
sql/mariadbd(_Z21setup_jtbm_semi_joinsP4JOINP4ListI10TABLE_LISTERS1_I4ItemE+0xbe)[0x55faa68869e0]
sql/mariadbd(_ZN4JOIN14optimize_innerEv+0x1266)[0x55faa66c5bb2]
sql/mariadbd(_ZN4JOIN8optimizeEv+0xc6)[0x55faa66c401a]
sql/mariadbd(_ZN13st_select_lex31optimize_unflattened_subqueriesEb+0x255)[0x55faa66417d3]
sql/mariadbd(_ZN4JOIN31optimize_unflattened_subqueriesEv+0x24)[0x55faa6884cb0]
sql/mariadbd(_ZN4JOIN15optimize_stage2Ev+0x2038)[0x55faa66c8a20]
sql/mariadbd(_ZN4JOIN14optimize_innerEv+0x1f8e)[0x55faa66c68da]
sql/mariadbd(_ZN4JOIN8optimizeEv+0xc6)[0x55faa66c401a]
sql/mariadbd(_ZN18st_select_lex_unit8optimizeEv+0x525)[0x55faa6793f8b]
sql/mariadbd(_ZN18st_select_lex_unit4execEv+0x16d)[0x55faa67941cb]
sql/mariadbd(_Z11mysql_unionP3THDP3LEXP13select_resultP18st_select_lex_unitm+0x84)[0x55faa678e936]
sql/mariadbd(_Z13handle_selectP3THDP3LEXP13select_resultm+0xb6)[0x55faa66be7c0]
sql/mariadbd(+0x882a6c)[0x55faa667fa6c]
sql/mariadbd(_Z21mysql_execute_commandP3THD+0x1483)[0x55faa6675f5a]
sql/mariadbd(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x2cc)[0x55faa6683c73]
sql/mariadbd(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x10e9)[0x55faa666f536]
sql/mariadbd(_Z10do_commandP3THD+0x862)[0x55faa666dd9a]
sql/mariadbd(_Z24do_handle_one_connectionP7CONNECT+0x1cd)[0x55faa68116e6]
sql/mariadbd(handle_one_connection+0x30)[0x55faa681144e]
/lib64/libc.so.6(+0x8e897)[0x7f65815cb897]
/lib64/libc.so.6(+0x1156bc)[0x7f65816526bc]
 
Connection ID (thread ID): 8
Status: NOT_KILLED
Query (0x7f6500015ba8): WITH RECURSIVE x ( x ) AS ( SELECT 1.000000 ^ 1.000000 UNION SELECT 1 - x FROM x ) SELECT DISTINCT * FROM x UNION SELECT NOT ( SELECT x FROM ( SELECT 1.000000 ^ 'x' * 1.000000 / 1 ^ x = ( SELECT x FROM x WHERE x IN ( WITH x AS ( WITH x AS ( SELECT * FROM x WHERE x / 1 = x % 1 ) SELECT ( NULL = 1.000000 ) OR ( ( x % 1 ) = 1 ) OR ( x BETWEEN 1 AND 1 ) AS x , x + NULL FROM x WHERE x = CASE x WHEN 'x' THEN 'x' WHEN 1 THEN 'x' ELSE x END WINDOW x AS ( PARTITION BY x ORDER BY x DESC ) ) SELECT x FROM x AS x WHERE ( x = 'x' OR x = 'x' ) AND x IS NOT NULL GROUP BY - 'x' >= x ) UNION SELECT 1 - x FROM x ) AND ( x = CASE WHEN 'x' THEN 'x' ELSE x END ) = 1 FROM x GROUP BY x ) AS x WHERE x = 'x' AND x IN ( SELECT - 1 BETWEEN ( SELECT x FROM x AS x WHERE EXISTS ( SELECT x , x * x FROM x ORDER BY ( x BETWEEN ( SELECT DISTINCT x WHERE x BETWEEN ( WITH x AS ( SELECT * FROM x GROUP BY x HAVING ( 1 NOT IN ( ( x < 1 OR 1 - 1 ) , 1 ) ) WINDOW x AS ( ) ) SELECT x AS x FROM x AS x GROUP BY x HAVING x ) AND 1 ) AND 1 ) , x ) GROUP BY x ORDER BY x * 1 ) AND 1 AS x FROM x WHERE x = 'x' GROUP BY x HAVING x ) ) FROM x WHERE ( x = 1 ) OR ( x = 1 ) OR ( x BETWEEN 1 AND 1 ) OR ( x = 1 ) OR ( x BETWEEN 1 AND 1 ) OR 1 OR ( 1 IN ( 1 , 1 ) ) OR ( x BETWEEN 1 AND 1 ) OR ( x = 1 ) OR ( x + 1 = ( SELECT DISTINCT x IS NULL FROM x ) OR x > 1 OR ( x = 1 AND ( x = x OR x = x ) ) ) OR ( x = 1 ) GROUP BY x , x HAVING ( 1 = 1 AND ( FALSE < x ) = 1 ) ORDER BY x + x
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on
 
Operating System information:
NAME="Fedora Linux"
VERSION="39 (Workstation Edition)"
ID=fedora
 
Writing a core file...
Working directory at /tmp/build-mariadb-server-10.4-debug-datadir
Resource Limits (excludes unlimited resources):
Limit                     Soft Limit           Hard Limit           Units     
Max stack size            8388608              unlimited            bytes     
Max processes             127235               127235               processes 
Max open files            32198                32198                files     
Max locked memory         8388608              8388608              bytes     
Max pending signals       127235               127235               signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Core pattern: |/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h
 
Kernel version: Linux version 6.5.9-300.fc39.x86_64 (mockbuild@61da1adf097747ceaa237204a5dfa88e) (gcc (GCC) 13.2.1 20230918 (Red Hat 13.2.1-3), GNU ld version 2.40-13.fc39) #1 SMP PREEMPT_DYNAMIC Wed Oct 25 21:39:20 UTC 2023
 
Segmentation fault (core dumped)

Generated at Thu Feb 08 10:33:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.