[MDEV-29098] mariadb crash on shutdown with signal 11 Created: 2022-07-13  Updated: 2024-01-05

Status: Open
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.5.13
Fix Version/s: 10.5

Type: Bug Priority: Major
Reporter: Matthew Walker Assignee: Oleksandr Byelkin
Resolution: Unresolved Votes: 1
Labels: None

Attachments: File variable_output.tar    

 Description   

On shutdown mariadb logs "InnoDB: FTS optimize thread exiting." then gets a signal 11 error after 3 minutes.

Seems to be a regression of MDEV-25518, but I'm not sure.

2022-06-21 18:30:26 0 [Note] InnoDB: FTS optimize thread exiting.
220621 18:33:46 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.5.13-MariaDB-log
key_buffer_size=26214400
read_buffer_size=131072
max_used_connections=24
max_threads=122
thread_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 75347 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
/ngfw/usr/sbin/mariadbd(my_print_stacktrace+0x2e)[0x55f43b80ae6e]
/ngfw/usr/sbin/mariadbd(handle_fatal_signal+0x485)[0x55f43b2a1fa5]



 Comments   
Comment by Daniel Black [ 2022-07-14 ]

Matthew Walker, thanks for the bug report. Is this happening on every shutdown? If you upgraded from a previous version, was it happening on that version? Did it occur immediately after upgrading to 10.5.13?

I'm hoping you can provide more details related to this bug report, as there is currently too little information to go on when attempting a resolution.

Were there additional messages below what you have posted already? A more detailed stack trace would help narrow it down.

If a core dump was saved, locally or via a handler like systemd's coredumpctl, extracting a backtrace using gdb after installing the debug symbol packages (covered earlier at the same link) would be useful.
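
For reference, a minimal sketch of the extraction step being suggested — the binary path matches the one in the log above, but the core path, output file, and helper name are hypothetical:

```shell
# Hypothetical helper: extract a full backtrace from a saved core dump.
# Run it with the exact binary that crashed plus the core file.
collect_backtrace() {
    bin=$1; core=$2; out=${3:-mariadbd_full_bt.txt}
    command -v gdb >/dev/null 2>&1 || { echo "gdb not installed" >&2; return 0; }
    [ -r "$core" ] || { echo "core file $core not readable" >&2; return 0; }
    # 'thread apply all bt full' records locals as well as the call chain.
    gdb -batch "$bin" "$core" \
        -ex 'set pagination off' \
        -ex 'thread apply all bt full' > "$out" 2>&1
    echo "backtrace written to $out"
}

# Example (paths are assumptions):
# collect_backtrace /ngfw/usr/sbin/mariadbd /var/crash/core.mariadbd
```

On systemd hosts, `coredumpctl gdb mariadbd` opens gdb on the most recent matching core directly, which skips locating the core file by hand.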

Was there a reason you listed the wsrep component? Have you tried, or are you willing to try, a 10.5.16 version?

Comment by Matthew Walker [ 2022-07-14 ]

>Is this happening every shutdown?
Not every shutdown, but we have seen it 3 times in our product testing.

>was it happening on the previous version
Not that we have seen, although we have encountered MDEV-25518

>Did it occur immediately after upgrading to 10.5.13?
No, this happens after the DB has been running for a long time. We don't shut down the DB in our product often, only when upgrading to a new product version, which is when we see this error occur.

>Was there addition messages below what you have put already
No, that's everything in the mysql-server.err log related to the crash. I can upload the full file if requested, but it just contains otherwise normal start/stop logs.

>If a core dump was saved
One was, and we're working on getting the backtrace.

>Was there a reason you listed the wsrep component?
Only to mirror MDEV-25518. I've removed it now.

>Have you tried, or are you willing to try a 10.5.16 version?
We are willing; this happens in our product upgrade framework, right before the steps that move to 10.5.16. So this bug is blocking us from moving to that version in newer product versions.

Hope this info helps; I'll try to get that backtrace ASAP. If there's any other info I can grab, I'd be happy to.

Comment by Daniel Black [ 2022-07-15 ]

Looking forward to the backtrace.

I looked at what comes between the "InnoDB: FTS optimize thread exiting." message and the next message, "InnoDB: Starting shutdown..." (which didn't arrive), and it's either a) the pre-shutdown of other plugins (probably less likely, but if you've got non-default plugins installed please say so), or b) srv_purge_shutdown, the shutdown of the purge threads.

Assuming for now that it's the purge shutdown: do you have innodb_fast_shutdown set to a non-default value of 1, or any other non-default InnoDB purge settings?

Do you have any global status information from before the shutdown, particularly InnoDB counters like Innodb_history_list_length that are relevant to the purge?
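
For what it's worth, a sketch of how such a snapshot could be captured before a planned shutdown — the helper name, output path, and reliance on the default mariadb client credentials are assumptions:

```shell
# Hypothetical helper: snapshot InnoDB status before a planned shutdown,
# so purge-related state (e.g. Innodb_history_list_length) survives a crash.
snapshot_innodb_status() {
    out=${1:-/tmp/innodb_status_before_shutdown.txt}
    command -v mariadb >/dev/null 2>&1 || { echo "mariadb client not found" >&2; return 0; }
    {
        # Counters, including the purge history list length:
        mariadb -e "SHOW GLOBAL STATUS LIKE 'Innodb%'"
        # Any non-default purge settings in effect:
        mariadb -e "SHOW GLOBAL VARIABLES LIKE 'innodb%purge%'"
        # Full engine status, including the TRANSACTIONS section:
        mariadb -e "SHOW ENGINE INNODB STATUS\G"
    } > "$out" 2>&1
    echo "status saved to $out"
}
```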

I haven't found any duplicate bug reports to indicate a later version fixes this (based on purge-related segfaults, MDEV-28601 might be an outlying chance for > 4G of pending transactions on shutdown).

Comment by Daniel Black [ 2022-07-15 ]

Note with the stacktrace, some data structures around the SEGV would be appreciated even if you are guessing these. Please keep the gdb session handy as with a stack trace we can provide better guidance as to what data might be useful to determine how it got into this state.

Comment by Matthew Walker [ 2022-07-18 ]

Still working on getting the backtrace; I'm having issues because this is an older version of our product, so it's hard to get the exact binary that failed.

>if you've got non-default installed plugins please say so
Nope, none whatsoever.

>do you have innodb_fast_shutdown set to a non-default 1
No, we have it set to zero. I'm going to include all of our InnoDB settings; maybe something there will help. If there are any others that could help, just let me know.

innodb-buffer-pool-size=384M
innodb_buffer_pool_chunk_size=128M
ssl-cipher=DEFAULT:!EDH
net_write_timeout=600
net_read_timeout=360
innodb_lock_wait_timeout=1800
key_buffer_size=1775900K
innodb_flush_neighbors=0
innodb = FORCE
innodb-strict-mode = 1
innodb-flush-method = O_DIRECT
innodb-fast-shutdown = 0
innodb-log-file-size = 128M
innodb-flush-log-at-trx-commit = 2
innodb-file-per-table = 1
innodb-io-capacity = 500
innodb-file-format = Barracuda
secure-auth = 0
innodb_buffer_pool_dump_at_shutdown = 0
innodb_buffer_pool_load_at_startup = 0
innodb_checksum_algorithm=full_crc32
binlog_format=row
group_concat_max_len=1024
innodb_open_files=800
innodb_purge_threads=4

Comment by Matthew Walker [ 2022-07-19 ]

Still having a bit of trouble getting a full backtrace; I'll paste what I have:

#0 0x000014adfdb47241 in ?? ()
#1 0x0000564a2b746400 in mem_root ()
#2 0x0000564a2b746400 in mem_root ()
#3 0x000014adf5def830 in ?? ()
#4 0x0000564a2a98dfc0 in handle_fatal_signal (sig=11) at /mariadb-x86_64/mariadb-10.5.13/sql/signal_handler.cc:344

Comment by Daniel Black [ 2022-07-20 ]

Try going up to frame 4, and p *current_thd. It might be null as it's in shutdown; if not, it will contain a query and possibly a small amount of sensitive data, so look closely before pasting.

Other global variables that might be saved in the core that might describe some current state:

  • purge_sys
  • trx_sys
  • purge_coordinator_task
  • purge_coordinator_task_group
  • buf_lru_flush_page_count
  • buf_page_cleaner_is_active
  • buf_LRU_stat_cur
  • srv_sys_space
  • ibuf
  • log_sys
  • recv_sys
  • srv_shutdown_lsn
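
The list above could be dumped from the core in a single gdb batch run; a sketch, with the binary/core paths and helper name as placeholders:

```shell
# Hypothetical helper: print each suggested global from the saved core dump.
dump_innodb_globals() {
    bin=$1; core=$2; out=${3:-innodb_globals.txt}
    command -v gdb >/dev/null 2>&1 || { echo "gdb not installed" >&2; return 0; }
    [ -r "$core" ] || { echo "core $core not readable" >&2; return 0; }
    cmdfile=$(mktemp)
    # One 'print' command per variable from the list above:
    for v in purge_sys trx_sys purge_coordinator_task purge_coordinator_task_group \
             buf_lru_flush_page_count buf_page_cleaner_is_active buf_LRU_stat_cur \
             srv_sys_space ibuf log_sys recv_sys srv_shutdown_lsn; do
        printf 'print %s\n' "$v" >> "$cmdfile"
    done
    gdb -batch -x "$cmdfile" "$bin" "$core" > "$out" 2>&1
    rm -f "$cmdfile"
    echo "globals written to $out"
}
```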
Comment by Matthew Walker [ 2022-07-20 ]

Here's the info on frame 4; I'm not seeing anything for current_thd, unfortunately.

(gdb) info frame 4
Stack frame at 0x1b:
rip = 0x564a2a98dfc0 in handle_fatal_signal (/mariadb-x86_64/mariadb-10.5.13/sql/signal_handler.cc:344); saved rip = <not saved>
Outermost frame: previous frame inner to this frame (corrupt stack?)
caller of frame at 0x14adf5def7a0
source language c++.
Arglist at 0xb, args: sig=11
Locals at 0xb, Previous frame's sp is 0x1b
Cannot access memory at address 0xffffffffffffffe3

I've collected the output for those vars. Some of them are quite long, so I've put each into a text file named after the variable and packaged them so this comment doesn't become unreadable. variable_output.tar

Hope this helps!

Comment by Marko Mäkelä [ 2022-09-26 ]

I do not know if this might be related, but I just came across this shutdown crash of a 10.6-based branch:

bb-10.6-MDEV-19229 d9ae31d3a5f86eafc368356ecbed60c32c317547

2022-09-24  4:57:15 0 [Note] InnoDB: FTS optimize thread exiting.
2022-09-24  4:57:15 0 [Note] InnoDB: Starting shutdown...
2022-09-24  4:57:15 0 [Note] InnoDB: Dumping buffer pool(s) to /dev/shm/var/1/mysqld.2/data/ib_buffer_pool
2022-09-24  4:57:15 0 [Note] InnoDB: Restricted to 126 pages due to innodb_buf_pool_dump_pct=25
2022-09-24  4:57:15 0 [Note] InnoDB: Buffer pool(s) dump completed at 220924  4:57:15
220924  4:57:15 [ERROR] mysqld got signal 11 ;
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x49000
mysys/stacktrace.c:212(my_print_stacktrace)[0x49f328ac]
sql/signal_handler.cc:234(handle_fatal_signal)[0x4983b104]
addr2line: '': No such file
[0x3fffab200478]
[0x3fffa53d446c]
??:0(CRYPTO_lock)[0x3fffaacc56f4]
??:0(RAND_poll)[0x3fffaad7e7b4]
??:0(RAND_poll)[0x3fffaad7f1c4]
??:0(ERR_remove_thread_state)[0x3fffaad80bb8]
sql/sql_connect.cc:1322(handle_one_connection)[0x496af080]
2022-09-24  4:57:15 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2022-09-24  4:57:15 0 [Note] InnoDB: Shutdown completed; log sequence number 60001; transaction id 61

The output suggests that some thread crashed while InnoDB shutdown was still in progress in other threads. According to the output, InnoDB did shut down successfully despite the crash that occurred somewhere outside InnoDB.

Comment by Marko Mäkelä [ 2022-09-26 ]

The ERR_remove_thread_state() in "my" crash is defined in OpenSSL and invoked via a macro at the end of handle_one_connection():

#if defined(HAVE_OPENSSL) && !defined(EMBEDDED_LIBRARY)
  ERR_remove_state(0);
#endif
  my_thread_end();
  return 0;
}

I do not currently think that the crash involves InnoDB at all.

Comment by Marko Mäkelä [ 2024-01-05 ]

Based on the contents of variable_output.tar, InnoDB had already been shut down, and the server is crashing somewhere else due to memory corruption. The contents of recv_sys look corrupted, which could be a sign of a memory overwrite somewhere. It is only used during early InnoDB startup to perform crash recovery, and it should be cleaned up after that.

Generated at Thu Feb 08 10:05:53 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.