[MDEV-32630] split_materialized: SIGSEGV in best_access_path() Created: 2023-10-31  Updated: 2023-12-04

Status: Open
Project: MariaDB Server
Component/s: Optimizer, Server
Affects Version/s: 10.6.15, 10.11.4
Fix Version/s: 10.6, 10.11

Type: Bug Priority: Major
Reporter: Patrick Winnem Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: split_materialized
Environment:

Thinkserver SR650
Debian 12 bookworm

Mariadb installed from Debian apt repository


Attachments: Text File backtrace.dump.log     Text File error-wo-queries.log    

 Description   

Some history:
We have had issues with this server for a while. The issues are only present on the replica, the master runs fine.

Up until last week the two servers ran the same Debian and MariaDB version, as well as config. Last week we rebuilt the server, going from Debian 11 and MariaDB 10.6.7 to Debian 12 and MariaDB 10.11.4.

We had 6 segfaults on the replica, the following is the shortened log of this, more complete logs in attachment. (I removed some noise form leftover partitions, hostnames and the queries, its the same query in all five instances and it might be relevant but also sensitive for our customer, let me know if needed and i will upload the complete error.log to ftp)

231030 10:54:21 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

231030 10:56:16 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

231030 11:01:57 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

231030 11:06:05 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

231030 13:36:02 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

231030 13:37:38 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

These segfaults broke replication and we had to manually fix records from binlog as the total datasize is too large to clone in a reasonable amount of time.

Please let me know if there is more information i can provide. We dont have core dump from these crashes as it was not enabled in the config, it is now tho. If needed i should be able to provide more debug on the next crash.
Im not experienced in debugging applications so i might need some guidance, i have however read up on how to perform a backtrace etc.



 Comments   
Comment by Marko Mäkelä [ 2023-10-31 ]

The first stack trace in error-wo-queries.log looks plausible to me. Here it is pretty-printed through c++filt:

mariadb-10.11.4

/usr/sbin/mariadbd(best_access_path(JOIN*, st_join_table*, unsigned long long, POSITION const*, unsigned int, bool, double, POSITION*, POSITION*)+0x10bf)[0x55ef1f7b61df]
/usr/sbin/mariadbd(+0x83be94)[0x55ef1f7b8e94]
/usr/sbin/mariadbd(+0x8403fa)[0x55ef1f7bd3fa]
/usr/sbin/mariadbd(choose_plan(JOIN*, unsigned long long)+0x318)[0x55ef1f7be9e8]
/usr/sbin/mariadbd(+0x86375f)[0x55ef1f7e075f]
/usr/sbin/mariadbd(JOIN::optimize_inner()+0x1395)[0x55ef1f7e5505]
/usr/sbin/mariadbd(JOIN::optimize()+0xaa)[0x55ef1f7e5a9a]
/usr/sbin/mariadbd(+0x7b207e)[0x55ef1f72f07e]
/usr/sbin/mariadbd(mysql_handle_single_derived(LEX*, TABLE_LIST*, unsigned int)+0xa2)[0x55ef1f72e8a2]
/usr/sbin/mariadbd(st_select_lex::handle_derived(LEX*, unsigned int)+0x47)[0x55ef1f74a6a7]
/usr/sbin/mariadbd(JOIN::optimize_inner()+0x10d6)[0x55ef1f7e5246]
/usr/sbin/mariadbd(JOIN::optimize()+0xaa)[0x55ef1f7e5a9a]
/usr/sbin/mariadbd(mysql_select(THD*, TABLE_LIST*, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*)+0xca)[0x55ef1f7e5b7a]
/usr/sbin/mariadbd(handle_select(THD*, LEX*, select_result*, unsigned long long)+0x13f)[0x55ef1f7e635f]
/usr/sbin/mariadbd(+0x7e8fa0)[0x55ef1f765fa0]
/usr/sbin/mariadbd(mysql_execute_command(THD*, bool)+0x3dc5)[0x55ef1f772865]
/usr/sbin/mariadbd(Prepared_statement::execute(String*, bool)+0x522)[0x55ef1f794f22]
/usr/sbin/mariadbd(Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+0x95)[0x55ef1f7950b5]
/usr/sbin/mariadbd(+0x818f25)[0x55ef1f795f25]
/usr/sbin/mariadbd(mysqld_stmt_execute(THD*, char*, unsigned int)+0x2b)[0x55ef1f7960ab]

This looks like an optimizer bug to me. We have or had some open bugs in this area, such as MDEV-31818 and MDEV-31403.

To diagnose this, we may need the table schema, the queries, and possibly some data as well. You can obfuscate the names of tables, columns and indexes, but the end result should be valid SQL and equivalent to your actual input.

Comment by Patrick Winnem [ 2023-10-31 ]

Hello, thanks for your prompt response!
I cannot supply the background data at this time, we are discussing with our customer regarding this. The query reads too many tables for me to be able to obfuscate the data easily..
Please let me know if this is absolutely needed for progression on hunting down the root cause.
I manually ran the query and i can confirm that mariadb will crash when it is executed. This query is only a read query and it is not run on the master, only the slave. Im not sure if it would also lead to a master crash if run there.

As i ran the query manually i now have a core dump and i have run this through dbd with debug symbols as stated in the guide from mariadb as well as debian and uploaded.

Comment by Sergei Petrunia [ 2023-10-31 ]

Looking at the complete stack trace:

/lib/x86_64-linux-gnu/libc.so.6(+0x3bfd0)[0x7feb2145afd0]
/usr/sbin/mariadbd(best_access_path(JOIN*, st_join_table*, unsigned long long, POSITION const*, unsigned int, bool, double, POSITION*, POSITION*)+0x10bf)[0x55ef1f7b61df]
/usr/sbin/mariadbd(+0x83be94)[0x55ef1f7b8e94]
/usr/sbin/mariadbd(+0x8403fa)[0x55ef1f7bd3fa]
/usr/sbin/mariadbd(choose_plan(JOIN*, unsigned long long)+0x318)[0x55ef1f7be9e8]
/usr/sbin/mariadbd(st_join_table::choose_best_splitting(unsigned int, unsigned long long, POSITION const*, unsigned long long*)+0xc0e)[0x55ef1f929f7e]
/usr/sbin/mariadbd(best_access_path(JOIN*, st_join_table*, unsigned long long, POSITION const*, unsigned int, bool, double, POSITION*, POSITION*)+0x170)[0x55ef1f7b5290]
/usr/sbin/mariadbd(+0x83be94)[0x55ef1f7b8e94]
/usr/sbin/mariadbd(+0x8403fa)[0x55ef1f7bd3fa]
/usr/sbin/mariadbd(+0x84129f)[0x55ef1f7be29f]
/usr/sbin/mariadbd(choose_plan(JOIN*, unsigned long long)+0x318)[0x55ef1f7be9e8]
/usr/sbin/mariadbd(+0x86375f)[0x55ef1f7e075f]
/usr/sbin/mariadbd(JOIN::optimize_inner()+0x1395)[0x55ef1f7e5505]
/usr/sbin/mariadbd(JOIN::optimize()+0xaa)[0x55ef1f7e5a9a]
/usr/sbin/mariadbd(+0x7b207e)[0x55ef1f72f07e]
/usr/sbin/mariadbd(mysql_handle_single_derived(LEX*, TABLE_LIST*, unsigned int)+0xa2)[0x55ef1f72e8a2]
/usr/sbin/mariadbd(st_select_lex::handle_derived(LEX*, unsigned int)+0x47)[0x55ef1f74a6a7]
/usr/sbin/mariadbd(JOIN::optimize_inner()+0x10d6)[0x55ef1f7e5246]
/usr/sbin/mariadbd(JOIN::optimize()+0xaa)[0x55ef1f7e5a9a]
/usr/sbin/mariadbd(mysql_select(THD*, TABLE_LIST*, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*)+0xca)[0x55ef1f7e5b7a]
/usr/sbin/mariadbd(handle_select(THD*, LEX*, select_result*, unsigned long long)+0x13f)[0x55ef1f7e635f]
/usr/sbin/mariadbd(+0x7e8fa0)[0x55ef1f765fa0]
/usr/sbin/mariadbd(mysql_execute_command(THD*, bool)+0x3dc5)[0x55ef1f772865]
/usr/sbin/mariadbd(Prepared_statement::execute(String*, bool)+0x522)[0x55ef1f794f22]
/usr/sbin/mariadbd(Prepared_statement::execute_loop(String*, bool, unsigned char*, unsigned char*)+0x95)[0x55ef1f7950b5]
/usr/sbin/mariadbd(+0x818f25)[0x55ef1f795f25]
/usr/sbin/mariadbd(mysqld_stmt_execute(THD*, char*, unsigned int)+0x2b)[0x55ef1f7960ab]
/usr/sbin/mariadbd(dispatch_command(enum_server_command, THD*, char*, unsigned int, bool)+0x1911)[0x55ef1f776561]
/usr/sbin/mariadbd(do_command(THD*, bool)+0x137)[0x55ef1f777267]
/usr/sbin/mariadbd(do_handle_one_connection(CONNECT*, bool)+0x3b7)[0x55ef1f887f27]
/usr/sbin/mariadbd(handle_one_connection+0x5d)[0x55ef1f8882bd]

Things to note:
There is always this:

/usr/sbin/mariadbd(st_join_table::choose_best_splitting(unsigned int, unsigned long long, POSITION const*, unsigned long long*)+0xc0e)[0x55ef1f929f7e]

We are considering split-materialized.

and this: we're executing a prepared statement:

/usr/sbin/mariadbd(Prepared_statement::execute(String*, bool)+0x522)[0x55ef1f794f22]

Comment by Sergei Petrunia [ 2023-10-31 ]

nem, could you please check if this workaround helps to avoid the crash: Put in my.cnf:

optimizer_switch='split_materialized=off'

Comment by Sergei Petrunia [ 2023-10-31 ]

Note: the contents of the attached backtrace.dump.log do not seem to make any sense:

#11 0x0000000000726576 in calculate_cond_selectivity_for_table (thd=0x5645945270b8, table=0x3af, 
cond=0x1935d1fbb) at ./sql/opt_range.cc:3524
        param = {<RANGE_OPT_PARAM> = {thd = 0x7f8107ffec20, table = 0x5645936cb075, prev_tables = 0, 
 read_tables = 0, current_table = 140192161719384, key_parts = 0x7f81c8020f25,
 key_parts_end = 0x7f81c8012520, mem_root = 0x7f8107ffe370, old_root = 0x7f81c80124c8, keys = 2490729776,
 using_real_indexes = 69, remove_jump_scans = 86, remove_false_where_parts = false, real_keynr = {2499638641, 
 ..

the values of {[table}}, keys, real_keynr look totally meaningless...

Comment by Patrick Winnem [ 2023-10-31 ]

Hello,

The query runs correctly when split_materialized is set to off.

Generated at Thu Feb 08 10:32:48 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.