[MDEV-16618] Executing stored procedure crashes mysqld Created: 2018-06-28  Updated: 2018-09-20  Resolved: 2018-09-20

Status: Closed
Project: MariaDB Server
Component/s: Server
Affects Version/s: 10.1.24
Fix Version/s: 10.1.29, 10.2.11, 10.3.3

Type: Bug Priority: Critical
Reporter: Lori Patrick Assignee: Unassigned
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS 7.3.


Issue Links:
Relates
relates to MDEV-13936 Server crashes in Time_and_counter_tr... Closed

 Description   

we recently upgraded a host from 5.5.50 to 10.1.24; and now get the following error when a certain stored procedure is executed. It does not happen every time and immediately after restart - the stored procedure call finishes without issue. I dropped and recreated all the stored procedures on the host in case it was some sort of corruption, but after a couple days - the server crashed again reporting the following in the error log:

180628  7:01:35 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
 
Server version: 10.1.24-MariaDB
key_buffer_size=5368709120
read_buffer_size=270336
max_used_connections=251
max_threads=1252
thread_count=236
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 8163227 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7f270138f008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f23da2410b0 thread_stack 0x48400
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7f2c6dacc33e]
/usr/sbin/mysqld(handle_fatal_signal+0x305)[0x7f2c6d5f20a5]
/lib64/libpthread.so.0(+0xf370)[0x7f2c6cc0a370]
/usr/sbin/mysqld(_ZN4JOIN4execEv+0x11)[0x7f2c6d4c86e1]
/usr/sbin/mysqld(_Z12mysql_selectP3THDPPP4ItemP10TABLE_LISTjR4ListIS1_ES2_jP8st_orderSB_S2_SB_yP13select_resultP18st_select_lex_unitP13st_select_lex+0x112)[0x7f2c6d4c4dc2]
mysys/stacktrace.c:268(my_print_stacktrace)[0x7f2c6d44a0bb]
sql/sql_analyze_stmt.h:97(Time_and_counter_tracker::incr_loops())[0x7f2c6d44b3d4]
sql/sql_derived.cc:942(mysql_derived_fill(THD*, LEX*, TABLE_LIST*))[0x7f2c6d4a53df]
sql/sql_select.cc:21464(create_sort_index(THD*, JOIN*, st_order*, unsigned long long, unsigned long long, bool))[0x7f2c6d4be698]
sql/sql_select.cc:3186(JOIN::exec_inner())[0x7f2c6d4c6637]
sql/sql_select.cc:2512(JOIN::exec())[0x7f2c6d4c8724]
sql/sql_select.cc:3447(mysql_select(THD*, Item***, TABLE_LIST*, unsigned int, List<Item>&, Item*, unsigned int, st_order*, st_order*, Item*, st_order*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*))[0x7f2c6d4c4dc2]
sql/sql_select.cc:384(handle_select(THD*, LEX*, select_result*, unsigned long))[0x7f2c6d4c58a5]
sql/sql_parse.cc:5923(execute_sqlcom_select(THD*, TABLE_LIST*))[0x7f2c6d464971]
sql/sql_parse.cc:2979(mysql_execute_command(THD*))[0x7f2c6d470fbb]
sql/sp_head.cc:3213(sp_instr_stmt::exec_core(THD*, unsigned int*))[0x7f2c6d700d86]
sql/sp_head.cc:2978(sp_lex_keeper::reset_lex_and_exec_core(THD*, unsigned int*, bool, sp_instr*))[0x7f2c6d707853]
sql/sp_head.cc:3129(sp_instr_stmt::execute(THD*, unsigned int*))[0x7f2c6d707e05]
sql/sp_head.cc:1318(sp_head::execute(THD*, bool))[0x7f2c6d703d46]
sql/sp_head.cc:2103(sp_head::execute_procedure(THD*, List<Item>*))[0x7f2c6d705397]
sql/sql_parse.cc:2415(mysql_execute_command(THD*))[0x7f2c6d4714c8]
sql/sql_parse.cc:7340(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x7f2c6d474592]
sql/sql_parse.cc:1490(dispatch_command(enum_server_command, THD*, char*, unsigned int))[0x7f2c6d477a6a]
sql/sql_parse.cc:1111(do_command(THD*))[0x7f2c6d4782ca]
sql/sql_connect.cc:1349(do_handle_one_connection(THD*))[0x7f2c6d54039a]
sql/sql_connect.cc:1263(handle_one_connection)[0x7f2c6d540540]
/lib64/libpthread.so.0(+0x7dc5)[0x7f2c6cc02dc5]
/lib64/libc.so.6(clone+0x6d)[0x7f2c6b02176d]
 
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f2402f8b3c8): is an invalid pointer
Connection ID (thread ID): 19291
Status: NOT_KILLED
 
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=off
 
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
 
We think the query pointer is invalid, but we will try to print it anyway.
Query: select *
from
((select
r.trade_date, r.user_id, r.context,
cast(@runningNet := @runningNet + r.net_pnl as decimal(12,2)) as running_net, r.net_pnl,
cast(@runningGross := @runningGross + r.gross_pnl as decimal(12,2)) as running_gross, r.gross_pnl,
cast(@runningCosts := @runningCosts + r.costs as decimal(12,2)) as running_costs, r.costs,
cast(r.round_turns as signed) as round_turns,
cast((@runningGross - @runningCosts) / (@count := @count + 1) as decimal(12,2)) as avg_net,
cast(@runningCosts * 100 / @runningGross as decimal(6,2)) as cost_ratio
from
((select
trade_date, user_id, context,
sum(net_pnl) as net_pnl,
sum(gross_pnl) as gross_pnl,
sum(costs) as costs,
((sum(buys) + sum(sells)) / 2) as round_turns
from daily where context =  NAME_CONST('ctxt',_latin1'Tahoe' COLLATE 'latin1_swedish_ci') group by trade_date order by trade_date asc) as r)) as a) order by trade_date desc
 
180628 07:01:36 mysqld_safe Number of processes running now: 0
180628 07:01:36 mysqld_safe mysqld restarted



 Comments   
Comment by Lori Patrick [ 2018-06-28 ]

my.cnf for the host:

#
# This  started with MariaDB example config file for systems with 4GB of memory
# running mostly MariaDB using InnoDB only tables and performing complex
# queries with few connections.
#
 
#
# The following options will be read by MariaDB client applications.
[client]
port            = 3306
socket          = /srv/db_trn/run/mysql.sock
 
 
[mysqld]
 
# generic configuration options
port            = 3306
socket          = /srv/db_trn/run/mysql.sock
pid-file        = /srv/db_trn/run/ch4pxdbsilo04.pid
 
datadir         = /srv/dbdata_fast/mysql
back_log        = 50
max_connections = 1250
table_open_cache = 1000
table_definition_cache = 1000
skip-external-locking
max_allowed_packet = 16M
optimizer_switch = default
#thread_cache_size = 16
#thread_concurrency = 8
query_cache_type = 0
query_cache_size = 0
query_cache_limit = 16M
default_storage_engine = InnoDB
#general_log = 1
#general_log_file = /srv/db_trn/log/mysql/ch4pxdbsilo04-general.log
userstat        = 1
plugin-load     = auth_pam.so
wait_timeout    = 86400
innodb_strict_mode = 0
sql_mode = ''
 
# charset variables
init-connect='SET NAMES latin1'
character_set_client = latin1
character_set_server = latin1
skip-character-set-client-handshake
 
# logging variables
slow_query_log  = 1
long_query_time = 10
slow_query_log_file = /srv/db_trn/log/mysql/ch4pxdbsilo04-slow.log
log-warnings    = 2
log_error       = /srv/db_trn/log/mysql/ch4pxdbsilo04-error.log
sql-error-log-filename = /srv/db_trn/log/mysql/ch4pxdbsilo04_sql_errors.log
 
# binlog specific settings
binlog_cache_size = 1M
log_bin         = 1
log-bin         = /srv/db_trn/log/binlog/ch4pxdbsilo04-bin
binlog_format   = mixed
expire_logs_days = 7
max_binlog_size = 1000M
 
# session variables
read_buffer_size = 265k
#sort_buffer_size = 4M
read_rnd_buffer_size = 8M
#join_buffer_size = 4M
 
# tmp variables
tmp_table_size = 250M
tmpdir          = /srv/dbtmp
 
#  Replication related settings
server-id = 4
#master-host    = master
#master-user    = replication
#master-password =  Gelber@Repl!master
#master-port    = 3306
read_only
log_slave_updates = 1
slave-load-tmpdir = /srv/dbtmp
#skip-slave-start
log-slow-slave-statements
 
 
# MyISAM Specific options
key_buffer_size = 5G
max_heap_table_size = 1G
bulk_insert_buffer_size = 32M
myisam_sort_buffer_size = 64M
myisam_max_sort_file_size = 25G
myisam_repair_threads = 4
myisam_recover_option = backup
 
# INNODB Specific options
innodb_file_per_table = on
innodb_additional_mem_pool_size = 8M
innodb_buffer_pool_size = 25G
innodb_data_file_path = ibdata1:10M:autoextend
innodb_data_home_dir = /srv/dbdata_fast/mysql
#innodb_thread_concurrency = 8
innodb_flush_log_at_trx_commit = 2
innodb_log_buffer_size = 8M
innodb_log_file_size = 256M
innodb_log_group_home_dir = /srv/db_trn/log/mysql
innodb_read_io_threads =  64
innodb_write_io_threads =  64
innodb_max_dirty_pages_pct = 80
innodb_flush_method=O_DIRECT
innodb_stats_on_metadata = 0
innodb_print_all_deadlocks = 1
innodb_file_format = barracuda
innodb_print_all_deadlocks = 1
innodb_open_files = 1000
 
[mysqldump]
quick
max_allowed_packet = 64M
 
[mysql]
no-auto-rehash
 
[myisamchk]
key_buffer_size = 512M
sort_buffer_size = 512M
read_buffer = 8M
write_buffer = 8M
 
[mysqlhotcopy]
interactive-timeout
 
[safe_mysqld]
log_error       = /srv/db_trn/log/mysql/ch4pxdbsilo04-error.log
socket          = /srv/db_trn/run/mysql.sock
pid-file        = /srv/db_trn/run/chi0dbcore01.pid
numa-interleave

Comment by Elena Stepanova [ 2018-06-28 ]

10.1.24 is over a year old, if you upgraded just recently, why did you choose this version? Is it something that CentOS provides?
If you are installing from MariaDB repo, there have been 10 releases in 10.1 line since then, and lots of bugs fixed in this area. I recommend to upgrade to the recent release and see if it helps.

Comment by Lori Patrick [ 2018-06-28 ]

I went with 10.1.24 because it was the most recent version I have tested in our environment. I am not in a position to just take the most recent version as it needs to go thru a test cycle. I had been upgrading to 10.2.8, but we ran into another bug and pulled back to 10.1.24 as it was tested in our environment. I am trying to get rid of all my 5x servers, and I was then going to move forward with 10.2.13 or above as it has a fix for the issues I tripped into in 10.2.8.

Is this a known bug I have tripped into? Is there a setting or config fix I can do to not have to upgrade?

Comment by Lori Patrick [ 2018-06-28 ]

stored procedure being called:

DELIMITER $$
 
DROP PROCEDURE IF EXISTS `pnl_by_context`$$
 
CREATE DEFINER=`root`@`%` PROCEDURE `pnl_by_context`(ctxt VARCHAR(15))
BEGIN
SET @count = 0, @runningGross = 0, @runningCosts = 0, @runningNet = 0;
SELECT *
FROM
((SELECT 
r.trade_date, r.user_id, r.context,
CAST(@runningNet := @runningNet + r.net_pnl AS DECIMAL(12,2)) AS running_net, r.net_pnl,
CAST(@runningGross := @runningGross + r.gross_pnl AS DECIMAL(12,2)) AS running_gross, r.gross_pnl, 
CAST(@runningCosts := @runningCosts + r.costs AS DECIMAL(12,2)) AS running_costs, r.costs, 
CAST(r.round_turns AS SIGNED) AS round_turns,
CAST((@runningGross - @runningCosts) / (@count := @count + 1) AS DECIMAL(12,2)) AS avg_net,
CAST(@runningCosts * 100 / @runningGross AS DECIMAL(6,2)) AS cost_ratio
FROM
((SELECT 
trade_date, user_id, context,
SUM(net_pnl) AS net_pnl, 
SUM(gross_pnl) AS gross_pnl, 
SUM(costs) AS costs, 
((SUM(buys) + SUM(sells)) / 2) AS round_turns 
FROM daily WHERE context = ctxt GROUP BY trade_date ORDER BY trade_date ASC) AS r)) AS a) ORDER BY trade_date DESC;
END$$
 
DELIMITER ;

Comment by Elena Stepanova [ 2018-06-28 ]

> I went with 10.1.24 because it was the most recent version I have tested in our environment

It makes total sense, except that since it crashes, obviously it is not good enough for your environment. I quite understand the tendency to stay with a release, however old, which works for you, but this one doesn't, so there is no point sticking to it. I don't suggest switching to 10.2 right away, it's too radical and might be uncalled for if you do not need any particular features only available in 10.2; but it's more than reasonable to consider taking advantage of 10 bugfixing releases in the same 10.1 release line. While yes, regressions are always possible and do happen, on average bugfixing releases do much more good than harm, and 20th post-GA release is much more stable than first GA. So, please test one of the new ones in your environment (you might want to extend your tests of course, if they didn't catch the current problem).

> Is this a known bug I have tripped into?

As I said, there had been many bugs fixed in this area over the past year. You can search JIRA and see it yourself. It is also possible that it's one of still open known bugs, or an unknown one. If you can provide a complete test case with all structures, data and query ("run this SQL on a clean server => get the crash"), then I can check it against different versions to see where the problem exists and where it doesn't, and will let you know. Otherwise, without any indication that it is still reproducible on any reasonably recent version, we cannot launch a full-blown investigation on the off-chance that it's a still existing and unknown to us bug.

> Is there a setting or config fix I can do to not have to upgrade?

Possibly, but again, we cannot know it without a complete analysis, and we cannot do it for a year-old version without a good reason. If you provide the complete test case, we might be able to offer some advice, otherwise you can dig JIRA to see if you find anything similar and if there are any workarounds in there.

If you need support for old versions, you might consider acquiring a support contract. If you already have one, please file a ticket through it.

Comment by Lori Patrick [ 2018-06-28 ]

Thanks Elena! I hear what you're saying. I have 30+ servers and this was one of the last two that would get me off 5x. It was my understanding that 10.1.24 was the last version that didn't require me to compile glibc for CentOS 6.5 and that is why I stuck with it as I still have some CentOS 6.5 hosts.. Part of my test cycle was to get to these hosts last as they are the highest risk - so a new version even 10.1.x would set me back quite a bit. Unfortunate I tripped into this problem as I was nearly done getting everything to 10 and I am pretty surprised I found such a serious bug this late in the upgrade cycle.

I will pull together all the structures involved and add this to the ticket. Thanks!

Comment by Lori Patrick [ 2018-08-31 ]

I have not been able to reproduce it consistently. I have tried bringing down the size of innodb_buffer_pool to 100MB and open tables down to 100. Running several restores of mysql dumps files concurrently and looping thru the stored procedures that cause the crash at the same time to try to reproduce. It seems like it happens once something need to recycle (as in a dirty but unused cache that is now needed and needs to recycle), but I can't seem to configure the server to be able to reproduce at will. Until I can do that - testing a new version fixes the problem is a bit difficult.

Any ideas what config I can run to be a minimal as possible and potentially fill what ever cache or buffer needs to recycle that trips me into this crash?

Comment by Lori Patrick [ 2018-09-17 ]

Hi. I have been trying to reproduce this error to no avail. I have tried:

  • looping thru stored procedures involved from both windows and linux 10,000 times -> no crash
  • running above concurrently -> no crash
  • brought down buffer pool and was able to get innodb to throw the following:
    InnoDB: Warning: difficult to find free blocks in
    InnoDB: the buffer pool (339 search iterations)!
    InnoDB: 0 failed attempts to flush a page! Consider
    InnoDB: increasing the buffer pool size.
    InnoDB: It is also possible that in your Unix version
    InnoDB: fsync is very slow, or completely frozen inside
    InnoDB: the OS kernel. Then upgrading to a newer version
    InnoDB: of your operating system may help. Look at the
    InnoDB: number of fsyncs in diagnostic info below.
    InnoDB: Pending flushes (fsync) log: 0; buffer pool: 1
    InnoDB: 523993 OS file reads, 2186607 OS file writes, 266724 OS fsyncs
    InnoDB: Starting InnoDB Monitor to print further
    InnoDB: diagnostics to the standard output.
    but not crash
  • have used dumps that came from 5.5.34 and restored and it wouldn't crash
  • run tests against 10.1.24 replicated copy and wouldn't crash

If I can't reproduce the crash, I can't confirm a newer version fixes and upgrade accordingly. Based on the error on crash - can someone provide any guidance as to what area of the server I need to adjust or stress to reproduce?

Comment by Lori Patrick [ 2018-09-19 ]

This is issue fixed by MDEV-13936. It looks like I tripped into a known bug. I can reproduce my crash on demand at this point. This issue can be closed.

Comment by Elena Stepanova [ 2018-09-20 ]

Thanks for checking.

Generated at Thu Feb 08 08:30:16 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.