[MDEV-21010] Mariadb hangs, stops responding to new connections Created: 2019-11-08  Updated: 2020-12-17

Status: Open
Project: MariaDB Server
Component/s: Backup
Affects Version/s: 10.3.17
Fix Version/s: 10.3

Type: Bug Priority: Major
Reporter: Richard Assignee: Vicențiu Ciorbaru
Resolution: Unresolved Votes: 1
Labels: crash
Environment:

Debian buster 10.3.17-MariaDB-0+deb10u1 amd64


Attachments: File mysqld_bt_all_threads.4234.txt.gz     File mysqld_full_bt_all_threads.4234.txt.gz    

 Description   

During my daily mysqldump (called by the Debian automysqlbackup script) a few times per week mysql hangs. The TCP listener and socket stay up but no longer accept new queries. The only resolution is to kill -9 the mysqld process.

I was able to do a gdb stack trace in this state, see attachments.

The Debian bug report is https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=943962

The error.log says:

corrupted size vs. prev_size
191108 7:22:01 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.3.17-MariaDB-0+deb10u1-log
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=6
max_threads=65546
thread_count=7
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 142762793 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f33e4178cc8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f341981ace8 thread_stack 0x49000

My query.log has this right before the crash:

13019 Init DB drupal6
13019 Query SHOW CREATE DATABASE IF NOT EXISTS `drupal6`
13019 Query SHOW TABLES LIKE '%'
13019 Query LOCK TABLES `access` READ /!32311 LOCAL */,`accesslog` READ /!32311 LOCAL /,`actions` READ /!32311 LOCAL /,`actions_aid` READ /!32311 LOCAL /,`advanced_help_index` READ /!32311 LOCAL /,`authmap` READ /!32311 LOCAL */,`ba
tch` READ /!32311 LOCAL */,`blocks` READ /!32311 LOCAL /,`blocks_roles` READ /!32311 LOCAL /,`boxes` READ /!32311 LOCAL /,`cache` READ /!32311 LOCAL /,`cache_block` READ /!32311 LOCAL /,`cache_content` READ /!32311 LOCAL /,`cache_filter` READ /!323
11 LOCAL /,`cache_form` READ /!32311 LOCAL /,`cache_menu` READ /!32311 LOCAL /,`cache_page` READ /!32311 LOCAL /,`cache_update` READ /!32311 LOCAL /,`cache_views` READ /!32311 LOCAL /,`cache_views_data` READ /!32311 LOCAL /,`comments` READ /!32311
LOCAL /,`content_group` READ /!32311 LOCAL /,`content_group_fields` READ /!32311 LOCAL /,`content_node_field` READ /!32311 LOCAL /,`content_node_field_instance` READ /!32311 LOCAL /,`content_type_event` READ /!32311 LOCAL */,`date_format_locale` READ /
!32311 LOCAL */,`date_format_types` READ /!32311 LOCAL /,`date_formats` READ /!32311 LOCAL /,`files` READ /!32311 LOCAL /,`filter_formats` READ /!32311 LOCAL /,`filters` READ /!32311 LOCAL /,`flood` READ /!32311 LOCAL /,`history` READ /!32311 LOCAL
/,`image` READ /!32311 LOCAL /,`image_attach` READ /!32311 LOCAL /,`img_assist_map` READ /!32311 LOCAL /,`languages` READ /!32311 LOCAL /,`locales_source` READ /!32311 LOCAL /,`locales_target` READ /!32311 LOCAL /,`menu_custom` READ /!32311 LOCAL
/,`menu_links` READ /!32311 LOCAL /,`menu_router` READ /!32311 LOCAL /,`messaging_message_parts` READ /!32311 LOCAL /,`messaging_store` READ /!32311 LOCAL /,`modr8_log` READ /!32311 LOCAL /,`node` READ /!32311 LOCAL /,`node_access` READ /!32311 LOC
AL /,`node_access_role` READ /!32311 LOCAL /,`node_access_user` READ /!32311 LOCAL /,`node_comment_statistics` READ /!32311 LOCAL /,`node_counter` READ /!32311 LOCAL /,`node_revisions` READ /!32311 LOCAL /,`node_type` READ /!32311 LOCAL */,`notificat
ions` READ /!32311 LOCAL */,`notifications_event` READ /!32311 LOCAL /,`notifications_fields` READ /!32311 LOCAL /,`notifications_queue` READ /!32311 LOCAL /,`notifications_sent` READ /!32311 LOCAL /,`permission` READ /!32311 LOCAL /,`poll` READ /!32
311 LOCAL /,`poll_choices` READ /!32311 LOCAL /,`poll_votes` READ /!32311 LOCAL /,`revision_moderation` READ /!32311 LOCAL /,`role` READ /!32311 LOCAL /,`search_dataset` READ /!32311 LOCAL /,`search_index` READ /!32311 LOCAL */,`search_node_links` RE
AD /!32311 LOCAL */,`search_total` READ /!32311 LOCAL /,`semaphore` READ /!32311 LOCAL /,`sessions` READ /!32311 LOCAL /,`system` READ /!32311 LOCAL /,`term_data` READ /!32311 LOCAL /,`term_hierarchy` READ /!32311 LOCAL /,`term_node` READ /!32311 L
OCAL /,`term_relation` READ /!32311 LOCAL /,`term_synonym` READ /!32311 LOCAL /,`trigger_assignments` READ /!32311 LOCAL /,`upload` READ /!32311 LOCAL /,`url_alias` READ /!32311 LOCAL /,`user_import` READ /!32311 LOCAL /,`user_import_errors` READ /
!32311 LOCAL /,`users` READ /!32311 LOCAL /,`users_roles` READ /!32311 LOCAL /,`variable` READ /!32311 LOCAL /,`views_display` READ /!32311 LOCAL /,`views_object_cache` READ /!32311 LOCAL /,`views_view` READ /!32311 LOCAL /,`vocabulary` READ /!3231
1 LOCAL /,`vocabulary_node_types` READ /!32311 LOCAL /,`watchdog` READ /!32311 LOCAL /,`workflow_access` READ /!32311 LOCAL /,`workflow_node` READ /!32311 LOCAL /,`workflow_node_history` READ /!32311 LOCAL /,`workflow_scheduled_transition` READ /!323
11 LOCAL /,`workflow_states` READ /!32311 LOCAL /,`workflow_transitions` READ /!32311 LOCAL /,`workflow_type_map` READ /!32311 LOCAL /,`workflows` READ /!32311 LOCAL /,`wysiwyg` READ /!32311 LOCAL /,`xmlsitemap` READ /!32311 LOCAL */,`xmlsitemap_node
` READ /!32311 LOCAL */,`xmlsitemap_taxonomy` READ /!32311 LOCAL */
191108 7:22:01 13019 Query SELECT engine FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = DATABASE() AND table_name = 'access'
[snip]
13019 Init DB drupal6
13019 Query select @@collation_database
13019 Query SELECT TRIGGER_NAME FROM INFORMATION_SCHEMA.TRIGGERS WHERE EVENT_OBJECT_SCHEMA = DATABASE() AND EVENT_OBJECT_TABLE = 'url_alias'
13019 Query SET SESSION character_set_results = 'utf8mb4'
13019 Query SELECT engine FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = DATABASE() AND table_name = 'user_import'
13019 Query SET SQL_QUOTE_SHOW_CREATE=1
13019 Query SET SESSION character_set_results = 'binary'
13019 Query show create table `user_import`
13019 Query SET SESSION character_set_results = 'utf8mb4'
13019 Query show fields from `user_import`
13019 Query SELECT /*!40001 SQL_NO_CACHE */ `import_id`, `name`, `filename`, `oldfilename`, `filepath`, `started`, `pointer`, `processed`, `valid`, `first_line_skip`, `contact`, `username_space`, `send_email`, `field_match`, `roles`, `options
`, `setting` FROM `user_import`



 Comments   
Comment by Sergey Vojtovich [ 2019-11-08 ]

The problem seems to be caused by this thread:

Thread 33 (Thread 0x7f341981b700 (LWP 3996)):
#0  __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:63
#1  0x00007f34323a51bb in __malloc_fork_lock_parent () at arena.c:156
#2  0x00007f34323e78ca in __libc_fork () at ../sysdeps/nptl/fork.c:73
#3  0x0000556cd1ee9520 in start_addr2line_fork (binary_path=0x7ffd6f1d6ed4 "/usr/sbin/mysqld") at ./mysys/my_addr_resolve.c:172
#4  start_addr2line_fork (binary_path=0x7ffd6f1d6ed4 "/usr/sbin/mysqld") at ./mysys/my_addr_resolve.c:155
#5  0x0000556cd1ee9648 in my_addr_resolve (ptr=0x556cd1ed68be <my_print_stacktrace+46>, loc=loc@entry=0x7f3419817680) at ./mysys/my_addr_resolve.c:220
#6  0x0000556cd1ed693a in print_with_addr_resolve (n=<optimized out>, addrs=0x7f34198176a0) at ./mysys/stacktrace.c:254
#7  my_print_stacktrace (stack_bottom=<optimized out>, thread_stack=<optimized out>, silent=silent@entry=0 '\000') at ./mysys/stacktrace.c:273
#8  0x0000556cd1a086ed in handle_fatal_signal (sig=6) at ./sql/signal_handler.cc:206
#9  <signal handler called>
#10 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#11 0x00007f3432343535 in __GI_abort () at abort.c:79
#12 0x00007f343239a508 in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7f34324a528d "%s\n") at ../sysdeps/posix/libc_fatal.c:181
#13 0x00007f34323a0c1a in malloc_printerr (str=str@entry=0x7f34324a3376 "corrupted size vs. prev_size") at malloc.c:5341
#14 0x00007f34323a4e00 in _int_realloc (av=av@entry=0x7f33cc000020, oldp=oldp@entry=0x7f33cc6f0230, oldsize=oldsize@entry=1376, nb=nb@entry=2032) at malloc.c:4591
#15 0x00007f34323a5f3b in __GI___libc_realloc (oldmem=oldmem@entry=0x7f33cc6f0240, bytes=bytes@entry=2024) at malloc.c:3222
#16 0x0000556cd1ed31ef in my_realloc (oldpoint=0x7f33cc6f0248, size=2016, my_flags=64) at ./mysys/my_malloc.c:166
#17 0x0000556cd1e1759f in mi_alloc_rec_buff (info=info@entry=0x7f33cc057f08, length=length@entry=1905, buf=buf@entry=0x7f33cc058030) at ./storage/myisam/mi_open.c:762
#18 0x0000556cd1e12992 in _mi_read_rnd_dynamic_record (info=0x7f33cc057f08, buf=0x7f33cc053030 "\001", filepos=1016, skip_deleted_blocks=1 '\001') at ./storage/myisam/mi_dynrec.c:1811
#19 0x0000556cd1a0d3d7 in handler::ha_rnd_next (this=0x7f33cc052900, buf=0x7f33cc053030 "\001") at ./sql/handler.cc:2840
#20 0x0000556cd1b13e5c in rr_sequential (info=0x7f33e4182098) at ./sql/records.h:70
#21 0x0000556cd186bbc3 in READ_RECORD::read_record (this=0x7f33e4182098) at ./sql/records.h:69
#22 sub_select (end_of_records=false, join_tab=0x7f33e4181fd0, join=0x7f33e41806f8) at ./sql/sql_select.cc:19510
#23 sub_select (join=0x7f33e41806f8, join_tab=0x7f33e4181fd0, end_of_records=false) at ./sql/sql_select.cc:19426
#24 0x0000556cd188cc24 in do_select (procedure=<optimized out>, join=0x7f33e41806f8) at ./sql/sql_select.cc:19031
#25 JOIN::exec_inner (this=this@entry=0x7f33e41806f8) at ./sql/sql_select.cc:4044
#26 0x0000556cd188ce83 in JOIN::exec (this=this@entry=0x7f33e41806f8) at ./sql/sql_select.cc:3838
#27 0x0000556cd188cfcf in mysql_select (thd=0x7f33e4178cc8, tables=0x7f33e417ffd8, wild_num=0, fields=..., conds=<optimized out>, og_num=0, order=0x0, group=0x0, having=0x0, proc_param=0x0, select_options=551905659648, result=0x7f33e41806d0, unit=0x7f33e417c9d8, select_lex=0x7f33e417d158) at ./sql/sql_select.cc:4243
#28 0x0000556cd188d8ad in handle_select (thd=thd@entry=0x7f33e4178cc8, lex=lex@entry=0x7f33e417c910, result=result@entry=0x7f33e41806d0, setup_tables_done_option=setup_tables_done_option@entry=0) at ./sql/sql_select.cc:374
#29 0x0000556cd182f34c in execute_sqlcom_select (thd=0x7f33e4178cc8, all_tables=0x7f33e417ffd8) at ./sql/sql_parse.cc:6286
#30 0x0000556cd183bd9f in mysql_execute_command (thd=0x7f33e4178cc8) at ./sql/sql_parse.cc:3816
#31 0x0000556cd183db89 in mysql_parse (thd=thd@entry=0x7f33e4178cc8, rawbuf=<optimized out>, length=261, parser_state=parser_state@entry=0x7f341981a4c0, is_com_multi=is_com_multi@entry=false, is_next_command=is_next_command@entry=false) at ./sql/sql_parse.cc:7829
#32 0x0000556cd1840640 in dispatch_command (command=COM_QUERY, thd=0x7f33e4178cc8, packet=<optimized out>, packet_length=<optimized out>, is_com_multi=<optimized out>, is_next_command=<optimized out>) at ./sql/sql_class.h:1137
#33 0x0000556cd1841280 in do_command (thd=thd@entry=0x7f33e4178cc8) at ./sql/sql_parse.cc:1401
#34 0x0000556cd194a437 in threadpool_process_request (thd=0x7f33e4178cc8) at ./sql/threadpool_common.cc:359
#35 tp_callback (c=0x556cd4693c60) at ./sql/threadpool_common.cc:186
#36 0x0000556cd19df4e0 in worker_main (param=0x556cd4414938) at ./sql/threadpool_generic.cc:1606
#37 0x00007f3432d90fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#38 0x00007f343241a4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

For some reason realloc() fails, which may be a bug in MariaDB. But then, unexpectedly, realloc attempts to reporting some error message like "corrupted size vs. prev_size", which is apparently done under some mutex protection. This blocks all mallocs in other threads. Error reporting routine crashes and calls MariaDB SIGSEGV handler, which in turn attempts allocating memory and gets blocked on that mutex.

There're 2 problems here:
1. realloc failure as such
2. the fact that realloc may crash while holding a mutex sounds very alarming: our SIGSEGV handler does mallocs

Comment by Richard [ 2019-11-08 ]

I'm not knowledgable enough about MariaDB to comment on problem 1, but I agree that problem 2 is very alarming. In the 15+ years of using mysql/mariadb I have never had the process hang on me. It leaves all my applications dead in the water, several days per week.

Comment by Eugene Kosov (Inactive) [ 2019-11-13 ]

It's incorrect to call async-unsafe functions from signal handlers. I think that the first warning here describes topic problem https://www.boost.org/doc/libs/1_71_0/doc/html/stacktrace/getting_started.html#stacktrace.getting_started.handle_terminates_aborts_and_seg

Comment by Richard [ 2019-11-27 ]

So will the signal handler be changed? The info at the link @kevg posted makes sense to me..

Comment by Sergey Vojtovich [ 2019-11-27 ]

According to man 7 signal, fork is listed among async-safe functions.

Comment by Vladislav Vaintroub [ 2019-11-27 ]

well, obviously man is not telling the full truth.

Comment by Sergey Vojtovich [ 2019-11-27 ]

Interesting, at the same time malloc() is not listed...

Comment by Richard [ 2019-12-23 ]

Ok, so changing the SIGSEGV handler to only use async-safe functions is a major change that will not likely be implemented any time soon.

How about the cause for the SIGSEGV in the first place, the "realloc failure as such" in the stack trace. Is that solvable? Do you need any more info from me?

I have several stack traces from 10.3.18-0+deb10u1 now as well. Do you need those?

Comment by Sergey Vojtovich [ 2019-12-23 ]

RichieB, I believe for now we have enough info to debug realloc failure. Meanwhile you can disable stack trace dump with stack-trace=off. It should hopefully solve your problem.

Comment by Eugene Kosov (Inactive) [ 2019-12-23 ]

> Interesting, at the same time malloc() is not listed...

It takes a global lock.

  1. one thread takes that lock
  2. another one crashes
  3. signal handler tries to take that lock too
  4. deadlock
Comment by Richard [ 2020-11-22 ]

Until the signal handler is changed to avoid this deadlock please ship Mariadb with a default config of "stack-trace=off". Having your database suddenly hang and stop processing requests is not acceptable in any situation.

Generated at Thu Feb 08 09:03:54 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.