|
dreas, Could you please upload the error log from the slave startup and till the moment you had to kill it? Thanks.
|
|
Please comment if you have further information on the issue, and the report will be re-opened.
|
|
So this is occurring again on a server:
~ # mysql -e "show full processlist" | head -n4
Id User Host db Command Time State Info Progress
2 system user NULL Connect 524935 Waiting for master to send event NULL 0.000
4 system user replication_database1 Connect 155648 After opening tables COMMIT 0.000
102091094 exim localhost local_database Query 155641 Opening tables SELECT CONCAT(host, ':', port) FROM replication_database2.table1 WHERE domain='example.com' ORDER BY id 0.000
Happy to collect more details, however not sure how to do so. There is nothing in mysql.log around the time this issue occurred.
|
|
After a kill -6:
161201 11:47:29 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.0.28-MariaDB-1~sid
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=2001
max_threads=2002
thread_count=198
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4658156 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x55bde5373e1e]
/usr/sbin/mysqld(handle_fatal_signal+0x2fc)[0x55bde4edebcc]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11100)[0x7ff723bca100]
/lib/x86_64-linux-gnu/libc.so.6(__poll+0x2d)[0x7ff7220dc56d]
/usr/sbin/mysqld(_Z26handle_connections_socketsv+0x1e7)[0x55bde4cc83f7]
/usr/sbin/mysqld(_Z11mysqld_mainiPPc+0xff6)[0x55bde4ccc5f6]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7ff72201d2b1]
/usr/sbin/mysqld(_start+0x2a)[0x55bde4cc041a]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Writing a core file
161201 11:47:33 mysqld_safe Number of processes running now: 0
161201 11:47:33 mysqld_safe mysqld restarted
|
|
I'm uploading mdev10104.tgz with the core dump, MySQL errors logs, my.cnf, and MariaDB binaries to the private FTP.
|
|
We hit what appears to be the same bug again with some different conditions:
root@server ~ # mysql -e "SHOW FULL processlist"
|
+----------+-------------+-----------+--------+---------+--------+----------------------------------+-------------------------------------------------------------------------------------------------+----------+
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
+----------+-------------+-----------+--------+---------+--------+----------------------------------+-------------------------------------------------------------------------------------------------+----------+
|
| 2 | system user | | NULL | Connect | 320720 | Waiting for master to send event | NULL | 0.000 |
|
| 3 | system user | | replicated_database1 | Connect | 46087 | After opening tables | COMMIT | 0.000 |
|
| 47887750 | mysql_user | localhost | replicated_database1 | Query | 46086 | Opening tables | SELECT user FROM users WHERE a='' AND b IN ('*', '') | 0.000 |
|
| 53170618 | root | localhost | NULL | Query | 0 | init | SHOW FULL processlist | 0.000 |
|
+----------+-------------+-----------+--------+---------+--------+----------------------------------+-------------------------------------------------------------------------------------------------+----------+
|
root@server ~ #
|
I can execute the SELECT without issues:
MariaDB [(none)]> use replicated_database1;
|
Database changed
|
MariaDB [replicated_database1]> SELECT user FROM users WHERE a='' AND b IN ('*', '')
|
Empty set (0.00 sec)
|
|
MariaDB [filter]>
|
Replication hangs on "COMMIT" with "After opening tables" again, the select query is on the same database this time. This specific server does NOT have millions of tables as in the other example, so apparently that's not a condition for this to occur. Running 10.0.29-MariaDB-1~sid.
|
|
A back trace from all threads might give a clue as to what lock is being hit
gdb --batch --eval-command="thread apply all bt" -p $(pgrep -xn mysqld)
|
|
|
I attached that already 
|
|
A potential theory from IRC:
(14:08:50) montywi: the issue is that ha_write_row hangs, which it should never be able to do
(14:09:31) montywi: could be a hang on disk
(14:10:06) montywi: except there is a pthrad_cond_wait on top of the trace. hm...
(14:12:08) montywi: dreas: ok, I have a quess what could be happening:
(14:12:37) montywi: - The open table cache gets full and it tries to eject a table that is in use (which shouldn't happen) which causes a deadlock
It was requested for me to apply a MariaDB debug binary to collect the required details. Will try and get that done.
|
|
Assigning to Monty to have a proper owner, but we are actually waiting for the feedback as mentioned above.
|
|
10.0.25 and later have a debug symbols package for a more descriptive backtrace https://downloads.mariadb.com/MariaDB/mariadb-10.0.30/repo/debian/pool/main/m/mariadb-10.0/
edit: appears to be only sid.
|
|
New backtrace attached. Unfortunately mysqld started "hanging" shortly afterwards and became unresponsive to "kill -6". In the end I had to force kill it with "kill -9" which didn't produce a core dump. I did get a core dump from another slave that had the exact same issue (same position), however not running with the special binary. Let me know if that should be uploaded, hopefully the new backtrace provides sufficient information.
|
|
I presume the relevant bit is:
Thread 5 (Thread 0x7f5d1fcb9700 (LWP 19725)):
|
#0 0x00007f5d3d58e18f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
|
#1 0x000055d7a7d935f9 in inline_mysql_cond_wait (src_file=0x55d7a80fee80 "/root/mariadb-10.0.29/storage/maria/ma_pagecache.c", src_line=1991, mutex=0x55d7a90ae0a8 <maria_pagecache_var+200>, that=0x7f5d080008c8) at /root/mariadb-10.0.29/include/mysql/psi/mysql_thread.h:1165
|
#2 find_block (pagecache=pagecache@entry=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=pageno@entry=1, init_hits_left=init_hits_left@entry=3, wrmode=1 '\001', block_is_copied=1 '\001', reg_req=1 '\001', page_st=0x7f5d1fcb4d10) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:1991
|
#3 0x000055d7a7d94d2f in pagecache_read (pagecache=0x55d7a90adfe0 <maria_pagecache_var>, file=file@entry=0x7f5d08011030, pageno=1, level=level@entry=3, buff=buff@entry=0x7f5d1fcb4e30 "", type=PAGECACHE_LSN_PAGE, lock=PAGECACHE_LOCK_WRITE, page_link=0x7f5d1fcb4dc0) at /root/mariadb-10.0.29/storage/maria/ma_pagecache.c:3391
|
#4 0x000055d7a7da9b03 in _ma_fetch_keypage (page=page@entry=0x7f5d1fcb77d0, info=info@entry=0x7f5d083464b8, keyinfo=keyinfo@entry=0x7f5d080113f8, pos=pos@entry=8192, lock=lock@entry=PAGECACHE_LOCK_WRITE, level=level@entry=3, buff=0x7f5d1fcb4e30 "", return_buffer=0 '\000') at /root/mariadb-10.0.29/storage/maria/ma_page.c:109
|
#5 0x000055d7a7dc4033 in _ma_ck_real_delete (info=info@entry=0x7f5d083464b8, key=key@entry=0x7f5d1fcb7da0, root=root@entry=0x7f5d1fcb7858) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:222
|
#6 0x000055d7a7dc4339 in _ma_ck_delete (info=info@entry=0x7f5d083464b8, key=0x7f5d1fcb7da0) at /root/mariadb-10.0.29/storage/maria/ma_delete.c:178
|
#7 0x000055d7a7dc1da0 in maria_delete (info=0x7f5d083464b8, record=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/storage/maria/ma_delete.c:88
|
#8 0x000055d7a7bc968f in handler::ha_delete_row (this=0x7f5d080d8880, buf=0x7f5d0803a480 "\377") at /root/mariadb-10.0.29/sql/handler.cc:6076
|
#9 0x000055d7a7b6ca38 in rpl_slave_state::record_gtid (this=0x55d7aa384f50, thd=0x7f5d08001278, gtid=gtid@entry=0x7f5d1fcb8610, sub_id=sub_id@entry=461048395, in_transaction=in_transaction@entry=true, in_statement=in_statement@entry=false) at /root/mariadb-10.0.29/sql/rpl_gtid.cc:659
|
#10 0x000055d7a7c6d868 in Query_log_event::do_apply_event (this=0x7f5d080a9718, rgi=0x7f5d08000990, query_arg=0x7f5d080f4c85 "COMMIT", q_len_arg=6) at /root/mariadb-10.0.29/sql/log_event.cc:4312
|
#11 0x000055d7a7a20345 in Log_event::apply_event (rgi=0x7f5d08000990, this=0x7f5d080a9718) at /root/mariadb-10.0.29/sql/log_event.h:1360
|
#12 apply_event_and_update_pos (ev=ev@entry=0x7f5d080a9718, thd=thd@entry=0x7f5d08001278, rgi=rgi@entry=0x7f5d08000990, rpt=rpt@entry=0x0) at /root/mariadb-10.0.29/sql/slave.cc:3300
|
#13 0x000055d7a7a229aa in exec_relay_log_event (serial_rgi=0x7f5d08000990, rli=0x55d7aa6e9b28, thd=0x7f5d08001278) at /root/mariadb-10.0.29/sql/slave.cc:3630
|
#14 handle_slave_sql (arg=0x55d7aa6e8160) at /root/mariadb-10.0.29/sql/slave.cc:4731
|
#15 0x00007f5d3d588464 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
|
#16 0x00007f5d3b9519df in clone () from /lib/x86_64-linux-gnu/libc.so.6
|
Would be great to know if this is sufficient information to solve the issue, or if a core-dump is still required.
|
|
monty, please see above.
|
|
Here is a patch that should fix the above hang (bug in aria page cache under very rare circumstances when things are evicted from table cache):
— a/storage/maria/ma_pagecache.c
+++ b/storage/maria/ma_pagecache.c
@@ -1975,8 +1975,8 @@ static PAGECACHE_BLOCK_LINK *find_block(PAGECACHE *pagecache,
}
else
{
- DBUG_ASSERT(hash_link->requests > 0);
- hash_link->requests--;
+ DBUG_ASSERT(block->hash_link == hash_link);
+ remove_reader(block);
KEYCACHE_DBUG_PRINT("find_block",
("request waiting for old page to be saved"));
{
See the attached 'patch' for a full patch for this and similar failures
|
|
Fixed and pushed into 10.0
|