[MDEV-10644] One of parallel replication threads remains active after STOP SLAVE SQL_THREAD completes - Jira

XML

Word

Printable

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Critical
Resolution: Duplicate
Affects Version/s: 10.0.26
Fix Version/s: 10.0.30, 10.1.22
Component/s: Replication
Labels:
None

Description

Strange problem (somewhat similar to ~~MDEV-8039~~ and maybe a regression related to the fix for ~~MDEV-7818~~) happened recently on 10.0.26 (that never happened on 10.0.19).

There was a long running transaction and eventually all parallel slave threads became busy, like this:

1077897 system user NULL Connect 79 Waiting for prior transaction to start commit before starting next transaction NULL 0.000

... a lot of threads like the one above

1077975 system user NULL Connect 17128 Waiting for prior transaction to commit NULL 0.000

... a lot of threads like the one above

1077976 system user ppc_agg Connect 17269 reading file LOAD DATA INFILE '../tmp/SQL_LOAD-187214027-188093007-114246.data' IGNORE...

1077977 system user NULL Connect 76 Waiting for room in worker thread event queue NULL 0.000

1077998 root localhost NULL Query 68 Killing slave STOP SLAVE SQL_THREAD 0.000

1077999 root localhost NULL Query 67 Filling schema table select VARIABLE_NAME as NAME, VARIABLE_VALUE as VALUE from

...

So, eventually thread 1077977 had to wait for room in worker thread event queue with the following stack trace:

Thread 2 (Thread 0x7f0647cf6700 (LWP 61351)):

#0 0x00007f2cd6fac68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x00000000006e619c in rpl_parallel_entry::choose_thread(rpl_group_info*, bool*, PSI_stage_info_v1*, bool) ()

#2 0x00000000006e7024 in rpl_parallel::do_event(rpl_group_info*, Log_event*, unsigned long long) ()

#3 0x0000000000571e71 in handle_slave_sql ()

#4 0x0000000000a84dfd in pfs_spawn_thread ()

#5 0x00007f2cd6fa8aa1 in start_thread () from /lib64/libpthread.so.0

#6 0x00007f2cd56c193d in clone () from /lib64/libc.so.6

Then STOP SLAVE SQL_THREAD was executed, and it had to wait for LOAD DATA to complete:

Thread 228 (Thread 0x7f12a24f6700 (LWP 50401)):

#0 0x00007f2cd6faca5e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x000000000056b26f in terminate_slave_thread ()

#2 0x000000000056fec5 in terminate_slave_threads(Master_info*, int, bool) ()

#3 0x00000000005fce7f in stop_slave(THD*, Master_info*, bool) ()

#4 0x00000000005e17a5 in mysql_execute_command(THD*) ()

#5 0x00000000005e55f7 in mysql_parse ()

#6 0x00000000005e7a5d in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()

#7 0x00000000006a6013 in do_handle_one_connection(THD*) ()

#8 0x00000000006a60e2 in handle_one_connection ()

#9 0x0000000000a84dfd in pfs_spawn_thread ()

#10 0x00007f2cd6fa8aa1 in start_thread () from /lib64/libpthread.so.0

#11 0x00007f2cd56c193d in clone () from /lib64/libc.so.6

Soon after that SELECT from I_S.GLOBAL_STATUS etc was executed and (as ~~MDEV-8039~~ describes) it had to wait on LOCK_active_mi to get Slave_received_heartbeats value:

Thread 219 (Thread 0x7f1182d88700 (LWP 52667)):

#0  0x00007f2cd6faf334 in __lll_lock_wait () from /lib64/libpthread.so.0

#1  0x00007f2cd6faa60e in _L_lock_995 () from /lib64/libpthread.so.0

#2  0x00007f2cd6faa576 in pthread_mutex_lock () from /lib64/libpthread.so.0

#3  0x00000000005557f2 in show_heartbeat_period(THD*, st_mysql_show_var*, char*) ()

#4  0x00000000006380ec in show_status_array(THD*, char const*, st_mysql_show_var*, enum_var_type, system_status_var*, char const*, TABLE*, bool, Item*) ()

#5  0x0000000000642ff9 in fill_status(THD*, TABLE_LIST*, Item*) ()

#6  0x00000000006463fe in get_schema_tables_result(JOIN*, enum_schema_table_state) ()

#7  0x000000000063010d in JOIN::exec_inner() ()

#8  0x00000000006322a1 in JOIN::exec() ()

#9  0x000000000066f0eb in st_select_lex_unit::exec() ()

#10 0x000000000067091e in mysql_union(THD*, LEX*, select_result*, st_select_lex_unit*, unsigned long) ()

#11 0x00000000006323f5 in handle_select(THD*, LEX*, select_result*, unsigned long) ()

#12 0x00000000005d8536 in execute_sqlcom_select(THD*, TABLE_LIST*) ()

#13 0x00000000005e3ad7 in mysql_execute_command(THD*) ()

#14 0x00000000005e55f7 in mysql_parse ()

#15 0x00000000005e7a5d in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()

#16 0x00000000006a6013 in do_handle_one_connection(THD*) ()

#17 0x00000000006a60e2 in handle_one_connection ()

#18 0x0000000000a84dfd in pfs_spawn_thread ()

#19 0x00007f2cd6fa8aa1 in start_thread () from /lib64/libpthread.so.0

#20 0x00007f2cd56c193d in clone () from /lib64/libc.so.6

All other later SHOW GLOBAL STATUS etc calls also had to wait.

Soon LOAD DATA completed, STOP SLAVE SQL_THREAD also completed successfully and we ended up with the following state:

1077977 system user NULL Connect 166 Waiting while replication worker thread pool is busy NULL 0.000

1077998 root localhost NULL Query 13 Waiting for table flush FLUSH TABLES WITH READ LOCK 0.000

1077999 root localhost NULL Query 157 Filling schema table select VARIABLE_NAME as NAME, VARIABLE_VALUE as VALUE from

...

Thread 1077977 had the following stack trace at the moment:

Thread 2 (Thread 0x7f0647cf6700 (LWP 61351)):

#0 0x00007f2cd6fac68c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0

#1 0x00000000006e2133 in pool_mark_busy(rpl_parallel_thread_pool*, THD*) ()

#2 0x00000000006e2186 in rpl_parallel_change_thread_count(rpl_parallel_thread_pool*, unsigned int) ()

#3 0x0000000000570c4b in handle_slave_sql ()

#4 0x0000000000a84dfd in pfs_spawn_thread ()

#5 0x00007f2cd6fa8aa1 in start_thread () from /lib64/libpthread.so.0

#6 0x00007f2cd56c193d in clone () from /lib64/libc.so.6

So, one of parallel threads that was in "Waiting for room in worker thread event queue" state was NOT killed while slave was stopped, it's current state changed to "Waiting while replication worker thread pool is busy" (and it probably still hold the LOCK_active_mi mutex somehow as SHOW or SELECT from I_S.GLOBAL_STATUS queries are blocked in "Filling schema table" status).

SELECT mentioned above accessed one real table via UNION, so further FTWRL call was also blocked (as you can see) and this caused problems for the next statements (obviously).

Questions are:

1. Why STOP SLAVE SQL_THREAD eventually completed but queries waiting on LOCK_active_mi remained blocked? Is it the same case as ~~MDEV-8039~~ essentially?

2. Why STOP SLAVE SQL_THREAD eventually completed but one of parallel slave threads remained active? Is it a new bug/regression of ~~MDEV-7818~~?

Attachments

Issue Links

is part of

MDEV-9573 'Stop slave' hangs on replication slave

Closed

Activity

People

Assignee:: Lixun Peng

Reporter:: Valerii Kravchuk

Votes:: 1 Vote for this issue

Watchers:: 10 Start watching this issue

Dates

Created:: 2016-08-22 12:06

Updated:: 2020-08-25 16:34

Resolved:: 2017-05-22 12:55

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.