Details
- Type: Task
- Status: Stalled
- Priority: Minor
- Resolution: Unresolved
- Affects Version/s: None
- Fix Version/s: 10.0.30
Description
The initial bug MDEV-9573 was about a race condition / deadlock between STOP SLAVE and querying the information schema. There is a test case in the comments to the bug report, but the area is weak, poorly tested, and error-prone, so more testing was required.
The development tree: bb-10.0-monty
Bugfix testing revealed many other problems, so the patch became more complicated, and then even more tests were required.
The general stress test included (not every component was part of every test run):
- one master and one slave, MBR, GTID and non-GTID replication;
- flow on master: multi-thread DML or DML+DDL
- flow on slave: multi-thread mix of
CHANGE MASTER
RESET SLAVE [master name]
STOP SLAVE [master name]
START SLAVE [master name]
STOP ALL SLAVES
FLUSH [BINARY] LOGS
SHOW [GLOBAL] STATUS
SHOW STATUS LIKE '%slave%'
SHOW SLAVE [master name] STATUS
SHOW ALL SLAVES STATUS
SET @@default_master_connection=
SHOW MASTER STATUS
RESET MASTER
FLUSH TABLES WITH READ LOCK ; UNLOCK TABLES
- slave server restarts
Later additions to increase the coverage:
SET [GLOBAL|SESSION] gtid_slave_pos = ...
SET [GLOBAL|SESSION] slave_parallel_threads = ...
SET [GLOBAL|SESSION] slave_domain_parallel_threads = ...
SHOW VARIABLES LIKE '%slave%'
SHOW VARIABLES LIKE '%gtid%'
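For illustration, the slave-side mix above can be sketched as a driver that keeps issuing a random statement from the pool. This is a hypothetical Python sketch, not the actual RQG grammar used in the runs; the statement templates and the `next_statement` helper are illustrative only.

```python
import random

# Hypothetical sketch of the slave-side statement mix described above.
# "" stands for the empty master name; the quoted names are made up.
MASTER_NAMES = ["", "'m1'", "'m2'"]

STATEMENTS = [
    "CHANGE MASTER {m} TO MASTER_HOST='127.0.0.1'",
    "RESET SLAVE {m}",
    "STOP SLAVE {m}",
    "START SLAVE {m}",
    "STOP ALL SLAVES",
    "FLUSH BINARY LOGS",
    "SHOW GLOBAL STATUS",
    "SHOW STATUS LIKE '%slave%'",
    "SHOW SLAVE {m} STATUS",
    "SHOW ALL SLAVES STATUS",
    "SET @@default_master_connection = {m}",
    "SHOW MASTER STATUS",
    "RESET MASTER",
    "FLUSH TABLES WITH READ LOCK",
    "UNLOCK TABLES",
]

def next_statement(rng: random.Random) -> str:
    """Pick one template and substitute a random master name (or none)."""
    template = rng.choice(STATEMENTS)
    # Collapse extra whitespace left over when the name is empty.
    return " ".join(template.format(m=rng.choice(MASTER_NAMES)).split())

if __name__ == "__main__":
    rng = random.Random(42)
    for _ in range(5):
        print(next_statement(rng))
```

In the real runs each of several concurrent client threads drew from this pool independently, which is what produced combinations like STOP SLAVE racing with SHOW ALL SLAVES STATUS.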
Intermediate problems were reported in a non-conventional fashion (via emails, on IRC).
Here are some of them.
Reported via gist:
https://gist.github.com/elenst/30334ec2dbbd4bc181a471507bca2fc3
https://gist.github.com/elenst/616ba9127b1d85984f1d6773abbc712d
https://gist.github.com/elenst/e16c2e79fec9390c2a4676e21bb4f44d
https://gist.github.com/elenst/cf0d1562ae59b6d85e98806803e0f054
https://gist.github.com/elenst/32d2f0846828265271416f88521d353f
https://gist.github.com/elenst/6bb33e4487bc1987fe246e6308e5caf4
https://gist.github.com/elenst/271c4338e00cb059df5776a811b689cd
https://gist.github.com/elenst/e101bbafb826eb8b31fe19a830e93cf0
https://gist.github.com/elenst/64227c213f5a3d3c0dbd3efbeef73940
https://gist.github.com/elenst/f4c6e32a1ea53ca123e16410a59fb6c6
https://gist.github.com/elenst/1245ba1f8637937181337bb090b9d479
https://gist.github.com/elenst/7d9664a36e88aa010c0aeabecc1c5eb0
Reported by email, added to gist later:
https://gist.github.com/elenst/473ef51b352db5895e353d2748957477
https://gist.github.com/elenst/1e9faf028749b4b29ca06d84154201fd
https://gist.github.com/elenst/3bec270398eabad06d960e30daa33899
https://gist.github.com/elenst/434aefe3db1207874ebecb86a5d16cc7
https://gist.github.com/elenst/cd438e53575ec3a99ed946bc6be512a2
https://gist.github.com/elenst/6bea1969b873cd36a8c3d1e2638acc8a
https://gist.github.com/elenst/14b38cbb09e8b7291fac95844020c25c
https://gist.github.com/elenst/4a6fe321c579372c91e897b2fd2f509e
https://gist.github.com/elenst/58de8487600acdb0bd0baf92df65287c
https://gist.github.com/elenst/f175e41c3b3a98c8291731ce053c5782
https://gist.github.com/elenst/5bedce461e893c871d4c44253c4a2fd9
https://gist.github.com/elenst/22836bb18d907d3585df51615fed6966
https://gist.github.com/elenst/943ba7138a24a90ed6c062faca5e6e3f
Attachments
- coverage_dd45615d (81 kB)
- diff_dd45615d (89 kB)
- misses_dd45615d (6 kB)
Issue Links
- relates to MDEV-9573 'Stop slave' hangs on replication slave (Closed)
Activity
A less unrealistic test was performed on the same trees.
The master flow is the same general mix of DML/DDL.
On the slave, instead of 10 threads performing all described actions in a random manner (which meant that START SLAVE, STOP SLAVE, RESET SLAVE, CHANGE MASTER, etc. could all be executed at once), there are now only 4 threads, and each of them picks actions only from its own group:
thread1:
    SHOW VARIABLES LIKE '%gtid%'
    SHOW VARIABLES LIKE '%slave%'
    SELECT * FROM mysql.gtid_slave_pos
    SHOW STATUS
    SHOW STATUS LIKE '%slave%'
    SHOW GLOBAL STATUS
    SHOW SLAVE STATUS
    SHOW SLAVE master_name_or_empty STATUS
    SHOW ALL SLAVES STATUS

thread2:
    SET global_or_session gtid_slave_pos = mypos
    SET global_or_session slave_parallel_threads = _digit
    SET global_or_session slave_domain_parallel_threads = _digit
    SET GLOBAL log_warnings = _digit
    SET @@default_master_connection = master_name_or_empty
    SHOW VARIABLES LIKE '%gtid%'
    SHOW VARIABLES LIKE '%slave%'
    SELECT * FROM mysql.gtid_slave_pos
    SHOW STATUS
    SHOW STATUS LIKE '%slave%'
    SHOW GLOBAL STATUS
    SHOW SLAVE STATUS
    SHOW SLAVE master_name_or_empty STATUS
    SHOW ALL SLAVES STATUS

thread3:
    CHANGE MASTER master_name TO MASTER_HOST='127.0.0.1', MASTER_PORT=master_port, MASTER_USER='foo', MASTER_USE_GTID=use_gtid
    RESET SLAVE master_name all_or_not
    STOP SLAVE master_name_or_empty
    START SLAVE master_name_or_empty
    STOP ALL SLAVES
    START ALL SLAVES
    SHOW VARIABLES LIKE '%gtid%'
    SHOW VARIABLES LIKE '%slave%'
    SELECT * FROM mysql.gtid_slave_pos
    SHOW STATUS
    SHOW STATUS LIKE '%slave%'
    SHOW GLOBAL STATUS
    SHOW SLAVE STATUS
    SHOW SLAVE master_name_or_empty STATUS
    SHOW ALL SLAVES STATUS

thread4:
    RESET MASTER
    FLUSH BINARY LOGS
    FLUSH LOGS
    FLUSH TABLES WITH READ LOCK ; UNLOCK TABLES
    SHOW VARIABLES LIKE '%gtid%'
    SHOW VARIABLES LIKE '%slave%'
    SELECT * FROM mysql.gtid_slave_pos
    SHOW STATUS
    SHOW STATUS LIKE '%slave%'
    SHOW GLOBAL STATUS
    SHOW SLAVE STATUS
    SHOW SLAVE master_name_or_empty STATUS
    SHOW ALL SLAVES STATUS
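The grouping above can be sketched as follows. This is a hypothetical Python illustration, not the actual RQG grammar; the per-group statement lists are abbreviated, and the `pick` helper is made up.

```python
import random

# Illustrative sketch: each slave-side thread draws only from its own
# statement group, instead of all threads drawing from the full mix.
COMMON = [
    "SHOW VARIABLES LIKE '%gtid%'",
    "SHOW VARIABLES LIKE '%slave%'",
    "SELECT * FROM mysql.gtid_slave_pos",
    "SHOW STATUS",
    "SHOW GLOBAL STATUS",
    "SHOW ALL SLAVES STATUS",
]

GROUPS = {
    "thread1": COMMON,                       # read-only monitoring
    "thread2": COMMON + [                    # variable changes
        "SET GLOBAL slave_parallel_threads = 4",
        "SET GLOBAL gtid_slave_pos = ''",
    ],
    "thread3": COMMON + [                    # slave control
        "STOP SLAVE", "START SLAVE", "RESET SLAVE", "STOP ALL SLAVES",
    ],
    "thread4": COMMON + [                    # log / lock operations
        "RESET MASTER", "FLUSH BINARY LOGS",
        "FLUSH TABLES WITH READ LOCK", "UNLOCK TABLES",
    ],
}

def pick(thread: str, rng: random.Random) -> str:
    """A worker only ever picks from its own group."""
    return rng.choice(GROUPS[thread])

if __name__ == "__main__":
    rng = random.Random(7)
    for name in GROUPS:
        print(name, "->", pick(name, rng))
```

The point of the design is that, say, thread3 can still race STOP SLAVE against thread4's FLUSH TABLES WITH READ LOCK, but two CHANGE MASTER statements can no longer run concurrently with each other.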
22 test runs on each tree.
Results of the test above
First of all, I got a number of InnoDB failures of two kinds:
InnoDB: Error: Unable to read tablespace 57 page no 3 into the buffer pool after 100 attempts
InnoDB: Failing assertion: zip_size == fil_space_get_zip_size(space)
They happened on all trees, approximately equally, and are probably unrelated to the changes, so I won't count them in the results. They are to be handled separately and probably discussed with the InnoDB team.
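The exclusion can be expressed as a simple log-triage filter. This is a hedged sketch only; the helper names and the substring matching are illustrative, the real counting was done by inspecting the logs.

```python
# Known, unrelated InnoDB signatures to exclude from the per-tree totals
# (taken from the two failure kinds quoted above).
INNODB_SIGNATURES = (
    "InnoDB: Error: Unable to read tablespace",
    "InnoDB: Failing assertion: zip_size == fil_space_get_zip_size(space)",
)

def is_unrelated_innodb_failure(error_log: str) -> bool:
    """True if the log matches one of the known InnoDB failure patterns."""
    return any(sig in error_log for sig in INNODB_SIGNATURES)

def count_relevant_failures(logs: list) -> int:
    """Count failures after excluding the known InnoDB ones."""
    return sum(1 for log in logs if not is_unrelated_innodb_failure(log))
```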
10.0 debug build
Total number of recorded failures: 11
- 9 occurrences: crash in Master_info_index15get_master_info
- 2 occurrences: Assertion `!rli->slave_running' failed

bb-10.0-monty debug build
Total number of recorded failures: 0

10.0 release build
Total number of recorded failures: 5
- 5 occurrences: crash in Master_info_index15get_master_info

bb-10.0-monty release build
Total number of recorded failures: 3
- 2 occurrences: deadlock; processlist:
3 rqg localhost:52251 mysql Sleep 5284 NULL 0.000
11 root localhost:52259 NULL Query 5351 Killing slave STOP SLAVE '' /* QNO 520 CON_ID 11 */ 0.000
12 root localhost:52260 NULL Query 5294 Waiting for worker threads to pause for global read lock FLUSH TABLES WITH READ LOCK /* QNO 581 CON_ID 12 */ 0.000
108 system user NULL Connect 5373 Connecting to master NULL 0.000
109 system user NULL Connect 5462 Waiting for work from SQL thread NULL 0.000
110 system user NULL Connect 5462 closing tables NULL 0.000
111 system user NULL Connect 5373 Slave has read all relay log; waiting for the slave I/O thread to update it NULL 0.000
112 system user NULL Connect 5373 Waiting for master to send event NULL 0.000
113 system user NULL Connect 5293 Reading event from the relay log NULL 0.000
114 rqg localhost:52301 NULL Query 5283 Killing slave STOP SLAVE 0.000
116 root localhost:52410 NULL Query 0 init show processlist 0.000
- 1 occurrence: server hang, apparently on shutdown (although no shutdown in the error log); threads in:
Master_info::lock_slave_threads (kill server thread)
rpl_parallel_thread_pool::get_thread (do event)
terminate_slave_threads (stop all slaves)
New test run, bb-10.0-monty 5c1c0416487ad5cd764ae64c00b2c5fe046134d0
The same 2nd test as above, only on bb-10.0-monty debug and release.
Again, lots of InnoDB failures; they are not counted below.
bb-10.0-monty debug build
Total number of recorded failures: 1
- 1 occurrence:
170218 22:14:08 [Note] /home/elenst/git/bb-10.0-monty-gcov/sql/mysqld: Normal shutdown
...
safe_mutex: Trying to lock unitialized mutex at /home/elenst/git/bb-10.0-monty-gcov/sql/rpl_mi.cc, line 1484
170218 22:14:09 [ERROR] mysqld got signal 6 ;
"Reported" here: https://gist.github.com/elenst/68bb1cf9b69463e452efeb38cd16537b

bb-10.0-monty release build
Total number of recorded failures: 3
- 2 occurrences: hang
in stop_slave
in rpl_parallel_entry::choose_thread
in terminate_slave_threads
"Reported" here: https://gist.github.com/elenst/6a18ea9b187ec5d9e6d10c33f6cb277e
- 1 occurrence: hang on shutdown
in kill_server_thread
in show_heartbeat_period
New test run, bb-10.0-monty b9d79f0dacd669593777afdb5dcbaa8496c3cd64
The slave flow has been modified to avoid the InnoDB failures (successfully).
Simpler scenario
bb-10.0-monty debug build
Total number of recorded failures: 0

bb-10.0-monty release build
Total number of recorded failures: 1
- 1 occurrence:
#2 <signal handler called>
#3 0x00007fc90a442d4f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#4 0x00000000006a6a73 in inline_mysql_cond_wait (src_file=0xb5a008 "/home/elenst/git/bb-10.0-monty-rel/sql/rpl_parallel.cc", src_line=2155, mutex=0x7fc8de448470, that=0x7fc8de448510) at /home/elenst/git/bb-10.0-monty-rel/include/mysql/psi/mysql_thread.h:1165
#5 rpl_parallel::wait_for_done (this=this@entry=0x7fc8eba6c320, thd=<optimised out>, rli=rli@entry=0x7fc8eba699f8) at /home/elenst/git/bb-10.0-monty-rel/sql/rpl_parallel.cc:2155
#6 0x0000000000559e3a in handle_slave_sql (arg=arg@entry=0x7fc8eba68000) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:4754
Debug baseline: 26 crashes, 1 assertion failure
Release baseline: 12 crashes
More stressful scenario

bb-10.0-monty debug build
Total number of recorded failures: 9
- 3 occurrences: sql/rpl_parallel.cc:1307: void* handle_rpl_parallel_thread(void*): Assertion `e->pause_sub_id == (uint64)(9223372036854775807LL * 2ULL + 1) || e->last_committed_sub_id >= e->pause_sub_id' failed
- 2 occurrences: mysys/mf_format.c:34: fn_format: Assertion `name != ((void *)0)' failed
- 2 occurrences: safe_mutex: Trying to lock unitialized mutex at /home/elenst/git/bb-10.0-monty-gcov/sql/rpl_parallel.cc, line 2153
- 1 occurrence:
mysqld: /home/elenst/git/bb-10.0-monty-gcov/sql/rpl_parallel.cc:1855: rpl_parallel_thread* rpl_parallel_thread_pool::get_thread(rpl_parallel_thread**, rpl_parallel_entry*): Assertion `count > 0' failed.
170222 0:21:04 [ERROR] mysqld got signal 6 ;
- 1 occurrence: crash at sql/rpl_mi.cc:1052 (Master_info_index::init_all_master_info)

bb-10.0-monty release build
Total number of recorded failures: 3
- 1 occurrence: hang on shutdown
Master_info::lock_slave_threads
rpl_parallel_thread_pool::get_thread
in terminate_slave_threads
in start_slave
...
- 2 occurrences: crash
/home/elenst/git/bb-10.0-monty-rel/sql/mysqld(_ZN17Master_info_index20init_all_master_infoEv+0x19f)[0x6665ff]
Debug baseline: 29 crashes, 2 hangs, 1 assertion failure
Release baseline: 14 crashes, 2 hangs
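A side note on the pause_sub_id assertion quoted above: the constant 9223372036854775807LL * 2ULL + 1 is exactly 2**64 - 1, i.e. the maximum uint64 value, presumably used as an "unset" marker for pause_sub_id. This can be verified directly:

```python
# The magic constant from the rpl_parallel.cc assertion expands to
# UINT64_MAX (2**64 - 1). Python ints don't wrap, but the arithmetic
# already lands exactly on 2**64 - 1, so no masking is needed.
INT64_MAX = 9223372036854775807          # 2**63 - 1, i.e. LLONG_MAX
pause_unset = INT64_MAX * 2 + 1
assert pause_unset == 2**64 - 1
print(hex(pause_unset))                  # 0xffffffffffffffff
```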
An additional, shorter set (11 test runs), with the stressful scenario modified to avoid a syntax error during initial slave configuration.
bb-10.0-monty debug build
Total number of recorded failures: 7
- 4 occurrences: sql/rpl_parallel.cc:1307: void* handle_rpl_parallel_thread(void*): Assertion `e->pause_sub_id == (uint64)(9223372036854775807LL * 2ULL + 1) || e->last_committed_sub_id >= e->pause_sub_id' failed.
- 2 occurrences: safe_mutex: Trying to lock unitialized mutex at /home/elenst/git/bb-10.0-monty-gcov/sql/rpl_parallel.cc, line 2153
- 1 occurrence: sql/slave.cc:4476: void* handle_slave_sql(void*): Assertion `!rli->slave_running' failed

bb-10.0-monty release build
Total number of recorded failures: 2
- 1 occurrence:
#3 0x000000000054cca2 in io_slave_killed (mi=0x7f85ea662000) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:1013
#4 0x00000000005565b5 in handle_slave_io (arg=arg@entry=0x7f85ea662000) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:4049
- 1 occurrence:
#3 Log_event::Log_event (this=this@entry=0x7f31777fe820, buf=buf@entry=0x7f3178475009 "", description_event=description_event@entry=0x0) at /home/elenst/git/bb-10.0-monty-rel/sql/log_event.cc:898
#4 0x00000000007a8e8e in Rotate_log_event::Rotate_log_event (this=0x7f31777fe820, buf=0x7f3178475009 "", event_len=43, description_event=0x0) at /home/elenst/git/bb-10.0-monty-rel/sql/log_event.cc:6196
#5 0x0000000000552d3d in queue_event (mi=mi@entry=0x7f3187644000, buf=0x7f3178475009 "", event_len=43) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:5398
#6 0x0000000000556a33 in handle_slave_io (arg=arg@entry=0x7f3187644000) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:4146
bb-10.0-monty 04ae88bd1881a63b5fe2a7b4e27036ce75d04600
Run #13

debug build
Total number of recorded failures: 7
- 6 occurrences: sql/rpl_parallel.cc:1307: void* handle_rpl_parallel_thread(void*): Assertion `e->pause_sub_id == (uint64)(9223372036854775807LL * 2ULL + 1) || e->last_committed_sub_id >= e->pause_sub_id' failed
# Trials: 1, 4, 10, 12, 13, 21
- 1 occurrence: sql/rpl_parallel.cc:1869: rpl_parallel_thread* rpl_parallel_thread_pool::get_thread(rpl_parallel_thread**, rpl_parallel_entry*): Assertion `count > 0' failed
# Trials: 17
release build
Total number of recorded failures: 5
- 1 occurrence: hang on shutdown
#4 Master_info::lock_slave_threads (this=this@entry=0x7fa77766c000) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:238
#2 MYSQL_BIN_LOG::wait_for_update_relay_log (this=this@entry=0x7fa777652e50, thd=0x7fa767817008) at /home/elenst/git/bb-10.0-monty-rel/sql/log.cc:7717
#3 0x0000000000559508 in next_event (event_size=<synthetic pointer>, rgi=0x7fa767806600) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:6676
#2 rpl_parallel_thread_pool::get_thread (this=this@entry=0x12eb960 <global_rpl_thread_pool>, owner=owner@entry=0x7fa78003c5f0, entry=entry@entry=0x7fa78003c508) at /home/elenst/git/bb-10.0-monty-rel/sql/rpl_parallel.cc:1872
#3 0x00000000006a6a65 in rpl_parallel_entry::choose_thread (this=this@entry=0x7fa78003c508, rgi=rgi@entry=0x7fa780016600, did_enter_cond=did_enter_cond@entry=0x7fa78066d7df, old_stage=old_stage@
#2 terminate_slave_thread (thd=0x7fa78002f008, term_lock=0x7fa77766f238, term_cond=0x7fa77766f2a0, slave_running=0x7fa77766fc10, skip_lock=<optimised out>) at /home/elenst/git/bb-10.0-monty-rel/sql/slave.cc:774
# Trials: 13
- 1 occurrence:
170228 23:46:42 [ERROR] mysqld got signal 11 ;
sql/slave.cc:1013(io_slave_killed(Master_info*))[0x54ccc2]
sql/slave.cc:4075(handle_slave_io)[0x556958]
# Trials: 1
- 1 occurrence:
170301 2:45:19 [ERROR] mysqld got signal 11 ;
sql/slave.cc:1038(sql_slave_killed(rpl_group_info*))[0x54cafd]
sql/slave.cc:4712(handle_slave_sql)[0x559141]
# Trials: 9
- 1 occurrence: hang; processlist:
4 root localhost:33891 NULL Query 7617 checking permissions START SLAVE '' /* QNO 864 CON_ID 20 */ 0.000
3 system user NULL Connect 7729 Waiting for master to send event NULL 0.000
5 root localhost:33892 NULL Query 7702 checking permissions START SLAVE '' /* QNO 770 CON_ID 19 */ 0.000
6 rqg localhost:33893 mysql Sleep 7729 NULL 0.000
8 root localhost:33895 NULL Query 7610 Killing slave STOP ALL SLAVES /* QNO 797 CON_ID 21 */ 0.000
9 root localhost:33896 NULL Query 7684 Killing slave STOP ALL SLAVES /* QNO 963 CON_ID 26 */ 0.000
10 root localhost:33897 NULL Query 7715 checking permissions START SLAVE '' /* QNO 685 CON_ID 22 */ 0.000
11 root localhost:33898 NULL Query 7729 Killing slave STOP ALL SLAVES /* QNO 837 CON_ID 24 */ 0.000
12 root localhost:33900 NULL Query 7676 Killing slave STOP ALL SLAVES /* QNO 796 CON_ID 23 */ 0.000
13 root localhost:33901 NULL Query 7644 Killing slave STOP ALL SLAVES /* QNO 888 CON_ID 25 */ 0.000
14 root localhost:33902 NULL Query 7702 checking permissions START SLAVE '' /* QNO 740 CON_ID 18 */ 0.000
16 system user NULL Connect 7689 Waiting for work from SQL thread NULL 0.000
17 system user NULL Connect 7689 Waiting for work from SQL thread NULL 0.000
18 system user NULL Connect 7689 Waiting for work from SQL thread NULL 0.000
19 system user NULL Connect 7707 Waiting for work from SQL thread NULL 0.000
20 system user NULL Connect 7709 Waiting for work from SQL thread NULL 0.000
21 system user NULL Connect 7689 Waiting for work from SQL thread NULL 0.000
22 system user NULL Connect 7689 Waiting for work from SQL thread NULL 0.000
23 rqg localhost:33905 NULL Query 7729 Killing slave STOP SLAVE 0.000
24 system user NULL Connect 7689 Waiting for work from SQL thread NULL 0.000
25 system user NULL Connect 7582 Slave has read all relay log; waiting for the slave I/O thread to update it NULL 0.000
31 root localhost:33941 NULL Query 0 init show processlist 0.000
# Trials: 4
- 1 occurrence:
170301 2:23:28 [ERROR] mysqld got signal 11 ;
sql/log_event.cc:1599(Log_event::read_log_event(char const*, unsigned int, char const**, Format_description_log_event const*, char))[0x7ad0c5]
sql/slave.cc:5559(queue_event(Master_info*, char const*, unsigned long))[0x553205]
# Trials: 5
Please take a look to see if there is anything you want/need. If not, feel free to close.
As of 2017-02-16, the head of the tree is dd45615d9c4c7982880c510fe094589acfad8f87
The cumulative patch is attached as diff_dd45615d. It includes some unrelated fixes, but those are small, the main part is about replication.
The coverage report for dd45615d9c4c7982880c510fe094589acfad8f87 is attached as coverage_dd45615d and misses_dd45615d
The same set of tests was run on 4 builds (10.0 and bb-10.0-monty, each in debug and release variants).
The goal was to understand whether the bb-10.0-monty tree provides a quality improvement.
The test set contained 28 test runs in total. The DML/DDL flow on the master was uncomplicated, so in all cases it went all right, as expected. Crashes and deadlocks on slaves were monitored. Since the slave is restarted multiple times during each test run, the total number of failures can exceed the total number of test runs.
Results:
10.0 debug build
Total number of recorded failures: 26
- 5 occurrences: sql/slave.cc:4475: void* handle_slave_sql(void*): Assertion `!rli->slave_running' failed
- 19 occurrences:
170215 1:28:35 [Note] /home/elenst/git/10.0-gcov/sql/mysqld: Normal shutdown
...
170215 1:28:37 [ERROR] mysqld got signal 11 ;
...
Master_info_index15get_master_info
- 1 occurrence: sql/slave.cc:2338: void write_ignored_events_info_to_relay_log(THD*, Master_info*): Assertion `thd == mi->io_thd' failed
- 1 occurrence: hang in terminate_slave_thread
bb-10.0-monty debug build
Total number of recorded failures: 12
- 3 occurrences: mysys/mf_format.c:34: fn_format: Assertion `name != ((void *)0)' failed
- 3 occurrences: sql/sql_parse.cc:3167: int mysql_execute_command(THD*): Assertion `res == 0 || thd->get_stmt_da()->is_error()' failed
- 5 occurrences: server hang
Example of processlist:
16 root localhost:47568 NULL Query 1587 Flushing relay log and master info repository. STOP ALL SLAVES /* QNO 752 CON_ID 9 */ 0.000
18 rqg localhost:47570 mysql Sleep 1588 NULL 0.000
19 root localhost:47571 NULL Query 1586 Filling schema table SHOW GLOBAL STATUS /* QNO 751 CON_ID 12 */ 0.000
20 root localhost:47572 NULL Query 1587 Flushing relay log and master info repository. STOP ALL SLAVES /* QNO 795 CON_ID 11 */ 0.000
21 root localhost:47573 NULL Query 1586 Filling schema table SHOW GLOBAL STATUS /* QNO 661 CON_ID 15 */ 0.000
22 root localhost:47574 NULL Query 1585 init SHOW SLAVE STATUS /* QNO 749 CON_ID 16 */ 0.000
23 root localhost:47575 NULL Query 1586 Waiting while replication worker thread pool is busy FLUSH TABLES WITH READ LOCK /* QNO 830 CON_ID 8 */ 0.000
24 root localhost:47576 NULL Query 1587 Waiting while replication worker thread pool is busy START ALL SLAVES /* QNO 723 CON_ID 13 */ 0.000
25 root localhost:47577 NULL Query 1586 init SHOW ALL SLAVES STATUS /* QNO 778 CON_ID 14 */ 0.000
26 root localhost:47578 NULL Query 1585 init SHOW ALL SLAVES STATUS /* QNO 843 CON_ID 10 */ 0.000
27 root localhost:47583 NULL Query 1586 Waiting while replication worker thread pool is busy FLUSH TABLES WITH READ LOCK /* QNO 874 CON_ID 7 */ 0.000
29 rqg localhost:47587 NULL Query 1586 init CHANGE MASTER TO MASTER_PORT = 13000, MASTER_HOST = '127.0.0.1', MASTER_USER = 'root', MASTER_LOG_FILE = '', MASTER_LOG_POS = 0, MASTER_USE_GTID = slave_pos, MASTER_CONNECT_RETRY = 1 0.000
33 system user NULL Connect 1586 Waiting for slave mutex on exit NULL 0.000
34 system user NULL Connect 1586 Waiting for master to send event NULL 0.000
36 root localhost:47919 NULL Query 0 init show full processlist 0.000
- 1 occurrence:
170213 16:59:07 [ERROR] mysqld got signal 11 ;
...
rpl_slave_state_tostring_helper
10.0 release build
Total number of recorded failures: 22
- 18 occurrences:
170214 18:37:35 [Note] /home/elenst/git/10.0-rel/sql/mysqld: Normal shutdown
...
170214 18:37:37 [ERROR] mysqld got signal 11 ;
...
Master_info_index15get_master_info
- 4 occurrences: hang in inline_mysql_mutex_lock
bb-10.0-monty release build
Total number of recorded failures: 4
- 3 occurrences: hang on server shutdown
Example of thread activity:
2) in slave_prepare_for_shutdown
3) in rpl_parallel_entry::choose_thread
4) in Master_info_index::stop_all_slaves
5) in stop_slave
6) in stop_slave
7) in Master_info_index::stop_all_slaves
8) in start_slave
- 1 occurrence:
170213 10:32:52 [ERROR] mysqld got signal 11 ;
...
sql_slave_killed