[MDEV-12906] binlog_encryption.encrypted_master failed in buildbot - Failed to sync Created: 2017-05-24  Updated: 2021-06-04  Resolved: 2021-06-04

Status: Closed
Project: MariaDB Server
Component/s: Replication, Tests
Affects Version/s: 10.1, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.1.48, 10.2.38, 10.3.29, 10.4.19, 10.5.10, 10.6.1

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Brandon Nesterenko
Resolution: Not a Bug Votes: 0
Labels: None


 Description   

http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/10064/steps/test/logs/stdio

binlog_encryption.encrypted_master 'cbc,innodb_plugin,mix' w4 [ fail ]
        Test ended at 2017-05-22 10:54:18
 
CURRENT_TEST: binlog_encryption.encrypted_master
mysqltest: In included file "./include/sync_slave_sql_with_io.inc": 
included from /mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/suite/binlog_encryption/encrypted_master.test at line 158:
At line 49: Failed to sync slave SQL thread with slave IO thread.
 
The result from queries just before the failure was:
< snip >
master-bin.000005	740562	Gtid	1	740600	BEGIN GTID 0-1-139
master-bin.000005	740600	Annotate_rows	1	740702	DELETE FROM aria_table_name_to_encrypt ORDER BY int_column_name_to_encrypt LIMIT 10
master-bin.000005	740702	Table_map	1	740797	table_id: 80 (database_name_to_encrypt.aria_table_name_to_encrypt)
master-bin.000005	740797	Delete_rows_v1	1	740876	table_id: 80 flags: STMT_END_F
master-bin.000005	740876	Query	1	740965	COMMIT
master-bin.000005	740965	Gtid	1	741003	GTID 0-1-140
master-bin.000005	741003	Query	1	741128	use `database_name_to_encrypt`; ANALYZE TABLE myisam_table_name_to_encrypt
master-bin.000005	741128	Gtid	1	741166	GTID 0-1-141
master-bin.000005	741166	Query	1	741309	use `database_name_to_encrypt`; RENAME USER user_name_to_encrypt to new_user_name_to_encrypt
master-bin.000005	741309	Gtid	1	741347	GTID 0-1-142
master-bin.000005	741347	Query	1	741495	use `database_name_to_encrypt`; REVOKE ALL PRIVILEGES, GRANT OPTION FROM new_user_name_to_encrypt
master-bin.000005	741495	Gtid	1	741533	GTID 0-1-143
master-bin.000005	741533	Query	1	741654	DROP DATABASE database_name_to_encrypt
master-bin.000005	741654	Gtid	1	741692	GTID 0-1-144
master-bin.000005	741692	Query	1	741785	DROP USER new_user_name_to_encrypt
 
**** SHOW RELAYLOG EVENTS on server_1 ****
relaylog_name = 'No such row'
SHOW RELAYLOG EVENTS IN 'No such row';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
 
More results from queries before failure can be found in /mnt/data/buildot/maria-slave/work-opensuse-amd64/build/mysql-test/var/4/log/encrypted_master.log



 Comments   
Comment by Marko Mäkelä [ 2020-12-09 ]

I observed recent occurrences of this in a 10.5-based branch: https://buildbot.mariadb.org/#/builders/46/builds/1887/steps/8/logs/stdio

bb-10.5-MDEV-24369 3cf05b1e06328a6a97d977c60af43cbfd38ea744

binlog_encryption.encrypted_master 'cbc,innodb,mix' w4 [ retry-fail ]
        Test ended at 2020-12-09 05:37:05
CURRENT_TEST: binlog_encryption.encrypted_master
mysqltest: In included file "./include/sync_slave_sql_with_io.inc": 
included from /buildbot/debug-ubuntu-1804/build/mysql-test/suite/binlog_encryption/encrypted_master.test at line 160:
At line 49: Failed to sync slave SQL thread with slave IO thread.
The result from queries just before the failure was:
< snip >
master-bin.000005	114190	Annotate_rows	1	114292	DELETE FROM aria_table_name_to_encrypt ORDER BY int_column_name_to_encrypt LIMIT 10
master-bin.000005	114292	Table_map	1	114387	table_id: 82 (database_name_to_encrypt.aria_table_name_to_encrypt)
master-bin.000005	114387	Delete_rows_v1	1	114466	table_id: 82 flags: STMT_END_F
master-bin.000005	114466	Query	1	114555	COMMIT
master-bin.000005	114555	Gtid	1	114593	GTID 0-1-140
master-bin.000005	114593	Query	1	114718	use `database_name_to_encrypt`; ANALYZE TABLE myisam_table_name_to_encrypt
master-bin.000005	114718	Gtid	1	114756	GTID 0-1-141
master-bin.000005	114756	Query	1	114899	use `database_name_to_encrypt`; RENAME USER user_name_to_encrypt to new_user_name_to_encrypt
master-bin.000005	114899	Gtid	1	114937	GTID 0-1-142
master-bin.000005	114937	Query	1	115085	use `database_name_to_encrypt`; REVOKE ALL PRIVILEGES, GRANT OPTION FROM new_user_name_to_encrypt
master-bin.000005	115085	Gtid	1	115123	GTID 0-1-143
master-bin.000005	115123	Query	1	115244	DROP DATABASE database_name_to_encrypt
master-bin.000005	115244	Gtid	1	115282	GTID 0-1-144
master-bin.000005	115282	Query	1	115375	DROP USER new_user_name_to_encrypt

It occurs with both stmt and mix, as well as with both ctr and cbc combinations. The builder name is debug-ubuntu-1809, and it is running on Docker.

Comment by Marko Mäkelä [ 2020-12-09 ]

Buildbot cross-reference is also showing SIGSEGV in this test for various versions. Here is an example from a main branch:

10.2 450a5b33a22efc78dabc0df3f6b7544b

2020-05-20 22:00:13 70366467227648 [Note] Error reading relay log event: slave SQL thread was killed
2020-05-20 22:00:13 70366467227648 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000001' at position 369
2020-05-20 22:00:15 70366467227648 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000001' at position 369, relay log './slave-relay-bin.000002' position: 629
2020-05-20 22:00:17 70366467227648 [Note] Error reading relay log event: slave SQL thread was killed
2020-05-20 22:00:17 70366467227648 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000006' at position 427
200520 22:00:17 [ERROR] mysqld got signal 11 ;
#3  <signal handler called>
#4  inline_mysql_mutex_lock (src_line=2061, src_file=0x4a7b4bb8 "/home/buildbot/buildbot/build/mariadb-10.2.33/sql/rpl_rli.cc", that=0x400000000) at ./include/mysql/psi/mysql_thread.h:663
#5  rpl_group_info::unmark_start_commit (this=0x4ae2a740 <end_of_list>) at ./sql/rpl_rli.cc:2061
#6  0x000000004a061ec4 in ha_rollback_trans (thd=0x3fff08001298, all=<optimized out>) at ./sql/handler.cc:1679
#7  0x0000000049f61248 in trans_rollback (thd=0x3fff08001298) at ./sql/transaction.cc:415
#8  0x0000000049e0f744 in THD::cleanup (this=0x3fff08001298) at ./sql/sql_class.cc:1381
#9  0x0000000049e0f944 in THD::free_connection (this=0x3fff08001298) at ./sql/sql_class.cc:1449
#10 0x0000000049e10048 in THD::~THD (this=0x3fff08001298, __in_chrg=<optimized out>) at ./sql/sql_class.cc:1519
#11 0x0000000049e100b8 in THD::~THD (this=<optimized out>, __in_chrg=<optimized out>) at ./sql/sql_class.cc:1574
#12 0x0000000049dace48 in handle_slave_sql (arg=0x10015bbb8d0) at ./sql/slave.cc:5370
#13 0x000000004a68afc8 in pfs_spawn_thread (arg=0x3fff180061d8) at ./storage/perfschema/pfs.cc:1869
#14 0x00003fff7fb280fc in start_thread (arg=0x3fff78488000) at pthread_create.c:458
#15 0x00003fff7f291748 in clone () at ../sysdeps/unix/sysv/linux/powerpc/powerpc64/clone.S:85

Comment by Andrei Elkin [ 2020-12-09 ]

There seems to be multiple unrelated failures on the test. E.g X-ref witnesses (10.5 2020-11-03 19:53:29)

#2  0x000000004e1e7774 in handle_fatal_signal (sig=<optimized out>) at ./sql/signal_handler.cc:330
#3  <signal handler called>
#4  0x000000004e07c4d4 in ilink::unlink (this=0x10026e07bc8, this=0x10026e07bc8) at ./sql/sql_list.h:696
#5  ilink::~ilink (this=0x10026e07bc8, __in_chrg=<optimized out>) at ./sql/sql_list.h:707
#6  CONNECT::~CONNECT (this=0x10026e07bc8, __in_chrg=<optimized out>) at ./sql/sql_connect.h:51
#7  CONNECT::~CONNECT (this=0x10026e07bc8, __in_chrg=<optimized out>) at ./sql/sql_connect.h:51
#8  do_handle_one_connection (connect=0x10026e07bc8, put_in_cache=<optimized out>) at ./sql/sql_connect.cc:1436

Comment by Brandon Nesterenko [ 2021-06-01 ]

It looks like the buildbot failures from this test (along with other tests, e.g., encrypted_slave and encrypted_master_switch_to_unencrypted) are caused by the same issue of running out of disk space. Overall, this test needs about 180M of disk space to complete (tested on my laptop). The output resulting from the failure depends on when the VM runs out of disk space and potentially which thread catches the SIGABRT. The list following this paragraph describes the points that the disk can run out of space along with its corresponding failure output. Note, that some of the buildbot failures directly indicate that there is no more room left on the device, and these failures are not listed. To ensure the buildbot VMs have enough space to finish the test, is there a buildbot resource allocator/scheduler which can be adjusted?

==== Master Errors
The master/primary server can fill the disk while executing testdata.inc commands, which may lead to the following error scenarios:

1) 2013: Lost connection to MySQL server during query.

2) Timeout after 900s. More specifically, a GDB backtrace shows that the server can hang on the call to (in 10.5) wait_for_free_space (filename=0x7fcd980266f8 "<BUILD_DIR>/mysql-test/var/tmp/mysqld.1/#sql-temptable-16717-6-1a.MAD", errors=7) at <SRC_ROOT>/mysys/errors.C:126

==== Slave Errors
The slave/secondary server can fill the disk while executing the relay log in sync_slave_sql_with_io.inc, which may lead to the following error scenarios:

1)
At line 44: query 'if (`SELECT IFNULL($_sync_slave_sql_with_io_errno, -1) < 0`)
{' failed: 1064: You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ' -1) < 0' at line 1

2)
mysqltest: In included file "./include/show_rpl_debug_info.inc":
included from ./include/sync_slave_sql_with_io.inc at line 49:
included from <SRC_DIR>/mysql-test/suite/binlog_encryption/encrypted_master.test at line 160:
At line 75: Failed to flush '<BUILD_DIR>/mysql-test/var/log/encrypted_master.log', errno: 28

3) (If the SQL thread handles the SIGABRT)
included from <BUILD_DIR>/mysql-test/suite/binlog_encryption/encrypted_master.test at line 160:
At line 49: Failed to sync slave SQL thread with slave IO thread.

4) ‘2013: Lost connection to MySQL server during query’ with ABRT (signal 6) in a backtrace (not SQL thread).

I was not (yet) able to reproduce the SIGSEGV (signal 11) from within rpl_end.inc.

Comment by Brandon Nesterenko [ 2021-06-04 ]

The failures were due to a lack of space on the ramdisk of buildbot machines. TODO-2958 was created to increase the overall memory allocated to each machine, and thereby the ramdisk as well, as they are partitioned based on a percentage of the overall memory designated to the machine.

Generated at Thu Feb 08 08:01:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.