[MDEV-32551] "Read semi-sync reply magic number error" warnings on master Created: 2023-10-23  Updated: 2024-02-05  Resolved: 2024-01-23

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.6
Fix Version/s: 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3

Type: Bug Priority: Critical
Reporter: Michael Widenius Assignee: Michael Widenius
Resolution: Fixed Votes: 0
Labels: None
Environment:

Master and single semi-sync slave


Attachments: File skr.cnf     File skr.test    
Issue Links:
Problem/Incident
causes MDEV-32940 Semi-sync Slave Falsely Logs that Mas... Closed
causes MDEV-32943 Stopping Semi-sync Slave Reports that... Closed
causes MDEV-32944 Rpl_semi_sync_slave_enabled sysvar Sh... Closed
causes MDEV-32945 Rpl_semi_sync_master_status is ON Wit... Open
causes MDEV-32947 Semi-sync with Multi-source Checksum ... Open
causes MDEV-32960 Semi-sync ACKed Transaction can Timeo... Open
causes MDEV-32961 Semi-sync Primary Segfaults on Net Error Closed
causes MDEV-33331 IO Thread Relay Log Inconsistent Stat... Open
Relates
relates to MDEV-11853 semisync thread can be killed after s... Closed
relates to MDEV-24773 slave_compressed_protocol doesn't wor... Closed
relates to MDEV-28114 Semi-sync Master ACK Receiver Thread ... Closed
relates to MDEV-28141 Slave crashes with Packets out of ord... Open
relates to MDEV-28306 Slave I/O: Fatal error: Failed to run... Open
relates to MDEV-29369 rpl.rpl_semi_sync_shutdown_await_ack ... In Review
relates to MDEV-32365 detailize the semisync replication ma... Closed
relates to MDEV-32385 Semi-Sync Ack_Receiver Thread Should ... Open
relates to MDEV-32792 Second Semi-sync Replica Can Hang at ... Open
relates to MDEV-33327 rpl_seconds_behind_master_spike Sensi... Closed
relates to MXS-4792 Semi-sync replication through MaxScal... Closed
relates to MXS-4793 Unknown command errors when semi-sync... Closed

 Description   

The user gets a lot of these errors randomly in the error log:

"Read semi-sync reply magic number error"

This is probably caused by a missing or out-of-sync packet in the semi-sync replication protocol.

We should try to reproduce this by having our QA and benchmark teams run tests on a master with a semi-sync slave.



 Comments   
Comment by Roel Van de Paar [ 2023-10-23 ]

monty What is the environment please?
Is slave_compressed_protocol=1 ?

Comment by Roel Van de Paar [ 2023-10-24 ]

Related MySQL bugs
https://bugs.mysql.com/bug.php?id=85382 - Not applicable as slave_compressed_protocol is OFF (confirmed)
https://bugs.mysql.com/bug.php?id=86230 - IDEM
https://bugs.mysql.com/bug.php?id=99628 - Elkin please review
https://bugs.mysql.com/bug.php?id=105395 - Interesting as there seem to be network issues around the same time as the error
https://bugs.mysql.com/bug.php?id=101056 - Interesting given the connection with net timeouts

The more I review the issue and its occurrences, the more I am leaning towards believing this is caused by either reduced network stability or network congestion.

Comment by Andrei Elkin [ 2023-11-06 ]

@elenst reproduced the error. I followed suit. A stored procedure that hits the error is here:

delimiter //
create procedure p1(i int)
begin
  declare k int default 0;
  while k < i do
    select sleep(0.1);
    SET GLOBAL rpl_semi_sync_slave_enabled= OFF;
    STOP SLAVE IO_THREAD;
    select sleep(0.001);
    START SLAVE IO_THREAD;
    SET GLOBAL rpl_semi_sync_slave_enabled= ON;
    set k=k+1;
  end while;
end//

Analysis of rr traces showed the ack thread received a 4-byte header from the network layer followed by a 1-byte payload with the value 1. The packet must have come from a slave server that was aggressively reconnecting.
I did not succeed in identifying when/if it is sent by a slave thread.
Perhaps this can be done without tcpdump-like sniffers, but I would resume the analysis with one.
monty ^

Comment by Andrei Elkin [ 2023-11-06 ]

Here is the test that catches the error:

 2023-11-06 19:26:18 10 [ERROR] Read semi-sync reply magic number error

e.g. on

> show global variables  like 'version';
+---------------+---------------------------+
| Variable_name | Value                     |
+---------------+---------------------------+
| version       | 10.4.32-MariaDB-debug-log |
+---------------+---------------------------+

--source include/have_debug.inc
--source include/master-slave.inc
 
--connection slave
--source include/stop_slave.inc
 
--connection master
set global rpl_semi_sync_master_enabled = ON;
 
 
--connection slave
delimiter //;
create procedure p1(i int)
begin
  declare k int default 0;
  while k < i do
    select sleep(0.1);
    SET GLOBAL rpl_semi_sync_slave_enabled= OFF;
    STOP SLAVE IO_THREAD;
    select sleep(0.001);
    START SLAVE IO_THREAD;
    SET GLOBAL rpl_semi_sync_slave_enabled= ON;
    set k=k+1;
  end while;
end//
delimiter ;//
 
# set br https://github.com/MariaDB/server/blob/10.4/sql/semisync_master.cc#L572
--connection slave1
--send call p1(3000)
 
--connection slave
--send call p1(3000)
 
--connection master
--echo "set br https://github.com/MariaDB/server/blob/10.4/sql/semisync_master.cc#L572"
select sleep(10000);
 
--source include/rpl_end.inc

Comment by Brandon Nesterenko [ 2023-11-07 ]

I've definitively determined the magic number error to be another case of MDEV-28114. I added trace DBUG_PRINT()s to show that the slave IO thread sends the packet `01 00 00 00 01` on the active semi-sync connection during its exit/cleanup, and that the primary's `Ack_receiver` reads that packet at the same time. In more detail, the following steps need to occur for this to happen (I'll spend tomorrow making an MTR test to explicitly do it):

1. Master: `set rpl_semi_sync_master_enabled=1`
2. Slave: `set rpl_semi_sync_slave_enabled=1`
3. Slave: Starts IO thread, thus forming an active semi-sync connection (i.e. `rpl_semi_sync_slave_enabled=1` AND `rpl_semi_sync_slave_status=1`)
4. Slave: Requests binlog dump from master
5. Master: Sees the binlog dump request, and sees that semi-sync is desired, so the slave is added to the `Ack_receiver` thread to listen for slave `ACK`s
6. Slave: `set rpl_semi_sync_slave_enabled=0`
7. Slave: Stops the IO thread. Because `rpl_semi_sync_slave_enabled=0`, this now skips a needed part of the current semi-sync exit protocol (creating a separate connection to kill the active semi-sync connection), even though the connection is actually semi-sync active. Looking into the exit code, we have this function snippet:

int Repl_semi_sync_slave::slave_stop(Master_info *mi)
{
  if (get_slave_enabled())
    kill_connection(mi->mysql);
  ...
}

so now that we have disabled rpl_semi_sync_slave_enabled, get_slave_enabled() returns false, and we skip kill_connection(). It is needed, though, because of this later code snippet in the exit process (condensed):

  repl_semisync_slave.slave_stop(mi);
  ...
  if (mysql)
  {
    ...
    // We end up here, calling mysql_close(mysql), which sends COM_QUIT on
    // the mysql connection, an active semi-sync connection. When the ACK
    // thread sees that COM_QUIT, the magic number error results.
    mysql_close(mysql);
    mi->mysql=0;
  }

Here, slave_stop refers to the first code snippet, which skipped kill_connection(mi->mysql), leaving mysql intact. Because kill_connection was skipped, we enter the if (mysql) conditional and call mysql_close(mysql). The problem with calling mysql_close() on an active semi-sync connection (which `mysql` is) is that it sends a COM_QUIT packet over the net. So when the Ack_receiver sees the COM_QUIT, the magic number error results.

8. Master: Reads the COM_QUIT packet from the slave connection, but COM_QUIT isn't part of the semi-sync protocol, so it issues the magic number error. Note that it is important that the dump thread hasn't yet removed the slave from the Ack_receiver thread; this is frequently not the case, which accounts for the sporadic nature of the failures.

Here are the trace points (with the keyword magicerr) proving that the slave sends `01 00 00 00 01`. They track when semi-sync status is turned on for the slave; by the time we close, _semi_sync_enabled has been falsified, and so we never call semisync_slave.kill_connection():

T@59   :    2: request_dump: magicerr: semisync slave status is ON!
T@59   :    2: request_dump: magicerr: sending COM_BINLOGDUMP
T@59   :    4: net_write_command: data_written: Memory: 0x1522dd0aa813  Bytes: (5)
1C 00 00 00 12
T@59   :    4: net_write_command: data_written: Memory: 0x1522dd0aa940  Bytes: (27)
49 01 00 00 02 00 02 00 00 00 6D 61 73 74 65 72 2D 62 69 6E 2E 30 30 30 30 30
31
T@59   :    6: net_real_write: data_written: Memory: 0x152288012bd8  Bytes: (32)
1C 00 00 00 12 49 01 00 00 02 00 02 00 00 00 6D 61 73 74 65 72 2D 62 69 6E 2E
30 30 30 30 30 31
T@59   :    1: handle_slave_io: magicerr: Request_dump succeeded
2023-11-07 20:08:09 50 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 329
2023-11-07 20:08:09 50 [Note] master was 127.0.0.1:16000
T@59   :    1: handle_slave_io: magicerr: In semisync_slave::slave_stop()
T@59   :    1: handle_slave_io: magicerr: Ending slave_stop()
T@59   :    1: handle_slave_io: magicerr: check for mysql before close()
T@59   :    1: handle_slave_io: magicerr: calling mysql_close()
T@59   :    4: net_write_command: data_written: Memory: 0x1522dd0aaa63  Bytes: (5)
01 00 00 00 01
T@59   :    4: net_write_command: data_written: Memory: (nil)  Bytes: (0)
 
T@59   :    6: net_real_write: data_written: Memory: 0x152288012bd8  Bytes: (5)
01 00 00 00 01

And a 10.4.32 stack trace showing that `mysql_close()` sends the COM_QUIT:

#0  net_write_command (net=0x7f7ea0017138, command=1 '\001', header=0x0, head_len=0, packet=0x0, len=0) at /home/brandon/workspace/server/sql/net_serv.cc:503
#1  0x00005613bc91c2aa in cli_advanced_command (mysql=0x7f7ea0017138, command=COM_QUIT, header=0x0, header_length=0, arg=0x0, arg_length=0, skip_check=1 '\001', stmt=0x0) at /home/brandon/workspace/server/sql-common/client.c:503
#2  0x00005613bc9233b4 in mysql_close_slow_part (mysql=0x7f7ea0017138) at /home/brandon/workspace/server/sql-common/client.c:3428
#3  0x00005613bc9232c2 in server_mysql_close (mysql=0x7f7ea0017138) at /home/brandon/workspace/server/sql-common/client.c:3440
#4  0x00005613bc3e2bd0 in handle_slave_io (arg=0x5613c117d650) at /home/brandon/workspace/server/sql/slave.cc:5155
#5  0x00005613bcd3655f in pfs_spawn_thread (arg=0x7f7ea801d3a8) at /home/brandon/workspace/server/storage/perfschema/pfs.cc:2201
#6  0x00007f7f08a94ac3 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#7  0x00007f7f08b26a40 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

And a master-side trace that shows the error at the same timestamp, and that the slave is removed after the magic number error occurs (step 8):

2023-11-07 20:08:09 10 [ERROR] Read semi-sync reply magic number error
T@17   :    4: Ack_receiver::remove_slave: magicerr: Ack_receiver removing slave

A few potential avenues to fix this: MDEV-32385, which would include COM_QUIT in the semi-sync protocol; alternatively, the IO thread's decision to call kill_connection(mysql) could depend on the true semi-sync status of the connection rather than on a dynamic global variable.

Comment by Michael Widenius [ 2023-11-27 ]

My new semi-sync code will ensure that the above error will not happen.
Just waiting for review so that I can push

Comment by Brandon Nesterenko [ 2024-01-08 ]

Ok to push

Generated at Thu Feb 08 10:32:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.