[MDEV-32551] "Read semi-sync reply magic number error" warnings on master Created: 2023-10-23 Updated: 2024-02-05 Resolved: 2024-01-23 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.6 |
| Fix Version/s: | 10.6.17, 10.11.7, 11.0.5, 11.1.4, 11.2.3 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Michael Widenius | Assignee: | Michael Widenius |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Master and single semi-sync slave |
||
| Attachments: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
The user gets a lot of these errors randomly in the error log: "Read semi-sync reply magic number error" This is probably some missing or out of sync packet as part of semi sync replication. We should try to repeat this by have our QA and benchmark teams run tests on a master and semi-sync slave to try to repeat the issue. |
| Comments |
| Comment by Roel Van de Paar [ 2023-10-23 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
monty What is the environment please? | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Roel Van de Paar [ 2023-10-24 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Related MySQL bugs The more I review the issue and it's occurences, the more I am leaning towards believing this is caused by either reduced network stability or network congestion. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
@elenst reproduced the error. I followed the suit. A stored procedure to hit the error is here
Analysis of rr traces showed the ack thread received 4 bytes from the network layer to raise 1 byte, with the value of 1 as payload. The packet must've come from a slave server that aggressively reconnecting. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andrei Elkin [ 2023-11-06 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Here is the test that catches
e.g on
| ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Brandon Nesterenko [ 2023-11-07 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
I've got the magic number error definitively to be another case of 1. Master: `set rpl_semi_sync_master_enabled=1`
so now that we have disabled rpl_semi_sync_slave_enabled, get_slave_enabled() returns false, and we skip kill_connection. It is needed though, because of this later code snippet in the exit process (condensed):
where slave_stop references the first code snippet, which skips the kill_connection(mi->mysql) now leaving mysql in-tact. So because kill_connection was skipped, we go into the if (mysql) conditional, calling into mysql_close(mysql). The problem with mysql_close() on an active semi-sync connection (which `mysql` is) is that it will send a COM_QUIT packet over the net. So when the Ack_receiver sees the COM_QUIT, it results in the magic number error. 8. Master: Reads the COM_QUIT packet from the slave connection, but it isn't in the semi-sync protocol, and thereby issues the magic number error. Note that it is important the dump thread hasn't yet removed the slave from the Ack_receiver thread. This is frequently not the case, factoring into the sporadic failures. For the trace points (with the keyword magicerr) that prove that the slave sends `01 00 00 00 10`,to track when semi_sync_status is turned on for the slave, but by the time we close, _semi_sync_enabled is falsified, and so we never issue semisync_slave.kill_connection():
And a 10.4.32 stack trace showing that `mysql_close()` sends the COM_QUIT:
And a master-side trace that shows the error at the same timestamp, and that the slave is removed after the magic number error occurs (step 8):
A few potential avenues to fix this are MDEV-32385, to include COM_QUIT into the semi-sync protocol. Alternatively, the IO thread's check to call kill_connection(mysql) would need to be dependent on the true semi-sync status of the connection, rather than a dynamic global variable. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Michael Widenius [ 2023-11-27 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
My new semi-sync code will ensure that the above error will not happen. | ||||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Brandon Nesterenko [ 2024-01-08 ] | ||||||||||||||||||||||||||||||||||||||||||||||||
|
Ok to push |