[MDEV-32961] Semi-sync Primary Segfaults on Net Error
Created: 2023-12-06  Updated: 2023-12-07  Resolved: 2023-12-07

Status: Closed
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.6
Fix Version/s: N/A

Type: Bug
Priority: Major
Reporter: Brandon Nesterenko
Assignee: Michael Widenius
Resolution: Fixed
Votes: 0
Labels: MDEV-32551-test

Issue Links:
Problem/Incident
is caused by MDEV-32365 detailize the semisync replication ma... Closed
is caused by MDEV-32551 "Read semi-sync reply magic number er... Closed

 Description   

Since MDEV-32365 was merged into bb-10.6-semisync, reporting a net error results in a segfault. The slave object is deleted by it.remove(), and is then dereferenced to read its server_id when the error is reported (a use-after-free). See the following code snippet:

        else if (net.last_errno == ER_NET_READ_ERROR)
        {
          it.remove();
          if (net.last_errno > 0 && global_system_variables.log_warnings > 2)
            sql_print_warning("Semisync ack receiver got error %d \"%s\" "
                              "from slave server-id %d",
                              net.last_errno, ER_DEFAULT(net.last_errno),
                              slave->server_id());
        }

The statement it.remove(); should happen after the warning is issued.
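
The reordering suggested above can be sketched as follows. This is a minimal, standalone illustration, not the actual patch pushed to bb-10.6-semisync: the Slave struct, the std::list container, and the report_and_remove() helper are hypothetical stand-ins for the server's real types, and erasing a unique_ptr from the list is only assumed to model what it.remove() does to the slave object. The value 1158 for ER_NET_READ_ERROR is taken from the log line quoted in this report.

```cpp
#include <cassert>
#include <cstdio>
#include <list>
#include <memory>
#include <string>

// Hypothetical stand-in for the server's slave object (not the MariaDB class).
struct Slave {
  int id;
  int server_id() const { return id; }
};

// Error code as it appears in the quoted log line (error 1158).
constexpr int ER_NET_READ_ERROR = 1158;

using SlaveList = std::list<std::unique_ptr<Slave>>;

// Hypothetical helper: on a net read error, build the warning while the
// slave is still alive, and only then remove (and destroy) it.
// Returns the warning text, or an empty string if nothing was logged.
std::string report_and_remove(SlaveList &slaves, SlaveList::iterator it,
                              int last_errno, int log_warnings)
{
  assert(it != slaves.end());
  std::string warning;
  if (last_errno == ER_NET_READ_ERROR) {
    if (log_warnings > 2) {
      char buf[128];
      // server_id() is read BEFORE the element is erased.
      std::snprintf(buf, sizeof(buf),
                    "Semisync ack receiver got error %d from slave server-id %d",
                    last_errno, (*it)->server_id());
      warning = buf;
    }
    // Erasing destroys the Slave; nothing dereferences it after this point.
    slaves.erase(it);
  }
  return warning;
}
```

The only point of the sketch is the ordering: every read of the slave happens before the removal, so the removal can stay unconditional while the warning remains gated on the log level.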

Also, since the merge, the error log can report three separate messages for a single slave disconnect, which may be noisy or confusing for users:

2023-12-06 14:30:21 10 [Warning] Semisync ack receiver got error 1158 "Got an error reading communication packets" from slave server-id 2
2023-12-06 14:30:21 45 [Warning] Aborted connection 44 to db: 'unconnected' user: 'root' host: 'localhost' (KILLED)
2023-12-06 14:30:21 44 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)



 Comments   
Comment by Michael Widenius [ 2023-12-07 ]

Fixed and pushed to bb-10.6-semisync
