MariaDB MaxScale / MXS-4668

Binlogrouter eventually stops working if semi-sync replication is not used

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 23.02.2
    • Fix Version/s: 23.02.4
    • Component/s: binlogrouter
    • Labels: None
    • Sprint: MXS-SPRINT-187, MXS-SPRINT-188

    Description

      The binlogrouter will eventually stop replicating with an error that has no error message. This only happens if semi-sync replication is not configured on the server from which the binlogrouter is replicating. It is caused by CONC-659, where the connector always assumes that the two bytes of semi-sync information are sent.
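
      For context, here is a minimal C sketch of the packet handling involved. The flag and function names are illustrative assumptions, not the actual Connector/C code; it only shows why unconditionally stripping the two semi-sync bytes corrupts events when semi-sync was never negotiated:

      #include <stdbool.h>
      #include <stddef.h>
      #include <stdint.h>

      /* Each binlog event arrives in a network packet that starts with an OK
       * byte (0x00). Only when semi-sync replication has been negotiated does
       * the master prefix the event with two extra bytes: the magic 0xEF and
       * an ACK-request flag. Skipping those two bytes unconditionally (the
       * CONC-659 behaviour) throws away the first two bytes of the real event
       * whenever semi-sync is not in use. */
      static const uint8_t *binlog_event_start(const uint8_t *packet, size_t len,
                                               bool semi_sync_negotiated,
                                               size_t *event_len)
      {
          if (len < 1 || packet[0] != 0x00)    /* not an OK packet: error or EOF */
              return NULL;

          size_t offset = 1;                   /* skip the OK byte */

          if (semi_sync_negotiated)            /* prefix is present only in this case */
          {
              if (len < 3 || packet[1] != 0xEF)
                  return NULL;
              offset += 2;                     /* magic byte + ACK-request flag */
          }

          *event_len = len - offset;
          return packet + offset;
      }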


      Original description:
      I am testing the binlog router with MariaDB Enterprise 10.6.14. Eventually it stops reading binlog events from the master:

      MaxScale log:
      2023-07-08 12:39:11   error  : (Replication-Proxy); Error received during replication from '172.31.1.142:3306': Failed to fetch binlog event from master: 
      2023-07-08 12:39:13   error  : (Replication-Proxy); Error received during replication from '172.31.1.142:3306': Failed to fetch binlog event from master: 
      ...
      

      MariaDB [(none)]> show slave status\G
      *************************** 1. row ***************************
                      Slave_IO_State: Reconnecting after a failed primary event read
                         Master_Host: 172.31.1.142
                         Master_User: xxxxxx
                         Master_Port: 3306
                       Connect_Retry: 1
                     Master_Log_File: mariadb-bin.000001
                 Read_Master_Log_Pos: 2756715
      ...
                          Last_Errno: -1
                          Last_Error: Failed to fetch binlog event from master: 
      ...
                         Gtid_IO_Pos: 0-18334-2749960
      ...
      

      The binlog copy from MaxScale ends with the COMMIT for GTID 0-18334-2749960; replication fails on the next GTID, which is announced at position 2756805 in mariadb-bin.000009.

      The last good binlog event and the failed one decoded:

      mysqlbinlog --start-position=2756774 --stop-position=2756847 --hexdump /data/clustrix/mariadb/mariadb-bin.000009
       
      # at 2756774
      #230708 12:39:10 server id 18334  end_log_pos 2756805 CRC32 0x6c726826
      # Position
      #           |Timestamp   |Type |Master ID   |Size        |Master Pos  |Flags
      #   2a10a6  |ee 58 a9 64 |10   |9e 47 00 00 |1f 00 00 00 |c5 10 2a 00 |00 00
      #
      #   2a10b9  fb 4b 55 03 00 00 00 00  26 68 72 6c              |.KU.....&hrl|
      #
      # Event:        Xid = 55921659
      COMMIT/*!*/;
      # at 2756805
      #230708 12:39:11 server id 18334  end_log_pos 2756847 CRC32 0x286242e3
      # Position
      #           |Timestamp   |Type |Master ID   |Size        |Master Pos  |Flags
      #   2a10c5  |ef 58 a9 64 |a2   |9e 47 00 00 |2a 00 00 00 |ef 10 2a 00 |08 00
      #
      #   2a10d8  09 f6 29 00 00 00 00 00  00 00 00 00 0c 00 00 00  |..).............|
      #   2a10e8  00 00 00 e3 42 62 28                              |....Bb(|
      #
      # Event:        GTID 0-18334-2749961 trans
      /*!100001 SET @@session.gtid_seq_no=2749961*//*!*/;
      START TRANSACTION
      /*!*/;
      DELIMITER ;
      # End of log file
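
      For reference, here is a small C sketch that decodes the fixed 19-byte replication event header whose fields correspond to the hexdump columns above (Timestamp, Type, Master ID, Size, Master Pos, Flags); the struct and helper names are mine, the field layout is the standard binlog event header:

      #include <stdint.h>

      /* Standard 19-byte binlog event header; all multi-byte fields are
       * little-endian. In the hexdump above, type 0x10 is an XID (COMMIT)
       * event and type 0xa2 is a MariaDB GTID event. */
      struct event_header
      {
          uint32_t timestamp;    /* seconds since the epoch */
          uint8_t  type;         /* event type code */
          uint32_t server_id;    /* 9e 47 00 00 -> 18334 */
          uint32_t event_size;   /* size of the whole event, header included */
          uint32_t next_pos;     /* end_log_pos: where the next event starts */
          uint16_t flags;
      };

      static uint32_t le32(const uint8_t *p)
      {
          return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
      }

      static void parse_event_header(const uint8_t *p, struct event_header *h)
      {
          h->timestamp  = le32(p);
          h->type       = p[4];
          h->server_id  = le32(p + 5);
          h->event_size = le32(p + 9);
          h->next_pos   = le32(p + 13);
          h->flags      = (uint16_t)p[17] | ((uint16_t)p[18] << 8);
      }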
      

      Attached:

      • my.cnf from master
      • maxscale.cnf
      • maxscale log
      • the maxscale copy of the binlog

      Attachments

        1. BLR10.png (14 kB)
        2. BLR20.png (14 kB)
        3. mariadb-bin.000001.gz (832 kB)
        4. master.my.cnf (2 kB)
        5. maxscale.cnf (0.5 kB)
        6. maxscale.log (395 kB)


          Activity

            Axel Schwenke added a comment -

            I have repeated the test with MaxScale 2.4.19 (maxscale-2.4.19-1.rhel.8.x86_64.rpm). It does not have any problems parsing the binlog. If however mariadb10_master_gtid is enabled then MaxScale 2.4 cannot keep pace with the master.

            Axel Schwenke added a comment - edited

            Some remarks about my environment:

            • master and 4 slaves, m5.2xlarge (8 vCPU, 32G RAM); running MariaDB Enterprise 10.6.14
            • 1x MaxScale 23.03 configured with readwritesplit router in front, m5.4xlarge
            • 4 benchmark drivers c5.xlarge (4 vCPU, 16G RAM), running sysbench oltp_read_write_split.lua
            • 1x MaxScale 23.03 as BLR, m5.2xlarge

             I think that to reproduce this it is enough to have one MariaDB server, the MaxScale running BLR, and the driver running sysbench 1.1. All those processes can be colocated on the same hardware. If distributed, then run the MariaDB server and sysbench on one host and the MaxScale BLR on the other.

            I used oltp_read_write_split.lua --write-percentage=10 (see here) but I guess oltp_read_write.lua is also ok since we don't use the second maxscale with readwritesplit.

             Bring the MariaDB server and MaxScale online (both config files are included). Point the BLR at the master and make sure it's replicating.

             Load the data set: 10 tables with 1 million rows each. This requires ~2.5 GB in the datadir and the buffer pool. Adjust if you don't have that much space. Run as root:

            sysbench /path/to/oltp_read_write.lua --mysql-socket={master-socket} --mysql-user=root --mysql-db=test --table-size=1000000 --tables=10 --threads=10  prepare
            

             If this was successful, execute (again as root):

            sysbench /path/to/oltp_read_write.lua --mysql-socket={master-socket} --mysql-user=root --mysql-db=test --table-size=1000000 --tables=10 --threads=32  --time=300 --rand-type=uniform --report-interval=10 --db-ps-mode=disable run
            

             This assumes you run sysbench on the same host as MariaDB. If things go as expected you should see the BLR hanging. If not, adjust the number of threads and/or the runtime.

             If necessary you can also try oltp_read_write_split.lua with --write-percentage=10. It can be found here. Note: you must also use our copy of oltp_common.lua.

            Axel Schwenke added a comment -

             I have run 3-hour benchmarks against MaxScale 23.02.3 without semi-sync configured on the master, and it didn't stop working. So I can confirm the fix works.

             I have also rerun the same tests against 23.02.2, but this time with the master configured for semi-sync, and that also went well.

             Benchmark results: see the attached BLR10.png and BLR20.png.

             It looks as if BLR from 23.02 gives better results for semi-sync replication than BLR from 2.4. But this could simply be because the replication isn't really semi-sync (MaxScale is not sending ACK packets to the master) and is silently downgraded by the master to normal replication. Does that sound right?
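
             For illustration, here is a minimal C sketch of the semi-sync ACK (reply) packet a semi-sync slave would send after each event, which the binlogrouter does not do here. The layout (magic byte, 8-byte position, file name) reflects my reading of the semi-sync protocol and the helper is hypothetical:

             #include <stdint.h>
             #include <string.h>

             /* Semi-sync reply payload: 1 magic byte (0xEF), the 8-byte binlog
              * position (little-endian), then the binlog file name. Without
              * these replies the master times out and silently falls back to
              * asynchronous replication. The caller must provide a buffer of
              * at least 9 + strlen(file) bytes. */
             static size_t build_semi_sync_ack(uint8_t *buf, uint64_t pos, const char *file)
             {
                 buf[0] = 0xEF;                        /* semi-sync magic byte */
                 for (size_t i = 0; i < 8; i++)        /* position, little-endian */
                     buf[1 + i] = (uint8_t)(pos >> (8 * i));
                 size_t name_len = strlen(file);
                 memcpy(buf + 9, file, name_len);      /* file name, no terminator */
                 return 9 + name_len;
             }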

            markus makela added a comment - edited

             If you didn't configure semi-sync in 2.4 then it didn't use semi-sync at all. However, in this case the performance should be better, not worse, as the replication goes into fully asynchronous mode. This could suggest that 23.02 is faster than 2.4 when used with semi-sync replication.

            All in all, this confirms that an unconditional SET @rpl_semi_sync_slave=1 will fix the problem for now.
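
             For reference, a hedged Connector/C sketch of that workaround: issue the SET before starting the binlog dump so the master always sends the two-byte semi-sync prefix that the connector strips. Host and credentials are placeholders and error handling is trimmed:

             #include <mysql.h>
             #include <stdio.h>

             int main(void)
             {
                 MYSQL *con = mysql_init(NULL);

                 /* Placeholder endpoint and credentials. */
                 if (!mysql_real_connect(con, "172.31.1.142", "repl_user", "repl_pass",
                                         NULL, 3306, NULL, 0))
                 {
                     fprintf(stderr, "connect failed: %s\n", mysql_error(con));
                     return 1;
                 }

                 /* Announce ourselves as a semi-sync slave so the master prefixes
                  * every binlog event with the two bytes the connector expects. */
                 if (mysql_query(con, "SET @rpl_semi_sync_slave=1"))
                 {
                     fprintf(stderr, "SET failed: %s\n", mysql_error(con));
                     return 1;
                 }

                 /* ... continue with the usual replication setup: register as a
                  * slave and request the binlog dump ... */

                 mysql_close(con);
                 return 0;
             }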


            People

              Assignee: markus makela
              Reporter: Axel Schwenke
              Votes: 0
              Watchers: 2

