[MXS-4668] Binlogrouter eventually stops working if semi-sync replication is not used Created: 2023-07-08 Updated: 2023-08-29 Resolved: 2023-08-15 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | binlogrouter |
| Affects Version/s: | 23.02.2 |
| Fix Version/s: | 23.02.4 |
| Type: | Bug | Priority: | Critical |
| Reporter: | Axel Schwenke | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Attachments: | |
| Issue Links: | |
| Sprint: | MXS-SPRINT-187, MXS-SPRINT-188 |
| Description |
|
The binlogrouter will eventually stop replicating with an error that has no error message. This only happens if semi-sync replication is not configured for the server from which the binlogrouter is replicating. It is caused by CONC-659, where the connector always assumes that the two bytes of semi-sync information are sent. Original description:
The binlog from MaxScale ends with the COMMIT for GTID 0-18334-2749960 and fails with the next GTID being announced at position 2756805 in mariadb-bin.000009. The last good binlog event and the failed one decoded:
Attached:
|
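To illustrate the CONC-659 failure mode described above, here is a hedged sketch (not MaxScale's or Connector-C's actual code) of how a replication client decodes a binlog dump packet. The two semi-sync bytes are only present when the client has negotiated semi-sync with `SET @rpl_semi_sync_slave=1`, so stripping them unconditionally corrupts every event on a non-semi-sync connection:

```python
# Hypothetical sketch of binlog dump packet decoding. Payload layout
# (after the MySQL wire-protocol packet header):
#   [0x00 OK byte][optional 2-byte semi-sync header][binlog event]
# The 2-byte header (magic 0xEF + flags) is present only when the
# client negotiated semi-sync via SET @rpl_semi_sync_slave=1.

SEMI_SYNC_MAGIC = 0xEF

def extract_event(payload: bytes, semi_sync_negotiated: bool) -> bytes:
    """Return the raw binlog event from an OK-prefixed dump packet."""
    if not payload or payload[0] != 0x00:
        raise ValueError("not an OK packet (error or EOF)")
    body = payload[1:]
    if semi_sync_negotiated:
        # Only now are the two semi-sync bytes present.
        if len(body) < 2 or body[0] != SEMI_SYNC_MAGIC:
            raise ValueError("missing semi-sync header")
        body = body[2:]  # strip magic + flags
    # The CONC-659 bug, in effect, always did `body = body[2:]`,
    # eating the first two bytes of the event when semi-sync was off.
    return body

# A fake 5-byte "event" prefixed with the OK byte, no semi-sync header:
assert extract_event(b"\x00HELLO", semi_sync_negotiated=False) == b"HELLO"
# The same event behind a semi-sync header (magic 0xEF, flags 0x00):
assert extract_event(b"\x00\xef\x00HELLO", semi_sync_negotiated=True) == b"HELLO"
# Stripping unconditionally on the first packet would have yielded
# b"LLO": a truncated, undecodable event, matching the symptom above.
```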
| Comments |
| Comment by Axel Schwenke [ 2023-07-10 ] | ||
|
I have repeated the test with MaxScale 2.4.19 (maxscale-2.4.19-1.rhel.8.x86_64.rpm). It does not have any problems parsing the binlog. If, however, mariadb10_master_gtid is enabled, then MaxScale 2.4 cannot keep pace with the master. | ||
| Comment by Axel Schwenke [ 2023-07-10 ] | ||
|
Some remarks about my environment:
I think to reproduce this it is enough to have one MariaDB server, the MaxScale running BLR, and the driver running sysbench 1.1. All of these processes can be colocated on the same hardware; if distributed, run the MariaDB server and sysbench on one host and the MaxScale BLR on the other. I used oltp_read_write_split.lua --write-percentage=10 (see here), but I guess oltp_read_write.lua is also fine since we don't use the second MaxScale with readwritesplit. Bring the MariaDB server and MaxScale online (both config files included). Point the BLR at the master and make sure it is replicating. Load the data set: 10 tables with 1 million rows each. This requires ~2.5 GB in the datadir and the buffer pool; adjust if you don't have that much space. Run as root:
If this was successful, execute (again as root):
This assumes you run sysbench on the same host as MariaDB. If things go as expected you should see BLR hanging; if not, adjust the number of threads and/or the runtime. If necessary you can also try oltp_read_write_split.lua with --write-percentage=10; it can be found here. Note: you must also use our copy of oltp_common.lua. | ||
| Comment by Axel Schwenke [ 2023-08-07 ] | ||
|
I have run 3-hour benchmarks against MaxScale 23.02.3 without a semi-sync master and it didn't stop working, so I can confirm the fix works. I have also rerun the same tests against 23.02.2, but this time with the master configured for semi-sync, and that also went well. It looks as if BLR from 23.02 gives better results for semi-sync replication than BLR from 2.4. But this could simply be because the replication isn't really semi-sync (MaxScale not sending ACK packets to the master) and is silently downgraded by the master to normal replication. Does that sound right? | ||
| Comment by markus makela [ 2023-08-07 ] | ||
|
If you didn't configure semi-sync in 2.4, then it didn't use semi-sync at all. However, in this case the performance should be better, not worse, as the replication goes into fully asynchronous mode. This could suggest that 23.02 is faster than 2.4 when used with semi-sync replication. All in all, this confirms that an unconditional SET @rpl_semi_sync_slave=1 will fix the problem for now. |