[MXS-4685] Replication via binlogrouter temporarily blocks the REST-API
Created: 2023-07-26  Updated: 2023-11-17  Resolved: 2023-08-10

Status: Closed
Project: MariaDB MaxScale
Component/s: binlogrouter
Affects Version/s: 22.08.4
Fix Version/s: 22.08.8, 23.02.4

Type: Bug
Priority: Critical
Reporter: Bryan Bancroft (Inactive)
Assignee: markus makela
Resolution: Fixed
Votes: 0
Labels: triage
Environment: SkySQL DR connecting to on-prem MaxScale

Attachments: perf_report_replStarting.txt
Issue Links:
Relates
relates to MXS-4710 Run (reverse) name lookups in a separ... Closed

 Description   

When the slave thread starts, MaxScale hits high CPU usage on a single node, which makes maxctrl very slow to return. For this customer, it causes the health checks to flap.

The slave thread connection occasionally appears to consume an entire core, or at least spikes high. Why does a single connection hit a bottleneck?

##On replica
MariaDB [(none)]> stop slave 'ext_repl';start slave 'ext_repl';
##Usage observations 
 670282 maxscale  20   0 4586296  51076  14948 S  56.1   0.1 558:38.14 maxscale
##Seen delay in maxctrl return
Wed 26 Jul 2023 08:59:34 PM UTC
Wed 26 Jul 2023 08:59:36 PM UTC



 Comments   
Comment by markus makela [ 2023-08-03 ]

We'll need a way to reproduce this locally (example table and dataset, MaxScale configuration, etc.) or some profiling information on what is causing the slowness.

Comment by Bryan Bancroft (Inactive) [ 2023-08-08 ]

markus makela This appears to be a slight improvement, but it still lags when binlog replication starts.

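┌────────────┬───────────────┬──────┬─────────────┬───────────────────────────────────────────┬───────────────────────────────────┬─────────────────┐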
│ Server     │ Address       │ Port │ Connections │ State                                     │ GTID                              │ Monitor         │
├────────────┼───────────────┼──────┼─────────────┼───────────────────────────────────────────┼───────────────────────────────────┼─────────────────┤
│ db1.mfgreg │ 10.90.194.124 │ 3306 │ 0           │ Master, Running, Slave of External Server │ 0-110-64564785,921500-921500-2352 │ MariaDB-Monitor │
├────────────┼───────────────┼──────┼─────────────┼───────────────────────────────────────────┼───────────────────────────────────┼─────────────────┤
│ db2.mfgreg │ 10.90.194.27  │ 3306 │ 0           │ Slave, Running                            │ 0-110-64564785,921500-921500-2352 │ MariaDB-Monitor │
└────────────┴───────────────┴──────┴─────────────┴───────────────────────────────────────────┴───────────────────────────────────┴─────────────────┘
 
real    0m5.067s
user    0m0.187s
sys     0m0.025s

Comment by markus makela [ 2023-08-08 ]

bbancroft please check the other commands as well, at least maxctrl show maxscale and maxctrl list filters.
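
For anyone repeating this check, a minimal sketch of timing several commands in one go is given below. The helper name time_commands is hypothetical and not part of MaxScale; it only wraps whatever command lines it is given and assumes maxctrl is on the PATH when used as in the trailing comment.

```python
import subprocess
import time
from typing import Dict, Sequence


def time_commands(commands: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Run each command once and return its wall-clock duration in seconds.

    Hypothetical helper: output is discarded, only the elapsed time is kept.
    """
    timings: Dict[str, float] = {}
    for argv in commands:
        start = time.monotonic()
        # check=False: a failing command is still timed instead of raising.
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        timings[" ".join(argv)] = time.monotonic() - start
    return timings


# Example use on the MaxScale host while the replica runs START SLAVE
# (assumes maxctrl is installed and on the PATH):
#
#   results = time_commands([
#       ["maxctrl", "show", "maxscale"],
#       ["maxctrl", "list", "filters"],
#   ])
#   for name, seconds in results.items():
#       print(f"{name}: {seconds:.3f}s")
```

Running it once before and once during replication start gives a rough per-command comparison of REST-API latency.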

Comment by markus makela [ 2023-08-10 ]

The new code now seeks to the GTID position incrementally in the normal replication event processing code. The startup code only finds the file in which the GTID is located, which is very fast compared to finding the GTID position within the file. Seeking is also faster overall because the unnecessary tellg() calls were removed from the event handling code, and the work is now scheduled on the worker threads at a finer granularity. As a result, the binlogrouter no longer causes any issues with the REST-API while a server is initiating replication from the binlogrouter.
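
The idea of seeking incrementally can be illustrated with a small sketch. This is not the MaxScale implementation (which is C++); it is a simplified Python model in which the event list, the chunk size and the names seek_incrementally and run_worker are all assumptions made for illustration. The seek yields control after every few events so that other queued work, such as REST-API requests, can run in between.

```python
from collections import deque
from typing import Iterable, Iterator, List, Optional, Tuple


def seek_incrementally(
    events: List[str], target_gtid: str, chunk: int = 2
) -> Iterator[Optional[int]]:
    """Scan events for target_gtid, pausing after every `chunk` events.

    Yields None after each chunk that did not contain the target, then
    yields the index of the match, or -1 if the target is not present.
    """
    for start in range(0, len(events), chunk):
        for i in range(start, min(start + chunk, len(events))):
            if events[i] == target_gtid:
                yield i
                return
        yield None  # hand control back so other work can be scheduled
    yield -1


def run_worker(
    seek: Iterator[Optional[int]], other_tasks: Iterable[str]
) -> Tuple[Optional[int], List[str]]:
    """Toy worker loop: one seek chunk, then one pending task, repeated."""
    log: List[str] = []
    pending = deque(other_tasks)
    result: Optional[int] = None
    for step in seek:
        if step is not None:
            result = step
            log.append(f"seek done: {step}")
            break
        log.append("seek chunk")
        if pending:
            log.append(pending.popleft())
    log.extend(pending)  # tasks still queued run after the seek finishes
    return result, log
```

In this model a long seek no longer runs as one uninterrupted block: each pending task waits for at most one chunk of events rather than for the whole file scan.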

Generated at Thu Feb 08 04:30:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.