[MXS-4775] KafkaCDC: current_gtid.txt is moving but is behind Created: 2023-09-23 Updated: 2023-09-29 Resolved: 2023-09-29 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | avrorouter, kafkacdc |
| Affects Version/s: | 23.02.3 |
| Fix Version/s: | 6.4.11, 22.08.9, 23.02.5, 23.08.2 |
| Type: | Bug | Priority: | Major |
| Reporter: | Presnickety | Assignee: | markus makela |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | replication |
| Environment: | RHEL v8.2 |
| Attachments: |
| Issue Links: |
| Description |
|
Hi There, We've observed that the GTID value within current_gtid.txt is behind what is reported by the maxctrl CLI. We have three MaxScale instances, and the current_gtid.txt on each one exhibits the same behaviour. The value within the current_gtid.txt files is at least a day old, yet the binary logs are purged after 3 hrs. Thanks. |
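The lag described here can be quantified by comparing the sequence number in current_gtid.txt with the live position maxctrl reports. A minimal sketch; the file path shown in the comment is an assumption (the typical MaxScale data directory layout), and the example GTID values are illustrative only:

```shell
# Print the sequence-number gap between two MariaDB GTIDs of the form
# domain-server_id-sequence (e.g. 1-1-2048859537).
gtid_lag() {
  local file_seq=${1##*-} live_seq=${2##*-}   # keep only the sequence part
  echo $(( live_seq - file_seq ))
}

# Example values only; in practice the first argument would come from
# /var/lib/maxscale/<service>/current_gtid.txt (assumed path) and the
# second from the position shown by `maxctrl show services`.
gtid_lag 1-1-2048859537 1-1-2048900000   # prints 40463
```

A growing gap between runs would confirm that the file is persistently behind rather than momentarily stale.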
| Comments |
| Comment by markus makela [ 2023-09-26 ] | ||
|
What does the output of maxctrl show services show as the GTID position? Are there also any errors in the MaxScale log? Is the MaxScale instance using a lot of CPU or is it hitting some other resource limitation? Is the kafka broker still getting updates via MaxScale or does it appear stalled? | ||
| Comment by Presnickety [ 2023-09-27 ] | ||
|
Hi Markus, Please see the attached snips for the three servers: there's a difference between maxctrl list servers vs. maxctrl show services for the GTIDs. Yesterday servers 02 & 03 reported the following error: 2023-09-26 05:57:38 error : (159) [readwritesplit] (Read-Write-Split); No valid candidates for session command (COM_STMT_PREPARE: call session_insert_dumptruck(?,?,?,?,?,?,?,?,?)). Connection status: | ||
| Comment by markus makela [ 2023-09-27 ] | ||
|
One possible reason is that the broker can't keep up with the message flow (example here). How loaded is the broker itself (CPU and disk IO)? | ||
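A rough way to answer that question on the broker host itself (a sketch; iostat comes from the sysstat package and may need installing first):

```shell
# Load averages on the broker host.
load=$(uptime)
echo "$load"

# Per-device disk utilisation, two 5-second samples; skipped gracefully
# if sysstat is not installed on this host.
command -v iostat >/dev/null && iostat -dx 5 2 || true
```

Sustained high device utilisation or long queue times here would point at the broker, rather than MaxScale, as the bottleneck.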
| Comment by Presnickety [ 2023-09-27 ] | ||
|
We also saw the following repeated notice message from servers 02 & 03: 2023-09-26 05:57:51 notice : Started replicating from [10.195.241.80]:3306 at GTID '1-1-2048859537' | ||
| Comment by Presnickety [ 2023-09-27 ] | ||
|
We have 16 vCPUs in the VM, so CPU + Mem usage are OK: top - 17:12:03 up 3 days, 10 min, 1 user, load average: 9.33, 9.96, 10.05 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | ||
| Comment by Presnickety [ 2023-09-27 ] | ||
|
Depending on whether the write master experiences a failover or a drop in connections, the KafkaCDC feed may continue even when servers 02 & 03 are showing the above errors. Once the write master has a failover event, it will log errors as per the above. Sometimes there will be a few errors logged and then the feed will continue; other times there will be constant errors and the feed stops. We configured a tmpfs for the KafkaCDC directory as per MXS-4404 and experienced the same behaviour; since the version upgrade we have reverted to the xfs file system. | ||
| Comment by Presnickety [ 2023-09-27 ] | ||
|
CPU + Mem on the broker host are OK. We had the same issue when using a Kafka broker cluster that is in the same data centre / same subnet as the load balancer VMs: top - 17:45:32 up 50 days, 1:26, 1 user, load average: 0.69, 0.24, 0.09 PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | ||
| Comment by markus makela [ 2023-09-27 ] | ||
|
If that top output was taken when MaxScale was processing events and the database was in normal usage, MaxScale is using quite a lot of CPU. I think that the slowdown might be somewhere in the kafkacdc processing that's taking a lot of CPU time. Are you able to trace the MaxScale process with perf to see where the CPU time is being spent? You can do this with the following commands:
Please then upload the generated perf.script file here and we can produce a CPU flamegraph from it (using this). | ||
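The commands themselves did not survive in this export. A standard perf recording session against a running MaxScale looks roughly like the following (the sampling frequency and 30-second duration are illustrative choices, not taken from the original comment):

```shell
# Sample on-CPU call stacks of the maxscale process for 30 seconds,
# then dump them in text form suitable for flamegraph tooling.
pid=$(pgrep -x maxscale || true)
if [ -n "$pid" ]; then
  perf record -F 99 -g -p "$pid" -- sleep 30   # -g captures call graphs
  perf script > perf.script                    # text output for upload
fi
```

Recording while the GTID lag is actively growing gives the most representative profile.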
| Comment by markus makela [ 2023-09-28 ] | ||
|
Screencast from 2023-09-28 10-00-51.webm | ||
| Comment by Presnickety [ 2023-09-28 ] | ||
|
Hello Markus, I managed to prepare the perf.script. FYI I've attached a graph of the CPU soft lockups since the platform was upgraded to vCenter 7 in late Feb/Mar; the replicator process appears quite a lot. Thanks. | ||
| Comment by markus makela [ 2023-09-28 ] | ||
|
OK, it looks like at least part of what is in that perf output is the same as what I found in my local testing. In the perf output, the blocking calls that happen when the server status information is requested from another thread do not show up as much as they did when I profiled the code locally. However, what does show up is the binlog event-to-JSON conversion followed by the JSON-to-string conversion. The synchronization problem is easier to solve by moving the updates to happen in parallel, so that if the set of servers that replication is done from changes, or the ownership of the cluster is lost, the replicating thread will pick it up shortly afterwards. This is not perfect, since there's a small gap during which events are still replicated even if the server is no longer a part of the service or the cluster has lost ownership, but given the nature of KafkaCDC and the fact that there's no guarantee of events being delivered only once, it doesn't seem like a showstopper. Ideally, updating the servers of a service or losing cluster ownership would trigger a function to be called, but I suspect that it wouldn't improve the situation in practice. The JSON conversion is harder to solve; I'll open a separate bug for that, as fixing it requires more work. | ||
| Comment by markus makela [ 2023-09-28 ] | ||
|
Do those soft lockup errors have a stacktrace in them? | ||
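One way to check is to pull the lockup reports, including any call traces, straight out of the kernel log. A sketch; the 25-line context window is a guess at how much of the trace follows the BUG line:

```shell
# Collect soft-lockup reports plus surrounding context (the call trace
# usually follows the "BUG: soft lockup" line) from the kernel journal,
# falling back to the ring buffer where journalctl is unavailable.
{ journalctl -k --no-pager 2>/dev/null || dmesg 2>/dev/null; } \
  | grep -A 25 'soft lockup' > soft_lockups.txt || true
wc -l soft_lockups.txt
```

An empty file here would mean the lockups were logged without traces, e.g. if the ring buffer has already rotated past them.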
| Comment by markus makela [ 2023-09-28 ] | ||
|
Here's the perf output visualized as a flame graph. | ||
| Comment by Presnickety [ 2023-09-29 ] | ||
|
Hi Markus, I found dumps only for the processes labelled 'Replicator', not for processes labelled 'cdc::Replicator'. Dumps specifically for the following events have been attached separately: Sep 25 09:12:18 viexh-session-usage-mdb-03 kernel: watchdog: BUG: soft lockup - CPU#1 stuck for 34s! [Replicator:427436] I have also attached all core dumps since this was enabled in August 2023, and all CPU soft lockup events involving the 'Replicator' and 'cdc::Replicator' processes. Thanks. | ||
| Comment by markus makela [ 2023-09-29 ] | ||
|
OK, those look like perfectly normal stacktraces of an idle MaxScale, which means that it's probably a problem related to virtualization. The reason why it's the Replicator thread might be that it's doing a blocking network read, which for some reason is more likely to be treated as a locked-up kernel call when the VM gets scheduled out to make room for another one. I'd recommend keeping an eye on those and monitoring whether the virtualization server is under heavy load or there's a lot of memory pressure. | ||
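One concrete signal worth watching for this virtualization theory is CPU steal time inside the guest (a sketch; vmstat ships with procps, so it should already be present on RHEL 8):

```shell
# The "st" column in vmstat reports time stolen by the hypervisor.
# Sustained non-zero values mean the VM is being scheduled out, which is
# consistent with spurious soft-lockup reports from an otherwise idle host.
out=$(vmstat 1 2 2>/dev/null || echo "vmstat unavailable (procps not installed)")
echo "$out"
```

Correlating steal-time spikes with the timestamps of the watchdog messages would make the hypervisor-pressure explanation fairly conclusive.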
| Comment by Presnickety [ 2023-09-29 ] | ||
|
Hi Markus, Actually several different processes get affected, mostly the swapper process: Sep 28 17:35:25 viexh-session-usage-mdb-01 kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 34s! [swapper/14:0] |