[MXS-2753] MXS crash on cdc stream request Created: 2019-11-06  Updated: 2021-08-10  Resolved: 2021-08-10

Status: Closed
Project: MariaDB MaxScale
Component/s: avrorouter, binlogrouter, cdc
Affects Version/s: 2.4.2, 2.4.3
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: DBA666 Assignee: markus makela
Resolution: Cannot Reproduce Votes: 1
Labels: None
Environment:

CentOS7, 8 core Xeon, 8GB RAM.
Maxscale 2.4.3 (updated today). MXS CDC Connector 2.4.3.



 Description   

Currently seeing Maxscale hang, fail, and then restart via the systemd watchdog whenever a CDC request is made. This happens both with a local cdc.py request and with an external server running mxs_adapter.

The service had been running for around 6 hours without failure, handling circa 1k new rows into Columnstore every 10 seconds successfully.

Nothing changed in the config of mxs_adapter or maxscale between running state and failed state.

The failure began after an update from 2.4.2 to 2.4.3, which was performed earlier today.

Additionally, following advice in https://jira.mariadb.org/browse/MXS-964 the router_options entry was added to the avro-router.

On failure, the Maxscale service outputs the following:

Nov 06 16:35:05 maxscale1 systemd[1]: maxscale.service watchdog timeout (limit 1min)!
Nov 06 16:35:05 maxscale1 maxscale[63821]: Fatal: MaxScale 2.4.3 received fatal signal 6. Commit ID: b33ef98f6c26b71e3cc9ea44b398776d51b35664 System name: Linux Release string: NAME="CentOS Linux"
Nov 06 16:35:05 maxscale1 maxscale[63821]: 
                                                    /lib64/libc.so.6(epoll_wait+0x33): :?
                                                    /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0xd0): maxutils/maxbase/src/worker.cc:795
                                                    /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53): maxutils/maxbase/src/worker.cc:559
                                                    /usr/bin/maxscale(main+0x2a76): server/core/gateway.cc:2265
                                                    /lib64/libc.so.6(__libc_start_main+0xf5): ??:?
                                                    /usr/bin/maxscale(): ??:?
Nov 06 16:35:05 maxscale1 systemd[1]: maxscale.service: main process exited, code=killed, status=6/ABRT
Nov 06 16:35:05 maxscale1 systemd[1]: Unit maxscale.service entered failed state.
Nov 06 16:35:05 maxscale1 systemd[1]: maxscale.service failed.

And here is the relevant section of the Maxscale config file:

[replication-listener]
type=listener
service=replication
protocol=MariaDBClient
port=3311
 
[replication]
type=service
router=binlogrouter
master_id=50
server_id=50
binlogdir=/home/binlogs
filestem=mysql-bin
user=username
password=password
 
[avro-router]
type=service
router=avrorouter
source=replication
router_options=disable_sescmd_history=true
match=/databasename\.tablename/
binlogdir=/home/binlogs
avrodir=/home/binlogs
filestem=mysql-bin
start_index=17
 
[avro-listener]
type=listener
service=avro-router
protocol=CDC
port=4001



 Comments   
Comment by DBA666 [ 2019-11-06 ]

Removing the 60-second watchdog from the service definition has allowed mxs_adapter to resume. At startup of mxs_adapter, Maxscale hangs for around 2 minutes while the avrorouter locates the correct GTID position within the large file before streaming recommences.

Is there some way to instruct maxscale to start a new avro file to prevent this large delay on startup?

As a side issue, because the Maxscale service hangs during this time, maxctrl is unavailable, resulting in a timeout from the CLI. For us this means our keepalived probe fails and triggers failover to another node. Not super critical for this setup, but worth mentioning all the same.
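For anyone hitting the same failover side effect, a probe along these lines keeps a hung maxctrl from blocking keepalived indefinitely. This is purely an illustrative sketch: the timeout value, interval, and thresholds are assumptions, not taken from the reporter's actual setup.

```
# Hypothetical keepalived.conf fragment: wrap maxctrl in a hard timeout so a
# hung MaxScale makes the probe fail fast instead of blocking keepalived.
vrrp_script chk_maxscale {
    script "/usr/bin/timeout 5 /usr/bin/maxctrl list servers"
    interval 10   # run the probe every 10 seconds
    fall 3        # mark the node faulty after 3 consecutive failures
    rise 2        # mark it healthy again after 2 consecutive successes
}
```

Whether failing fast is the right behaviour here is a design choice: with the service merely hung rather than dead, a short timeout still triggers failover, so the thresholds need tuning to the expected hang duration.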

Comment by markus makela [ 2019-11-06 ]

There should be a fix in 2.4.3 (MXS-2610) that solves some watchdog related problems in the avrorouter. Can you confirm that you're still seeing this?

Also, please open another bug for the slow startup of CDC requests.

Comment by Caleb Terry [ 2019-11-06 ]

Having a very similar issue in our setup. The issue started yesterday at 7:53AM CST running on CentOS 7.6.1810 with Maxscale 2.3.8-1. We upgraded to 2.3.13-1 and the issue persisted, so we upgraded to 2.4.3 and it still persists. 2.3.8-1 has been working for months in our environment and we haven't made any config changes. The output is very similar to the original report:
Nov 6 17:37:28 mxs-blf1 systemd: maxscale.service watchdog timeout (limit 1min)!
Nov 6 17:37:28 mxs-blf1 maxscale[24947]: (sigfatal_handler): Fatal: MaxScale 2.4.3 received fatal signal 6. Commit ID: b33ef98f6c26b71e3cc9ea44b398776d51b35664 System name: Linux Release string: CentOS Linux release 7.6.1810 (Core)
Nov 6 17:37:28 mxs-blf1 maxscale[24947]: (sigfatal_handler):
                                                    /lib64/libc.so.6(epoll_wait+0x33): :?
                                                    /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0xd0): maxutils/maxbase/src/worker.cc:795
                                                    /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53): maxutils/maxbase/src/worker.cc:559
                                                    /usr/bin/maxscale(main+0x2a76): server/core/gateway.cc:2265
                                                    /lib64/libc.so.6(__libc_start_main+0xf5): ??:?
                                                    /usr/bin/maxscale(): ??:?
Nov 6 17:37:28 mxs-blf1 abrt-hook-ccpp: Process 24947 (maxscale) of user 995 killed by SIGABRT - dumping core
Nov 6 17:37:29 mxs-blf1 systemd: maxscale.service: main process exited, code=dumped, status=6/ABRT
Nov 6 17:37:29 mxs-blf1 systemd: Unit maxscale.service entered failed state.
Nov 6 17:37:29 mxs-blf1 systemd: maxscale.service failed.
Nov 6 17:37:29 mxs-blf1 abrt-server: Package 'maxscale' isn't signed with proper key
Nov 6 17:37:29 mxs-blf1 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2019-11-06-17:37:28-24947' exited with 1
Nov 6 17:37:29 mxs-blf1 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2019-11-06-17:37:28-24947'
Nov 6 17:37:29 mxs-blf1 systemd: maxscale.service holdoff time over, scheduling restart.
Nov 6 17:37:29 mxs-blf1 systemd: Stopped MariaDB MaxScale Database Proxy.
Nov 6 17:37:29 mxs-blf1 systemd: Starting MariaDB MaxScale Database Proxy..

Comment by DBA666 [ 2019-11-07 ]

Added the watchdog timeout back into the service definition with version 2.4.3. Confirmed the same issue remains; the service never recovers. After removing the watchdog timeout again, everything is operational.

Nov 07 09:18:49 maxscale1 systemd[1]: maxscale.service watchdog timeout (limit 1min)!
Nov 07 09:18:49 maxscale1 maxscale[136495]: Fatal: MaxScale 2.4.3 received fatal signal 6. Commit ID: b33ef98f6c26b71e3cc9ea44b398776d51b35664 System name: Linux Release string: NAME="CentOS Linux"
Nov 07 09:18:49 maxscale1 maxscale[136495]:
                                                     /lib64/libc.so.6(epoll_wait+0x33): :?
                                                     /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0xd0): maxutils/maxbase/src/worker.cc:795
                                                     /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x53): maxutils/maxbase/src/worker.cc:559
                                                     /usr/bin/maxscale(main+0x2a76): server/core/gateway.cc:2265
                                                     /lib64/libc.so.6(__libc_start_main+0xf5): ??:?
                                                     /usr/bin/maxscale(): ??:?

I'll create a new issue for slow startup of CDC requests to avrorouter.

Comment by Caleb Terry [ 2019-11-07 ]

Thanks DBA666. Commenting out the WatchdogSec parameter in /lib/systemd/system/maxscale.service resolved this issue for me.
The steps are:
1. Comment out the line so it reads "#WatchdogSec=60s"
2. systemctl daemon-reload
3. systemctl restart maxscale

Our issue has been resolved after these steps.
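As a side note, editing the unit file under /lib/systemd/system directly means a package upgrade can silently restore the watchdog. The same workaround can be sketched as a drop-in override instead, which survives upgrades; per systemd's documentation, WatchdogSec=0 disables the watchdog logic:

```
# /etc/systemd/system/maxscale.service.d/override.conf
# (created e.g. via "systemctl edit maxscale"; a sketch of the same
# workaround as a drop-in rather than an edit of the packaged unit)
[Service]
WatchdogSec=0
```

After writing the drop-in, apply it with "systemctl daemon-reload" followed by "systemctl restart maxscale", as in the steps above.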

Comment by DBA666 [ 2019-11-07 ]

It's not resolved as such. The service still hangs during all CDC operations, which renders Maxscale completely unusable for everything else, including the keepalived checks that query maxctrl.

Yes, the workaround does make the system function from a CDC perspective, but it feels dirty to remove the systemd protection that would restart the service on failure.

Comment by markus makela [ 2020-11-24 ]

This might be fixed in 2.5 but I'll have to check to be sure.

Comment by markus makela [ 2021-08-02 ]

DBA666 any updates on this issue? It's been open for quite some time and there have been multiple releases of both 2.4 and 2.5 that you could try. If you're not able to reproduce this or don't have the capability to do so at this moment, I think I'll close this issue as Cannot Reproduce. This way I can reopen the issue once we know for sure that the problem still remains.
