[MXS-2753] MXS crash on cdc stream request Created: 2019-11-06 Updated: 2021-08-10 Resolved: 2021-08-10 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | avrorouter, binlogrouter, cdc |
| Affects Version/s: | 2.4.2, 2.4.3 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Major |
| Reporter: | DBA666 | Assignee: | markus makela |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None | ||
| Environment: |
CentOS7, 8 core Xeon, 8GB RAM. |
||
| Description |
|
Currently seeing MaxScale hang, fail, and then restart via the systemd watchdog whenever a CDC request is made. This happens both via a local cdc.py request and from an external server running mxs_adapter. The service had been running for around 6 hours without failure, successfully handling roughly 1k new rows into ColumnStore every 10 seconds. Nothing changed in the configuration of mxs_adapter or MaxScale between the running state and the failed state. On the day the failure occurred, an update from 2.4.2 to 2.4.3 was performed. Additionally, following the advice in https://jira.mariadb.org/browse/MXS-964, the router_options entry was added to the avro-router. On failure, the MaxScale service outputs this...
And here is the relevant section from Maxscale config file:
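The config section itself was not captured in this export. As a rough orientation only, a minimal avrorouter service with a CDC listener in MaxScale 2.4 typically looks like the sketch below; the service and listener names, the port, and the `source` reference are illustrative assumptions, not values from this report, and the specific router_options value recommended in MXS-964 is deliberately omitted.

```ini
# Hypothetical MaxScale 2.4 avrorouter + CDC listener sketch.
# Names and port are placeholders, not taken from the ticket.
[avro-router]
type=service
router=avrorouter
# 'source' points the avrorouter at an existing binlogrouter service
# that provides the binlog files to convert to Avro.
source=replication-router

[cdc-listener]
type=listener
service=avro-router
protocol=CDC
port=4001
```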
|
| Comments |
| Comment by DBA666 [ 2019-11-06 ] |
|
Removing the 60-second watchdog from the service definition has allowed the mxs_adapter to resume. At startup of mxs_adapter, MaxScale hangs for around 2 minutes while the avrorouter locates the correct GTID position within the large file before streaming can recommence. Is there some way to instruct MaxScale to start a new Avro file to avoid this long delay on startup? As a side issue, because the MaxScale service hangs during this time, maxctrl is unavailable, resulting in a timeout from the CLI. For us this means our keepalived probe fails and triggers a failover to another node. Not super critical for this setup, but worth mentioning all the same. |
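For anyone applying the same workaround: rather than editing the packaged unit file in place, a systemd drop-in can override the watchdog and survives package upgrades. This is a sketch assuming the standard `maxscale.service` unit name; per the systemd documentation, `WatchdogSec=0` disables watchdog supervision for the unit.

```ini
# /etc/systemd/system/maxscale.service.d/override.conf
# Drop-in override; apply with:
#   systemctl daemon-reload && systemctl restart maxscale
[Service]
# 0 disables the software watchdog entirely (the workaround
# described in this comment, expressed as an override).
WatchdogSec=0
```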
| Comment by markus makela [ 2019-11-06 ] |
|
There should be a fix in 2.4.3. Also, please open another bug for the slow startup of CDC requests. |
| Comment by Caleb Terry [ 2019-11-06 ] |
|
Having a very similar issue in our setup. The issue started yesterday at 7:53 AM CST, running on CentOS 7.6.1810 with MaxScale 2.3.8-1. We upgraded to 2.3.13-1 and the issue persisted, so we upgraded to 2.4.3 and it still persists. 2.3.8-1 had been working for months in our environment and we haven't made any config changes. The output is very similar to the original upload: |
| Comment by DBA666 [ 2019-11-07 ] |
|
Added the watchdog timeout back into the service definition with version 2.4.3 and confirmed the same issue remains; the service never recovers. After removing the watchdog timeout again, everything is operational.
I'll create a new issue for the slow startup of CDC requests to the avrorouter. |
| Comment by Caleb Terry [ 2019-11-07 ] |
|
Thanks DBA666. Commenting out the WatchdogSec parameter in /lib/systemd/system/maxscale.service resolved the issue for me after those steps. |
| Comment by DBA666 [ 2019-11-07 ] |
|
It's not resolved as such. The service still hangs during all CDC operations, which renders MaxScale completely unusable for everything else, including the keepalived checks that query maxctrl. Yes, the workaround does make the system function from a CDC perspective, but it feels dirty to remove the systemd protection that would restart the service on failure. |
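One way to keep the keepalived probe from failing over while MaxScale is blocked seeking the GTID position is to give the check a generous timeout. The actual probe from this setup is not shown in the ticket; the fragment below is a hypothetical `vrrp_script` sketch, using maxctrl's `--timeout` option (in milliseconds) and keepalived's own `timeout`/`fall`/`rise` tuning.

```ini
# Hypothetical keepalived check, not the probe from this setup.
vrrp_script chk_maxscale {
    # maxctrl --timeout is in milliseconds; here 10 s per attempt
    script "/usr/bin/maxctrl --timeout 10000 list servers"
    interval 5   # run every 5 s
    timeout 15   # keepalived kills the check after 15 s
    fall 3       # require 3 consecutive failures before failover
    rise 2       # require 2 successes to mark healthy again
}
```

This only papers over the hang; with a long enough stall the check still fails, so it complements rather than replaces fixing the blocking behaviour in the avrorouter.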
| Comment by markus makela [ 2020-11-24 ] |
|
This might be fixed in 2.5 but I'll have to check to be sure. | |||||||||
| Comment by markus makela [ 2021-08-02 ] |
|
DBA666, any updates on this issue? It's been open for quite some time, and there have been multiple releases of both 2.4 and 2.5 that you could try. If you're not able to reproduce this, or don't have the capability to do so at the moment, I'll close this issue as Cannot Reproduce. That way I can reopen it once we know for sure that the problem still remains. |