[MXS-2371] Maxscale crashed after losing connection to master database Created: 2019-03-07  Updated: 2019-10-18  Resolved: 2019-10-18

Status: Closed
Project: MariaDB MaxScale
Component/s: failover
Affects Version/s: 2.3.4
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Jeffrey Parker Assignee: markus makela
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Centos 7, Percona Mysql backend, PHP front end


Attachments: File CCRFilter.cnf     File Hint.cnf     File QLAProd.cnf     File QLASMP01.cnf     File Read-Write-Service.cnf     File Top.cnf     File TopSmp01.cnf     File dbm-lock-filter.cnf     File dnr-filter.cnf     File engine-injector-recipients-filter.cnf     File maxscale.cnf    

 Description   

MaxScale lost connection to the master server and promptly crashed. The relevant log lines are below.

Mar  7 03:10:17 maxscale-2 maxscale[15414]: Monitor timed out when connecting to server db-master[192.168.50.46:3306] : 'Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 110'
Mar  7 03:10:17 maxscale-2 maxscale[15414]: Server changed state: db-master[192.168.50.46:3306]: master_down. [Master, Running] -> [Down]
Mar  7 03:10:18 maxscale-2 maxscale[15414]: Server changed state: db-master[192.168.50.46:3306]: master_up. [Down] -> [Master, Running]
Mar  7 03:10:18 maxscale-2 maxscale[15414]: Fatal: MaxScale 2.3.4 received fatal signal 11. Attempting backtrace.
Mar  7 03:10:18 maxscale-2 maxscale[15414]: Commit ID: aea64aede280558ca6b55500dfa7eb049ec9c377 System name: Linux Release string: CentOS Linux release 7.6.1810 (Core)
Mar  7 03:10:18 maxscale-2 maxscale[15414]:  /usr/bin/maxscale(_ZN7maxbase15dump_stacktraceESt8functionIFvPKcS2_EE+0x2b) [0x40cbab]: /home/vagrant/MaxScale/maxutils/maxbase/src/stacktrace.cc:130
Mar  7 03:10:18 maxscale-2 maxscale[15414]:  /usr/bin/maxscale(_ZN7maxbase15dump_stacktraceEPFvPKcS1_E+0x4e) [0x40cf0e]: /usr/include/c++/4.8.2/functional:2029
Mar  7 03:10:18 maxscale-2 maxscale[15414]:  /usr/bin/maxscale() [0x4095b9]: ??:0
Mar  7 03:10:18 maxscale-2 maxscale[15414]:  /lib64/libpthread.so.0(+0xf5d0) [0x7f0653b925d0]: sigaction.c:?
Mar  7 03:10:18 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libreadwritesplit.so(_ZN14RWSplitSession17handle_got_targetEP5gwbufRSt10shared_ptrIN8maxscale9RWBackendEEb+0x9d) [0x7f064747512d]: /home/vagrant/MaxScale/include/maxscale/protocol/mysql.h:629
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libreadwritesplit.so(_ZN14RWSplitSession17route_single_stmtEP5gwbuf+0xa05) [0x7f0647476115]: /home/vagrant/MaxScale/server/modules/routing/readwritesplit/rwsplit_route_stmt.cc:328
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libreadwritesplit.so(_ZN14RWSplitSession10routeQueryEP5gwbuf+0x1bc) [0x7f064746d71c]: /home/vagrant/MaxScale/server/modules/routing/readwritesplit/rwsplitsession.cc:199
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libreadwritesplit.so(_ZN8maxscale6RouterI7RWSplit14RWSplitSessionE10routeQueryEP10mxs_routerP18mxs_router_sessionP5gwbuf+0x1e) [0x7f064746b7ae]: /home/vagrant/MaxScale/include/maxscale/router.hh:181
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0xcbf9f) [0x7f065430df9f]: /home/vagrant/MaxScale/server/core/session.cc:1115
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker4tickEv+0xe6) [0x7f065431b426]: /home/vagrant/MaxScale/maxutils/maxbase/include/maxbase/worker.hh:777
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase11WorkerTimer6handleEPNS_6WorkerEj+0x36) [0x7f0654319bc6]: /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:256
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x1b5) [0x7f065431a5c5]: /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:844
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x51) [0x7f065431a7c1]: /home/vagrant/MaxScale/maxutils/maxbase/src/worker.cc:545
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /lib64/libstdc++.so.6(+0xb5070) [0x7f0652bc6070]: ??:?
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /lib64/libpthread.so.0(+0x7dd5) [0x7f0653b8add5]: pthread_create.c:?
Mar  7 03:10:19 maxscale-2 maxscale[15414]:  /lib64/libc.so.6(clone+0x6d) [0x7f0651b6aead]: ??:?
Mar  7 03:10:19 maxscale-2 abrt-hook-ccpp: Process 15414 (maxscale) of user 995 killed by SIGSEGV - dumping core
Mar  7 03:10:28 maxscale-2 abrt-hook-ccpp: Failed to create core_backtrace: waitpid failed: No child processes
Mar  7 03:10:29 maxscale-2 systemd: maxscale.service: main process exited, code=killed, status=11/SEGV
Mar  7 03:10:29 maxscale-2 systemd: Unit maxscale.service entered failed state.
Mar  7 03:10:29 maxscale-2 systemd: maxscale.service failed.



 Comments   
Comment by Jeffrey Parker [ 2019-03-07 ]

Unfortunately, this only happened under a load that I cannot reproduce outside our production environment, so I do not have simple reproduction steps. The core dump also does not exist, because dumping it failed due to a signing issue.

Comment by markus makela [ 2019-03-07 ]

Please add the base MaxScale configuration, plus any generated configurations from /var/lib/maxscale/maxscale.cnf.d/, to the issue. Please also remove any sensitive data from the configuration files, such as usernames, passwords, and IP addresses.

A first look at the stacktrace suggests either a null buffer or a pointer to a freed buffer.

Comment by Jeffrey Parker [ 2019-03-07 ]

Configs attached.

Comment by markus makela [ 2019-03-12 ]

It seems that the readwritesplit service has a lot of filters. Can you try removing the filters and see if the crash happens again? If it doesn't, add the filters back one at a time to find out which one causes the problem.

Comment by Jeffrey Parker [ 2019-03-12 ]

I would really love to help you out with this, but we can't do that: we have only been able to trigger the crash with our full production load going through MaxScale, and when it crashes we get roughly an hour of complete downtime while we recover. Any testing we have done has either not put enough load on it, has not run long enough, or has not run the right queries to cause the issue. We also added the filters to make sure the software worked when running through MaxScale, and while there may be some redundancy, we need all of the filters in place except for the Top and QLA ones. I will point out that the MaxScale system was nowhere near a high load: very low CPU usage, no high memory usage, and no disk I/O to speak of.

Comment by markus makela [ 2019-03-13 ]

OK, that's understandable. We'll continue our efforts to try and reproduce this on our environment.

Comment by markus makela [ 2019-03-28 ]

I've been testing with the combined configuration and haven't been able to reproduce the crash. I'll see if I can put this configuration under some heavy testing and see if that helps.

Comment by markus makela [ 2019-10-18 ]

We were never able to reproduce this error, which is why I'm closing this as Cannot Reproduce. If it still happens with the latest release, please let us know and we'll reopen the issue.

Generated at Thu Feb 08 04:13:38 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.