[MXS-2446] Fatal error on MaxScale server when reimaging a monitored Clustrix setup. Created: 2019-04-23  Updated: 2019-07-04  Resolved: 2019-07-03

Status: Closed
Project: MariaDB MaxScale
Component/s: xpandmon
Affects Version/s: None
Fix Version/s: 2.4.1

Type: Bug Priority: Major
Reporter: Rahul Joshi (Inactive) Assignee: Johan Wikman
Resolution: Fixed Votes: 0
Labels: None
Environment:

MaxScale server karma172:
OS: CentOS 7
Version: MaxScale 2.3.5 Commit: ab96606269d6526bb20c1b8585edb0a242408320
Clustrix nodes:
OS: RHEL 7 / CentOS 7
Version: Fred


Sprint: MXS-SPRINT-81, MXS-SPRINT-85

 Description   

I’m seeing a Fatal on MaxScale:

2019-04-20 03:31:31   notice : (40000201) [MySQLAuth] [GluttonRCR-vqc007c] No users were loaded but 'inject_service_user' is enabled. Enabling service credentials for authentication until database users have been successfully loaded.
2019-04-20 03:31:31   error  : [clustrixmon] Clustrix: Could not execute 'SELECT ni.nodeid, ni.iface_ip, ni.mysql_port, ni.healthmon_port, sn.nodeid FROM system.nodeinfo AS ni LEFT JOIN system.softfailed_nodes AS sn ON ni.nodeid = sn.nodeid' on 10.2.15.13: [11280] Permission denied: User 'maxscale'@'%' is missing SELECT on `system`.`softfailed_nodes`; is missing SELECT on `system`.`nodeinfo`.
2019-04-20 03:31:31   alert  : Fatal: MaxScale 2.3.5 received fatal signal 11. Attempting backtrace.
2019-04-20 03:31:31   alert  : Commit ID: ab96606269d6526bb20c1b8585edb0a242408320 System name: Linux Release string: CentOS Linux release 7.6.1810 (Core)
2019-04-20 03:31:31   notice : (40000202) [MySQLAuth] [GluttonRCR-vqc007c] No users were loaded but 'inject_service_user' is enabled. Enabling service credentials for authentication until database users have been successfully loaded.

I had a working setup with MaxScale and Clustrix. I then reimaged the OS (from RHEL 7 to CentOS 7) on all Clustrix nodes to upgrade from the Fred release to 9.1.4, reinstalled Clustrix, and created the Clustrix group again. The maxscale user had not yet been created in Clustrix.
I believe that is when this fatal error occurred and the MaxScale service stopped. The third setup monitored by the same MaxScale server had been running fine, but its monitoring also stopped because the MaxScale service crashed.
This may not be a very frequent scenario, but I did not expect a fatal crash.
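The backtrace below suggests a timer-driven health check updating a server object that no longer exists once the cluster nodes were reimaged, so the pointer it dereferences is null. A minimal sketch of that failure mode and a defensive guard (the names `Node`, `pServer`, and `set_status` are invented for illustration; this is not MaxScale's actual code):

```cpp
#include <cstdint>

// Hypothetical stand-in for the monitored server object.
struct Server { uint64_t status = 0; };

struct Node {
    Server* pServer = nullptr; // becomes null when the node is dropped

    // Defensive variant: skip the status update when the server is gone.
    // Without the null check, a timer tick that fires after the node has
    // vanished dereferences a null pointer and raises SIGSEGV.
    bool set_status(uint64_t bits)
    {
        if (!pServer)
        {
            return false; // node no longer backed by a server object
        }
        pServer->status |= bits;
        return true;
    }
};
```

This is only a sketch of the suspected mechanism under the assumption that the monitor holds per-node server pointers that can go stale during a reimage.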

Config file:

[root@karma172 log]# cat /etc/clustrix_allsetups_ssl.cnf
[maxscale]
#log_info=1
logdir=/data/clustrix/log
threads=auto
 
######## Glutton karma067 #######
 
[Bootstrap1]
type=server
address=10.2.14.123
port=3306
protocol=mariadbbackend
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
#karma067
 
# [Bootstrap2]
# type=server
# address=10.2.224.102
# port=3306
# protocol=mariadbbackend
 
# [Bootstrap3]
# type=server
# address=10.2.224.103
# port=3306
# protocol=mariadbbackend
 
[Clustrix]
type=monitor
module=clustrixmon
servers=Bootstrap1
user=maxscale
password=maxscale_pw
cluster_monitor_interval=10000
 
[RCR]
type=service
router=readconnroute
user=maxscale
password=maxscale_pw
cluster=Clustrix
 
[RCR-Listener]
type=listener
service=RCR
protocol=MariaDBClient
#address=127.0.0.1
port=4008
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
 
######## Dlong #######
 
[Dlong1]
type=server
address=10.2.15.83
port=3306
protocol=mariadbbackend
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
#karma031
# can't enable SSL until 33424 fixed.
 
[Dlong2]
type=server
address=10.2.15.121
port=3306
protocol=mariadbbackend
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
#karma044
 
[Dlong3]
type=server
address=10.2.15.114
port=3306
protocol=mariadbbackend
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
# karma006
 
[Clustrix-Dlong]
type=monitor
module=clustrixmon
servers=Dlong1,Dlong2,Dlong3
user=maxscale
password=maxscale_pw
cluster_monitor_interval=10000
 
[RCR-Dlong]
type=service
router=readconnroute
user=maxscale
password=maxscale_pw
cluster=Clustrix-Dlong
 
[RCR-Listener-Dlong]
type=listener
service=RCR-Dlong
protocol=MariaDBClient
#address=127.0.0.1
port=4009
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
 
######## Glutton karma050 #######
 
[vqc007c-karma050]
type=server
address=10.2.15.127
port=3306
protocol=mariadbbackend
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
#karma050
 
[vqc007c-karma058]
type=server
address=10.2.15.147
port=3306
protocol=mariadbbackend
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
 
[vqc007c-karma073]
type=server
address=10.2.12.164
port=3306
protocol=mariadbbackend
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
 
[Gluttonmon-vqc007c]
type=monitor
module=clustrixmon
servers=vqc007c-karma050,vqc007c-karma058,vqc007c-karma073
user=maxscale
password=maxscale_pw
cluster_monitor_interval=10000
 
[GluttonRCR-vqc007c]
type=service
router=readconnroute
user=maxscale
password=maxscale_pw
cluster=Gluttonmon-vqc007c
 
[GluttonRCR-Listener-vqc007c]
type=listener
service=GluttonRCR-vqc007c
protocol=MariaDBClient
#address=127.0.0.1
port=4010
ssl=required
#ssl_version=TLSv10
ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
ssl_key=/etc/my.cnf.d/certs/client-key.pem
ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
 
######## Glutton karma118 vqc005c #######
[vqc005c-karma118]
type=server
address=10.2.15.67
port=3306
protocol=mariadbbackend
#ssl=required
#ssl_version=TLSv10
#ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
#ssl_key=/etc/my.cnf.d/certs/client-key.pem
#ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
#karma118
 
[vqc005c-karma148]
type=server
address=10.2.15.193
port=3306
protocol=mariadbbackend
 
[vqc005c-karma110]
type=server
address=10.2.15.51
port=3306
protocol=mariadbbackend
 
[Gluttonmon-vqc005c]
type=monitor
module=clustrixmon
servers=vqc005c-karma118,vqc005c-karma148,vqc005c-karma110
user=maxscale
password=maxscale_pw
cluster_monitor_interval=10000
 
[GluttonRCR-vqc005c]
type=service
router=readconnroute
user=maxscale
password=maxscale_pw
cluster=Gluttonmon-vqc005c
 
[GluttonRCR-Listener-vqc005c]
type=listener
service=GluttonRCR-vqc005c
protocol=MariaDBClient
#address=127.0.0.1
port=4011
#ssl=required
#ssl_version=TLSv10
#ssl_cert=/etc/my.cnf.d/certs/client-cert.pem
#ssl_key=/etc/my.cnf.d/certs/client-key.pem
#ssl_ca_cert=/etc/my.cnf.d/certs/ca-cert.pem
 
 
######################
[MaxAdmin-Service]
type=service
router=cli
 
[MaxAdmin-Unix-Listener]
type=listener
service=MaxAdmin-Service
protocol=maxscaled
socket=default



 Comments   
Comment by markus makela [ 2019-04-23 ]

Can you provide the full stacktrace (all lines logged at alert level)?

Comment by Rahul Joshi (Inactive) [ 2019-04-23 ]

This is what I could find out:

[root@karma172 log]# grep alert maxscale.log
2019-04-20 03:31:31   alert  : Fatal: MaxScale 2.3.5 received fatal signal 11. Attempting backtrace.
2019-04-20 03:31:31   alert  : Commit ID: ab96606269d6526bb20c1b8585edb0a242408320 System name: Linux Release string: CentOS Linux release 7.6.1810 (Core)
2019-04-20 03:31:31   alert  :   maxscale(_ZN7maxbase15dump_stacktraceESt8functionIFvPKcS2_EE+0x2b) [0x40cf1b]:
2019-04-20 03:31:31   alert  :   maxscale(_ZN7maxbase15dump_stacktraceEPFvPKcS1_E+0x4e) [0x40d27e]:
2019-04-20 03:31:31   alert  :   maxscale() [0x409769]:
2019-04-20 03:31:31   alert  :   /lib64/libpthread.so.0(+0xf5d0) [0x7f25073c45d0]: sigaction.c:?
2019-04-20 03:31:31   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN6SERVER10set_statusEm+0x1) [0x7f2507db4da1]: /root/MaxScale/server/core/server.cc:705
2019-04-20 03:31:31   alert  :   /usr/lib64/maxscale/libclustrixmon.so(_ZN15ClustrixMonitor10check_httpEN7maxbase6Worker4Call8action_tE+0xfc) [0x7f24ff1a36dc]: /root/MaxScale/server/modules/monitor/clustrixmon/clustrixnode.hh:128
2019-04-20 03:31:31   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker4tickEv+0xe6) [0x7f2507dd4256]: /root/MaxScale/maxutils/maxbase/include/maxbase/worker.hh:783
2019-04-20 03:31:31   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase11WorkerTimer6handleEPNS_6WorkerEj+0x36) [0x7f2507dd29b6]: /root/MaxScale/maxutils/maxbase/src/worker.cc:256
2019-04-20 03:31:31   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x1b5) [0x7f2507dd33e5]: /root/MaxScale/maxutils/maxbase/src/worker.cc:848
2019-04-20 03:31:31   alert  :   /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x51) [0x7f2507dd35e1]: /root/MaxScale/maxutils/maxbase/src/worker.cc:549
2019-04-20 03:31:31   alert  :   /lib64/libstdc++.so.6(+0xb5070) [0x7f25063f7070]: ??:?
2019-04-20 03:31:31   alert  :   /lib64/libpthread.so.0(+0x7dd5) [0x7f25073bcdd5]: pthread_create.c:?
2019-04-20 03:31:31   alert  :   /lib64/libc.so.6(clone+0x6d) [0x7f250539bead]: ??:?

Comment by Johan Wikman [ 2019-04-30 ]

rahul.joshi@mariadb.com Was MaxScale built in Debug or Release mode?

Since the crash happens at server.cc:705, the pointer to the server object at that point must be NULL or garbage. The object calling that method is assumed to be created in a context where the pointer to the server object is valid; if that were not the case, an assertion would have been triggered. However, that assertion is active only in Debug mode.
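This is why the build mode matters: `assert`-style checks compile to nothing when `NDEBUG` is defined, which is the case in Release builds. A minimal sketch of why the null pointer reaches the dereference in Release mode (the `MXB_ASSERT` macro and surrounding names are invented here to mirror the pattern, not MaxScale's actual macro):

```cpp
#include <cassert>

// Debug-only assertion: in a Release build (NDEBUG defined) the check
// compiles to a no-op, so nothing stops a bad pointer.
#ifdef NDEBUG
#define MXB_ASSERT(x) ((void)0)
#else
#define MXB_ASSERT(x) assert(x)
#endif

struct Server { unsigned status = 0; };

// Sketch of the failing pattern: the caller is assumed to pass a valid
// pointer, and only a Debug build verifies that assumption.
unsigned set_status(Server* pServer, unsigned bit)
{
    MXB_ASSERT(pServer != nullptr); // no-op in Release: a null pointer
                                    // falls through to the dereference
                                    // below and raises SIGSEGV
    return pServer->status |= bit;
}
```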

Comment by Rahul Joshi (Inactive) [ 2019-04-30 ]

Hi johan.wikman ,

I built MaxScale from source (develop branch) and did not specify any debug flags.

I used:
git clone https://github.com/mariadb-corporation/MaxScale
cd MaxScale/
git checkout develop
git status
mkdir ../build
cd ../build
../MaxScale/BUILD/install_build_deps.sh
cmake ../MaxScale -DCMAKE_INSTALL_PREFIX=/usr
make
sudo make install
sudo ./postinst

Thanks and Regards,
--Rahul

Comment by Johan Wikman [ 2019-05-03 ]

Ok, then that means that the assertion would not have been triggered even if the condition was fulfilled.

Comment by Johan Wikman [ 2019-05-06 ]

This could not be reproduced, but the only plausible cause is now detected and dealt with. Please reopen or create a new issue if it reoccurs.

Comment by Rahul Joshi (Inactive) [ 2019-06-12 ]

Okay, it happened again on a build that has the fix:
2019-06-11 22:45:43 alert : Fatal: MaxScale 2.4.0 received fatal signal 11. Attempting backtrace.
2019-06-11 22:45:43 alert : Commit ID: e0c5791a6e80120d54f6f4106188583dd3aab175 System name: Linux Release string: CentOS Linux release 7.6.1810 (Core)
2019-06-11 22:45:43 alert : maxscale(_ZN7maxbase15dump_stacktraceESt8functionIFvPKcS2_EE+0x2b) [0x40ce8b]:
2019-06-11 22:45:43 alert : maxscale(_ZN7maxbase15dump_stacktraceEPFvPKcS1_E+0x4e) [0x40d1ee]:
2019-06-11 22:45:43 alert : maxscale() [0x4096f9]:
2019-06-11 22:45:43 alert : /lib64/libpthread.so.0(+0xf5d0) [0x7f3d4abb65d0]: sigaction.c:?
2019-06-11 22:45:43 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN6SERVER10set_statusEm+0x1) [0x7f3d4b7bee51]: /root/MaxScale/server/core/server.cc:661
2019-06-11 22:45:43 alert : /usr/lib64/maxscale/libclustrixmon.so(_ZN15ClustrixMonitor10check_httpEN7maxbase6Worker4Call8action_tE+0x95) [0x7f3d4255e485]: /root/MaxScale/server/modules/monitor/clustrixmon/clustrixnode.hh:123
2019-06-11 22:45:43 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker4tickEv+0xf0) [0x7f3d4b7dfc70]: /root/MaxScale/maxutils/maxbase/include/maxbase/worker.hh:778
2019-06-11 22:45:43 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase11WorkerTimer6handleEPNS_6WorkerEj+0x36) [0x7f3d4b7de3a6]: /root/MaxScale/maxutils/maxbase/src/worker.cc:256
2019-06-11 22:45:43 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x1b5) [0x7f3d4b7dedd5]: /root/MaxScale/maxutils/maxbase/src/worker.cc:848
2019-06-11 22:45:43 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x51) [0x7f3d4b7defd1]: /root/MaxScale/maxutils/maxbase/src/worker.cc:549
2019-06-11 22:45:43 alert : /lib64/libstdc++.so.6(+0xb5070) [0x7f3d49be9070]: ??:?
2019-06-11 22:45:43 alert : /lib64/libpthread.so.0(+0x7dd5) [0x7f3d4abaedd5]: pthread_create.c:?
2019-06-11 22:45:43 alert : /lib64/libc.so.6(clone+0x6d) [0x7f3d48b8dead]: ??:?
[root@karma172 log]#

This time, I was reimaging the CentOS 7 Glutton setup with 9 nodes.
The config file and all setups are the same as before.

I don't see an option to reopen it. Can someone please reopen this bug for me, or let me know if I should file a new one instead?

Comment by Johan Wikman [ 2019-06-13 ]

rahul.joshi@mariadb.com Would you have a core file?

Comment by Johan Wikman [ 2019-06-13 ]

rahul.joshi@mariadb.com Could you explain in detail, step by step, what you are doing, from the starting setup until the crash? I'm not 100% sure I have fully understood.

Comment by Johan Wikman [ 2019-06-13 ]

rahul.joshi@mariadb.com And would you have the MaxScale log?

Comment by Rahul Joshi (Inactive) [ 2019-06-18 ]

Hi johan.wikman ,

This does not seem to be reproducible at will; I tried to reproduce it again, but could not.
Here is what I had done both times it occurred:
The MaxScale server is monitoring 4 Clustrix setups, as shown in the conf file.
We got a new build and had to reimage the OS on the Clustrix nodes in one of the Clustrix setups.
I reimaged the nodes with CentOS 7 and then reinstalled and reconfigured Clustrix on them. In the meantime, either Clustrix was absent entirely, or individual nodes had Clustrix but no maxscale user, since I was still reimaging and reinstalling/reconfiguring.
At that point there was a fatal error and the MaxScale service stopped.
I tried the same steps with two more setups, but there was no fatal error this time.
Unfortunately, there were no core files. I have now set ulimit -c to unlimited and restarted MaxScale; I'll upload the core files if it happens again.

MaxScale log under:
https://drive.google.com/open?id=11Hhyki_vvIE9jfV2atWNvnh_mUUaDTI9

Thanks and Regards,
--Rahul

Comment by Rahul Joshi (Inactive) [ 2019-07-02 ]

It finally happened again after multiple setups were reimaged over time. The configuration is similar to last time, except that no SSL was used.
This time I reimaged the vqc005c-glutton setup (karma118 and friends in the cluster), and the MaxScale service crashed with a core dump. The dump, config file, and log are accessible under:
https://drive.google.com/open?id=1kkophb_K286CioDEbpeKgfTJhtS3SgqK

[root@karma172 log]# date; grep -i "2019-07-02.*alert" maxscale.log
Tue Jul 2 21:34:03 UTC 2019
2019-07-02 20:22:46 alert : Fatal: MaxScale 2.4.0 received fatal signal 11. Attempting backtrace.
2019-07-02 20:22:46 alert : Commit ID: e0c5791a6e80120d54f6f4106188583dd3aab175 System name: Linux Release string: CentOS Linux release 7.6.1810 (Core)
2019-07-02 20:22:47 alert : maxscale(_ZN7maxbase15dump_stacktraceESt8functionIFvPKcS2_EE+0x2b) [0x40ce8b]:
2019-07-02 20:22:47 alert : maxscale(_ZN7maxbase15dump_stacktraceEPFvPKcS1_E+0x4e) [0x40d1ee]:
2019-07-02 20:22:47 alert : maxscale() [0x4096f9]:
2019-07-02 20:22:47 alert : /lib64/libpthread.so.0(+0xf5d0) [0x7f845a28d5d0]: sigaction.c:?
2019-07-02 20:22:47 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN6SERVER10set_statusEm+0x1) [0x7f845ae95e51]: /root/MaxScale/server/core/server.cc:661
2019-07-02 20:22:47 alert : /usr/lib64/maxscale/libclustrixmon.so(_ZN15ClustrixMonitor10check_httpEN7maxbase6Worker4Call8action_tE+0x95) [0x7f8451c35485]: /root/MaxScale/server/modules/monitor/clustrixmon/clustrixnode.hh:123
2019-07-02 20:22:47 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker4tickEv+0xf0) [0x7f845aeb6c70]: /root/MaxScale/maxutils/maxbase/include/maxbase/worker.hh:778
2019-07-02 20:22:47 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase11WorkerTimer6handleEPNS_6WorkerEj+0x36) [0x7f845aeb53a6]: /root/MaxScale/maxutils/maxbase/src/worker.cc:256
2019-07-02 20:22:47 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker15poll_waiteventsEv+0x1b5) [0x7f845aeb5dd5]: /root/MaxScale/maxutils/maxbase/src/worker.cc:848
2019-07-02 20:22:47 alert : /usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase6Worker3runEPNS_9SemaphoreE+0x51) [0x7f845aeb5fd1]: /root/MaxScale/maxutils/maxbase/src/worker.cc:549
2019-07-02 20:22:47 alert : /lib64/libstdc++.so.6(+0xb5070) [0x7f84592c0070]: ??:?
2019-07-02 20:22:47 alert : /lib64/libpthread.so.0(+0x7dd5) [0x7f845a285dd5]: pthread_create.c:?
2019-07-02 20:22:47 alert : /lib64/libc.so.6(clone+0x6d) [0x7f8458264ead]: ??:?

Comment by Johan Wikman [ 2019-07-03 ]

rahul.joshi@mariadb.com Thanks! I'm working on this now and I think I know what is happening, although I'm not 100% sure yet.

Generated at Thu Feb 08 04:14:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.