[MXS-415] MaxScale 1.2.1 crashed with Signal 6 and 11 Created: 2015-10-18  Updated: 2015-11-24  Resolved: 2015-11-24

Status: Closed
Project: MariaDB MaxScale
Component/s: readconnroute
Affects Version/s: 1.2.1
Fix Version/s: 1.3.0

Type: Bug Priority: Blocker
Reporter: Alex Vladulescu Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

Debian 7.9, VM under KVM, 8 vCPUs/8GB RAM (designed sql LB type) - deb package involved : maxscale-1.2.1-1.deb_wheezy.x86_64.deb


Attachments: Text File maxscale-error.log, Text File maxscale-errorlog-second-crash.txt, File maxscale.cnf
Issue Links:
Relates
relates to MXS-329 The session pointer in a DCB can be n... Closed

 Description   

Hello guys,

At a quick search the issue seems very similar to MXS-337.

The LB has been in production for approximately two weeks, during which there have been no outages or issues (on either the MySQL or the MaxScale side) attributable to it. The VM runs the LB alone, with no other services installed or running, and its usage is as follows:

07:17:41 up 6 days, 3:59, 1 user, load average: 0.07, 0.05, 0.05

                    total   used   free  shared  buffers  cached
Mem:                 8007    459   7548       0      126     154
-/+ buffers/cache:            177   7829
Swap:                 879      0    879

The LB is configured to connect to 4 nodes running Galera 3.12.1.wheezy and Percona XtraDB Cluster 5.6.25-25.12-1.wheezy, both amd64.

Any clues as to why the service died?
I will keep monitoring to see whether the issue comes back.

Update: in less than 24 hours the issue has repeated. I have attached the log with the latest backtrace in addition to the initial one.



 Comments   
Comment by Alex Vladulescu [ 2015-10-19 ]

After spending some hours checking the issue list for the errors reported in the logs I provided, I came across the following two reports:

As I understand it (I am not a programmer), this is related to the Descriptor Control Block (DCB) being NULL in some cases.

Could someone please advise, or provide some helpful links/info, on what exactly the DCBs are and what causes this NULL value to be set (a bad connection? bad coding in the user app? bad SQL queries?). I have scoured all the information I could find on the project's official Git page, but nothing explains the DCB's characteristics in detail.

Also, I installed MaxScale via the official page's download link (.deb), but comparing this to the build-from-source page on Git, I saw a vague reference to the MariaDB embedded library, noting that on versions prior to Debian 8 the libc6 version should be built and installed at a higher version than the one officially provided by the Debian repositories.

My current libc6 version is 2.13-38+deb7u8, but since I managed to install the MaxScale .deb package on the system without any errors, I suppose it was built against this version.

Please somebody correct me if I am wrong.

Thank you.

Comment by markus makela [ 2015-10-19 ]

The DCB is an abstraction of a network connection used inside MaxScale. A DCB should never be freed before all threads have finished processing requests for it. There is ongoing work in the MXS-329 branch on GitHub that aims to fix this problem. The reason a DCB is set to NULL while still in use is most likely related to a server going down.

As a user, you can help by giving information about what happened when the crash occurred. Did a server go down? Was there a specific load that was going through MaxScale?

The embedded library and libc version is only an issue when you are building MaxScale from source since the required code is embedded into MaxScale at build time.

Comment by Alex Vladulescu [ 2015-10-19 ]

Markus, thank you for clarifying the DCB for me.

Yes, I can provide you more information about the environment.

This is a production environment; it runs over KVM on a very good hardware base (2 x 2630v3, 256 GB RAM, 12 x 960 GB enterprise-class SSDs and an H710 RAID controller with BBU, configured as RAID 10; disk throughput peaks at ~3 GB/sec). All the VMs are spread across 2 servers with the same characteristics. Hypervisor RAM usage is ~110 GB.

The virtualization storage for the VMs is configured locally, so the network issues possible when running remote storage protocols like NFS or iSCSI are out of the question.

Generally only 12 VMs run on each hypervisor, and the average load is around 20-25%, so I am not running the environment on the edge of exhaustion.

Checking the NMS, I couldn't see any strange patterns in CPU, memory, disk usage or MySQL statistics on the Galera database servers, the DB load balancer or the hypervisor itself.
On the other hand, there are alerts configured for when the DB servers are offline or TCP port 3306 is not responding, and none of them were triggered when MaxScale crashed.

If you are referring to the moment in the first attached error log when the databases became unavailable one by one, that was when I was taking each Galera DB node down for a RAM upgrade on each VM, so that is a known event.

Another important aspect to mention is that I had this setup tested and working fine before going into production, but there was no real load on it then. If it helps, peak network traffic on the LB is ~10 Mbps (DB traffic between hosts is on a separate virtual network on all hosts, so it does not overlap with other types of traffic).

I have also checked at the VM and hypervisor level for any kernel panics, crashes or KVM KSM issues, but none were reported to the console or via syslog.

For the moment, I have quickly configured monit to restart (actually start) MaxScale in case of any further crash.
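For reference, a minimal monit stanza along these lines can watch the process (the pid file and init script paths below are assumptions; adjust them to your installation):

# /etc/monit/conf.d/maxscale -- hypothetical paths, adjust as needed
check process maxscale with pidfile /var/run/maxscale/maxscale.pid
  start program = "/etc/init.d/maxscale start"
  stop program  = "/etc/init.d/maxscale stop"
  if 5 restarts within 5 cycles then timeout

With a start program defined, monit restarts the process automatically whenever it finds it not running; the timeout rule just stops it from flapping indefinitely.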

I would be glad to provide any other useful information for this if needed.

Thank you.

Comment by markus makela [ 2015-10-19 ]

Thank you for the information. This does seem like a problem the MXS-329 branch is trying to solve. I will link this issue to the relevant task and inform you if any further information would be helpful.

Comment by Alex Vladulescu [ 2015-10-19 ]

Thank you, I will look forward to hearing an update on this.
Keep up the good work!
Keep up the good work !

Comment by Johan Wikman [ 2015-11-24 ]

I will close this now since the expectation is that MXS-329 fixes this and we are starting to wrap things up for 1.3.

If this can be repeated on 1.3, then we will reopen this or create a new bug.

Generated at Thu Feb 08 03:59:07 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.