[MDEV-11312] One Galera node hangs on "show global status" Created: 2016-11-18  Updated: 2017-09-18  Resolved: 2017-09-18

Status: Closed
Project: MariaDB Server
Component/s: Galera
Affects Version/s: 10.1.19
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Ján Regeš Assignee: Jan Lindström (Inactive)
Resolution: Cannot Reproduce Votes: 1
Labels: Galera
Environment:

Gentoo Linux, Galera based on 3 nodes


Attachments: Zip Archive MariaDB_hangs_on_status_2016-11-18.zip     File galera_mdev-11312.result     File galera_mdev-11312.test    
Issue Links:
Relates
relates to MDEV-12344 Crash after kill of first hanged SHOW... Closed
Sprint: 10.1.20

 Description   

Hi,

we have one monitoring check that runs SHOW GLOBAL STATUS, and another that runs SHOW STATUS LIKE 'wsrep_local_state'.

Two nodes work properly, but the third hangs on SHOW GLOBAL STATUS. After a few minutes it had accumulated hundreds of hung queries and started returning "Too many connections..." errors.

I have attached a ZIP file with the complete PROCESSLIST and all variables.

Thank you for looking into this.



 Comments   
Comment by Ján Regeš [ 2016-11-18 ]

By the way, I have now discovered an interesting fact...

The mysqld process has been running for 18 hours already (checked with ps -ef | grep mysqld).

But 2.5 hours ago, the log below appeared in the MySQL error.log.

I'll leave it in this state for another 4 hours and then restart the server.

When I check with strace -p MySQL-PID, the database does nothing; just from time to time I see a "Too many connections" trace for the monitoring mysql client.

If it is helpful, I can provide IP-limited SSH access to this node, which is currently hung.

Thank you.

161118 16:04:49 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.1.19-MariaDB
key_buffer_size=67108864
read_buffer_size=2097152
max_used_connections=61
max_threads=252
thread_count=69
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1618979 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x7f8be022b008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f8bdcff6e58 thread_stack 0x80000

Comment by Ján Regeš [ 2016-11-21 ]

Hi,

today the same problem appeared again.

I checked the mysqld process with "strace -p 53000", but it is waiting on something indefinitely.

Process 53000 attached
futex(0x56549946bf84, FUTEX_WAIT_PRIVATE, 3, NULL^CProcess 53000 detached
 <detached ...>

MySQL stops only after "kill -9 53000". After a restart, this node synchronizes its whole state via SST from the second node, and after that it works properly. But tomorrow, or later, the problem will probably appear again.

Thank you for looking into this.

Comment by Ján Regeš [ 2016-11-24 ]

And today the bug appeared on 2 of the 3 nodes (but at different times).

So this bug is very critical for us.

I think the Galera bug appears at the moment when 2 users ask for "show status" at the same time. All status queries hang on "Filling schema table".

In our case, there are 2 users that ask for the status every 2-5 seconds.

First user: "monitoring" - our script that queries and logs the MySQL status every 3-5 seconds (to reconstruct the situation in the "last seconds of a disaster")

Second user: "clustercheck" - our script for haproxy, which asks every Galera node for its status (haproxy needs it for fallback management)

Temporarily, I will disable the "monitoring" user and check whether things improve.
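A probe of the kind the "clustercheck" user runs could be sketched as below. This is a hypothetical reconstruction, not the reporter's actual script; the `MYSQL_HOST` guard and the `galera_verdict` helper are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical clustercheck-style probe (not the reporter's real script).
# Maps wsrep_local_state to an haproxy-friendly verdict:
# state 4 = Synced; anything else means the node should leave rotation.
galera_verdict() {
    case "$1" in
        4) echo "up" ;;
        *) echo "down" ;;
    esac
}

# Live probe, only attempted when a host is configured (placeholder guard).
if [ -n "${MYSQL_HOST:-}" ]; then
    state=$(mysql --host="$MYSQL_HOST" --connect-timeout=3 -N -B \
            -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}')
    echo "node state: $state -> $(galera_verdict "$state")"
fi
```

haproxy would invoke such a script as an external check and drop the node from the backend whenever it reports "down".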

Comment by Sachin Setiya (Inactive) [ 2016-12-06 ]

Hi,

I am not able to reproduce this. I created a 3-node Galera cluster where each node asks for the global status and runs SHOW STATUS LIKE 'wsrep_local_state'. I am still not able to reproduce it. I am attaching an mtr test script, galera_mdev-11312.test, with its result file galera_mdev-11312.result.

Comment by Ján Regeš [ 2016-12-06 ]

Thank you, Sachin, for your feedback and the prepared test.

Unfortunately, this issue is not easily reproducible.

It happens just once every few days, but its impact is very critical and it needs deeper investigation.

Would it be possible, the next time this bug occurs, for me to provide you with an SSH connection to the server so you can check it directly? Or would that not help? Maybe you have some tools or tricks for analyzing a hung database on a live server.

Thanks.

Comment by Sachin Setiya (Inactive) [ 2016-12-06 ]

jan.reges I can try. I am quite new to Galera, but I will still try my best.
Regards
sachin

Comment by Ján Regeš [ 2016-12-06 ]

Sachin, thank you!

By the way, could you check our status/config variables (in the previously attached ZIP)? Maybe you will discover some specific settings that relate to this bug. Maybe it is caused by a misconfiguration on our side...

For example, we use SSL encryption for replication. The cause of this bug is probably on another layer, but maybe there is some relation?

Comment by Ján Regeš [ 2017-01-28 ]

Hi,

a few days ago I updated all 3 nodes to MariaDB 10.1.20, and today the same issue appeared on one node.
Process list is here: http://preview.siteone.cz/janek/2017-01-28_galera-node-lag.png

After that node hangs, one MySQL thread sits at 100% CPU. strace on the PID of this CPU-intensive process/thread does not display any data. There is probably a never-ending loop somewhere, or something like that ;-(

Do you have any tips on how to debug it at the next occurrence and how to gather helpful information for you?

Maybe this will be helpful: here is the output of `strace -f -p [main-mysql-process-pid]` for about 10 seconds:

http://preview.siteone.cz/janek/2017-01-28_galera-node-lag.strace.txt

Thank you for your help.

Comment by Ján Regeš [ 2017-02-03 ]

Today, the same issue occurred with another 3-node MariaDB Galera cluster.

It's strange; I think our setup is quite standard: haproxy loading the Galera state from all nodes at a 2-3 second interval, and external monitoring (Nagios+Munin) loading the Galera state from all nodes at a 10-90 second interval.

This evening I will update the whole Galera cluster from 10.1.20 to 10.1.21. But the release notes do not appear to contain any bugfixes related to this, so this issue will probably also be present in 10.1.21...

The error log does not contain any records from this time. Is there any process/tool I could use to help you debug this issue more deeply?

Comment by Ján Regeš [ 2017-03-17 ]

Hello,

this bug has not happened since I disabled (about 1 month ago) the execution of the commands below, which ran every 5 seconds.

These commands were run and logged on all 3 nodes.

So if you want to simulate this bug, run these commands on all 3 nodes continuously for a few days and the bug will probably appear (just change the count to 86400 and run it every day).

MariaDB/Galera is probably sensitive to this scenario, where status/extended-status is read every second on all 3 nodes.

mysqladmin %credentials% --sleep=1 --count=4 --connect-timeout=3 status
mysqladmin %credentials% --sleep=1 --count=4 --connect-timeout=3 extended-status
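A continuous reproduction run along these lines could be sketched as follows. The node names and the `RUN_REPRO` guard are placeholders, `build_cmd` is a hypothetical helper, and the original `%credentials%` token is deliberately left out rather than filled in:

```shell
#!/bin/sh
# Sketch of the suggested reproduction: query status and extended-status
# once per second on all three nodes, all day (count=86400).
# Node names are placeholders; credentials are intentionally omitted
# (the reporter's %credentials% token would go on each command line).
build_cmd() {
    # Compose the mysqladmin invocation for one node and one subcommand.
    echo "mysqladmin --host=$1 --sleep=1 --count=86400 --connect-timeout=3 $2"
}

# Only hammer real servers when explicitly enabled.
if [ -n "${RUN_REPRO:-}" ]; then
    for node in node1 node2 node3; do
        $(build_cmd "$node" status) &
        $(build_cmd "$node" extended-status) &
    done
    wait
fi
```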

Comment by Jan Lindström (Inactive) [ 2017-09-18 ]

Please upgrade to a more recent version of 10.1 and the Galera provider. If the issue is still repeatable, please attach a debugger to the hung node and provide the output of thread apply all bt.
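Collecting such a backtrace can be done roughly as below. This is a hedged sketch: the `collect_bt` helper and the output file name are made up for illustration, and the mysqld PID must be supplied by the caller.

```shell
#!/bin/sh
# Hedged sketch: attach gdb to the hung mysqld and dump a backtrace of
# every thread ("thread apply all bt").
# collect_bt and the output file name are illustrative, not official tooling.
collect_bt() {
    pid="$1"
    if [ -z "$pid" ]; then
        echo "usage: collect_bt <mysqld-pid>"
        return 1
    fi
    gdb -p "$pid" --batch -ex "thread apply all bt" > "mysqld_bt_$pid.txt"
}
```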

Generated at Thu Feb 08 07:48:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.