[MDEV-11312] One Galera node hangs on "show global status" Created: 2016-11-18 Updated: 2017-09-18 Resolved: 2017-09-18

| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 10.1.19 |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Ján Regeš | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | Galera |
| Environment: | Gentoo Linux, Galera cluster of 3 nodes |
| Sprint: | 10.1.20 |
| Description |
|
Hi, we have one monitoring check that runs *SHOW GLOBAL STATUS* and another that calls SHOW STATUS LIKE 'wsrep_local_state'. Two nodes work properly, but the third hangs on *SHOW GLOBAL STATUS*. After a few minutes it contained hundreds of hung queries and the "Too many connections..." error. I attached a ZIP file with the complete PROCESSLIST and all variables. Thank you for checking. |
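For reference, the two checks described above can be sketched with the mysql command-line client. This is only a sketch: the hostname, user, and password below are placeholders, not taken from the report, and the client-side timeout is an assumption about how such a probe would avoid the "Too many connections" pile-up.

```shell
#!/bin/sh
# The two status queries the monitoring runs (quoted from the report):
STATUS_SQL="SHOW GLOBAL STATUS"
WSREP_SQL="SHOW STATUS LIKE 'wsrep_local_state'"

# Hypothetical wrapper: a client-side timeout means a hung node fails the
# check quickly instead of accumulating connections until the server
# reports "Too many connections".
check_node() {
    host="$1"; sql="$2"
    timeout 5 mysql -h "$host" -u monitor -pSECRET -N -B -e "$sql"
}

# Example invocation (placeholder hostname):
#   check_node galera-node3 "$WSREP_SQL"
```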
| Comments |
| Comment by Ján Regeš [ 2016-11-18 ] | |||
|
Btw, I have now discovered an interesting fact... The mysqld process has been running for 18 hours (checked with ps -ef | grep mysqld), but 2.5 hours ago the log below appeared in the MySQL error log. I'll leave it in this state for another 4 hours and then restart the server. When I check strace -p MySQL-PID, the database does nothing; from time to time I just see a "Too many connections" trace for the monitoring mysql client. If it would help, I can provide IP-restricted SSH access to this node, which is currently hung. Thank you.
| |||
| Comment by Ján Regeš [ 2016-11-21 ] | |||
|
Hi, today the same problem appeared again. I checked the mysqld process with "strace -p 53000", but it waits for something indefinitely.
MySQL stops only after "kill -9 53000". After a restart, this node synchronizes its whole state via SST from the second node, and after that it works properly. But tomorrow, or later, the problem will probably appear again. Thank you for checking. | |||
| Comment by Ján Regeš [ 2016-11-24 ] | |||
|
And today the bug appeared on 2 of the 3 nodes (though at different times), so this bug is very critical for us. I think the Galera bug appears at the moment when 2 users ask for "show status" at the same time. All status queries hang on "Filling schema table". In our case there are 2 users that ask for status every 2-5 seconds:
- "monitoring" - our script that queries and logs MySQL status every 3-5 seconds (to reconstruct the situation in the "last seconds of a disaster")
- "clustercheck" - our script for haproxy, which asks every Galera node for its status (haproxy needs it for failover management)
As a temporary measure, I will disable the "monitoring" user and check whether things improve. | |||
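A clustercheck-style probe of the kind described here typically maps wsrep_local_state to an HTTP response for haproxy; state 4 means Synced, the only state in which a node should receive traffic. A minimal sketch of just that mapping (the actual status query and credentials are left out; the function name is mine, not from the report):

```shell
#!/bin/sh
# Map a wsrep_local_state value to the HTTP status line a clustercheck-style
# script would return to haproxy. State 4 (Synced) is healthy; any other
# state (Joining, Donor/Desynced, Joined, ...) should be taken out of rotation.
state_to_http() {
    case "$1" in
        4) echo "200 OK" ;;
        *) echo "503 Service Unavailable" ;;
    esac
}

# In a real probe the argument would come from something like:
#   mysql -N -B -e "SHOW STATUS LIKE 'wsrep_local_state'" | awk '{print $2}'
```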
| Comment by Sachin Setiya (Inactive) [ 2016-12-06 ] | |||
|
Hi, I am not able to reproduce this. I created a 3-node Galera cluster where each node asks for global status and SHOW STATUS LIKE 'wsrep_local_state'. I am still not able to reproduce it. I am attaching an mtr test script with its result: galera_mdev-11312.test | |||
| Comment by Ján Regeš [ 2016-12-06 ] | |||
|
Thank you, Sachin, for your feedback and the prepared test. Unfortunately, this issue is not easily reproducible; it happens only once every few days, but its impact is very critical and it needs more and deeper investigation. Would it be possible, the next time this bug occurs, for me to give you an SSH connection to the server so you can check it directly? Or would that not help? Maybe you have some tools or tricks for analyzing a hung database on a live server. Thanks. | |||
| Comment by Sachin Setiya (Inactive) [ 2016-12-06 ] | |||
|
jan.reges I can try. I am quite new to Galera, but I will still try my best. | |||
| Comment by Ján Regeš [ 2016-12-06 ] | |||
|
Sachin, thank you! Btw, could you check our status/config variables (in the previously attached ZIP)? Maybe you will discover some specific setting that is related to this bug; perhaps it is caused by a misconfiguration on our side. For example, we use SSL encryption for replication. That is probably on a different layer than the cause of this bug, but maybe there is some relation? | |||
| Comment by Ján Regeš [ 2017-01-28 ] | |||
|
Hi, a few days ago I updated all 3 nodes to MariaDB 10.1.20, and today the same issue appeared on one node. After the node hangs, one MySQL thread runs at 100% CPU. strace on the PID of this CPU-intensive process/thread does not display any data; probably there is an endless loop somewhere, or something like that ;-( Do you have any tips on how to debug it at the next occurrence and what information would be helpful to you? Maybe this will help: here is the output of `strace -f -p [main-mysql-process-pid]` for about 10 seconds: http://preview.siteone.cz/janek/2017-01-28_galera-node-lag.strace.txt Thank you for your help. | |||
| Comment by Ján Regeš [ 2017-02-03 ] | |||
|
Today, the same issue appeared on another 3-node MariaDB Galera cluster. It's strange; I think our setup is quite standard: haproxy loads the Galera state from all nodes at a 2-3 second interval, and external monitoring (Nagios + Munin) loads the Galera state from all nodes at a 10-90 second interval. This evening I will update the whole Galera cluster from 10.1.20 to 10.1.21, but the release notes do not appear to contain any bugfixes related to this, so this issue will probably be present in 10.1.21 as well. The error log does not contain any records from this time. Is there any process/tool I could use to help you debug this issue in depth? | |||
| Comment by Ján Regeš [ 2017-03-17 ] | |||
|
Hello, this bug has not occurred since I disabled (about a month ago) the execution of the commands below every 5 seconds. These commands were running and being logged on all 3 nodes. So, if you want to simulate this bug, run these commands on all 3 nodes continuously for a few days and the bug will probably appear (just change the count to 86400 and run it every day). MariaDB/Galera is probably sensitive to this scenario, where status/extended-status is read every second on all 3 nodes.
| |||
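The actual command listing did not survive in this export. Based on the "count" and per-second wording above, a plausible but purely hypothetical equivalent would use mysqladmin's -i (repeat interval in seconds) and -c (iteration count) options:

```shell
#!/bin/sh
# Hypothetical reconstruction only -- the original commands are missing from
# this issue export. mysqladmin -i sets the repeat interval in seconds and
# -c the number of iterations; -c 86400 with -i 1 matches "change the count
# to 86400 and run it every day" (86400 one-second iterations = 24 hours).
POLL_STATUS="mysqladmin -i 1 -c 86400 status"
POLL_EXT="mysqladmin -i 1 -c 86400 extended-status"

# To attempt a reproduction, each would run continuously on all 3 nodes, e.g.:
#   $POLL_EXT >> /var/log/galera-status.log
```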
| Comment by Jan Lindström (Inactive) [ 2017-09-18 ] | |||
|
Please upgrade to a more recent version of 10.1 and the Galera provider. If it is still repeatable, please attach a debugger to the hung node and provide the output of `thread apply all bt`. |
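A sketch of what that debugger step could look like. The PID below is only the example PID from an earlier strace comment, and attaching requires ptrace permission (typically root):

```shell
#!/bin/sh
# Collect backtraces of all mysqld threads, as requested above. gdb's -batch
# flag runs the given -ex commands and exits; 53000 is just the example PID
# from the earlier comments in this issue.
PID=53000
GDB_CMD="gdb -p $PID -batch -ex 'thread apply all bt'"

# Run it and save the output to attach to the bug report (requires root):
#   eval "$GDB_CMD" > mysqld-backtraces.txt 2>&1
```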