[MDEV-13903] "query end" never ends Created: 2017-09-25 Updated: 2019-01-29 Resolved: 2019-01-29
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Data Manipulation - Delete, Data Manipulation - Insert, Data Manipulation - Update, Galera, Locking |
| Affects Version/s: | 10.2.7, 10.2.8, 10.2.9, 10.2.10 |
| Fix Version/s: | 10.3.11, 10.2.19 |
| Type: | Bug | Priority: | Major |
| Reporter: | Vincent Milum Jr | Assignee: | Jan Lindström (Inactive) |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | galera, innodb | ||
| Environment: | Debian Jessie |
| Description |

I've recently upgraded a few MariaDB Galera clusters from 10.1 to 10.2. Since the upgrade, MariaDB / Galera has been extremely unstable on every single cluster.

One of the issues I'm seeing quite frequently is queries stuck in the "query end" state that never finish, which appears to hold a lock on the table metadata. The issue is compounded by the fact that simple SELECT statements apparently need a metadata lock on these tables in order to complete. The processes hanging in "query end" cannot be killed, and because the SELECT statements are waiting on metadata locks, they cannot complete either.

The only "work around", if you can call it that, is to take the entire cluster offline and start it up again with --wsrep-new-cluster.

This is happening anywhere from every couple of hours to every few days, on clusters that otherwise saw months of uptime and only ever saw downtime during upgrades. Here is an example process list:

| Comments |
| Comment by Andrii Nikitin (Inactive) [ 2017-09-28 ] |

This may be caused by

Otherwise, if the problem still occurs, please provide outputs from each node when the problem occurs:

GDB stack traces may also help to troubleshoot the node with the problem:
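The requested stack traces can usually be captured by attaching gdb to the running server. A minimal sketch, assuming gdb is installed and the server process is named mysqld (the output filename is arbitrary):

```shell
# Dump backtraces of every server thread without restarting mysqld.
# Batch mode detaches from the process once the commands finish,
# so the server only pauses briefly while the traces are collected.
gdb --batch \
    -ex "set pagination off" \
    -ex "thread apply all bt full" \
    -p "$(pidof mysqld)" > mysqld-stacks.txt 2>&1
```

This is a diagnostic command fragment, not a runnable test case: it requires root (or matching-user) privileges and a live mysqld process on the same host.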
| Comment by Vincent Milum Jr [ 2017-10-02 ] |

I'm currently upgrading two separate clusters that run Debian Linux from 10.2.8 to 10.2.9. If the issue happens again, I'll report back. However, I'm not able to upgrade the FreeBSD cluster yet, as there are no packaged binaries available newer than 10.2.7. I may look into setting up a compile environment on FreeBSD when I get some more time to play around with this issue.
| Comment by Vincent Milum Jr [ 2017-10-16 ] |

This issue still exists on 10.2.9. I just had the cluster die on me again following a DELETE operation that got stuck in the "query end" status, locking the entire database against updates. When this occurs, it seems to happen on ALL nodes in the cluster at the same time, not just one. The table in question has fewer than 1K rows total, yet it stayed locked for over 30 minutes before I killed the process and rebooted the cluster.
| Comment by Vincent Milum Jr [ 2017-10-16 ] |

I had a little more time to investigate this today. On server A in the cluster, it happened with the same DELETE query mentioned above. However, on server B, it happened with an INSERT query. These didn't actually happen at the same time, but several minutes apart. I've also attached the gdb dumps from each of these two server instances: stack-1.txt
| Comment by Andrii Nikitin (Inactive) [ 2017-10-20 ] |

Could you also upload the error log from the problem servers?
| Comment by Vincent Milum Jr [ 2017-11-16 ] |

Not a whole lot went into the actual logs, sadly. There were no other MariaDB-related syslog messages within a few hours before this occurred. This particular time, it happened on all nodes within the cluster within a few minutes. The last time it occurred, earlier this week, only one node was affected; it was caught and restarted before any other nodes had issues. There are six nodes total in the cluster when this is happening.

Node A:

Node B:
| Comment by Vincent Milum Jr [ 2018-01-02 ] |

I've not fully confirmed this yet, but I believe the issue may be related to using multiple gmcast segments. So far, I've been running stable for a few days after disabling all but the primary data center's segment. This, however, means I currently do not have my real-time off-site copies of the data (the whole reason we use gmcast segments across multiple datacenters to begin with).
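For context, Galera assigns nodes to WAN segments through the provider options. A minimal my.cnf sketch of the kind of setting involved (the segment number here is hypothetical, not the reporter's actual configuration):

```ini
# my.cnf fragment for a node in a secondary datacenter.
# Nodes sharing a gmcast.segment value exchange replication traffic
# locally; cross-segment traffic is relayed through one node per segment.
[mysqld]
wsrep_provider_options="gmcast.segment=2"
```

Disabling the extra segments, as described above, would mean running all nodes with the same segment value (or in a single datacenter).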
| Comment by Vincent Milum Jr [ 2018-04-19 ] |

Some new notes:

I've had the cluster not using gmcast segments for the past couple of months, and all has been stable in that time frame. But as mentioned above, this means I've dropped from multiple data centers for redundancy down to a single data center.

In my application, there were calls to "SET SESSION wsrep_sync_wait = [MASK]" which have since been removed, as they're no longer needed after refactoring other code. Since then, I've re-enabled the other gmcast segments, and so far things appear stable. Additionally, the cluster has been upgraded to MariaDB Server 10.2.14 (current stable) with Galera 25.3.23.

There may have been a MariaDB Server or Galera patch that addressed this issue, or it may have been a bug with particular wsrep_sync_wait states. I'm currently betting on the latter, since that setting is directly responsible for locking the transaction processing queue, and may have had issues unlocking it.
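For reference, wsrep_sync_wait takes a bitmask selecting which statement types must first pass a cluster causality check. The removed calls would have looked roughly like this (the bitmask value and table name below are illustrative, not taken from the reporter's application):

```sql
-- Bit 1 makes reads wait until this node has applied all transactions
-- committed cluster-wide before the read began.
SET SESSION wsrep_sync_wait = 1;
SELECT COUNT(*) FROM app_table;  -- hypothetical table name
-- Restore the default (no causality checks).
SET SESSION wsrep_sync_wait = 0;
```

Because each such statement blocks until the local replication queue drains, a bug in that wait path would plausibly produce exactly the stalls described in this report.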
| Comment by Vincent Milum Jr [ 2018-05-21 ] |

Just as an update: since removing all references to wsrep_sync_wait from my application code and MariaDB configurations, everything has remained stable and online for over a month now.
| Comment by Vincent Milum Jr [ 2018-12-27 ] |

I'm unsure if https://jira.mariadb.org/browse/MDEV-17073 or https://jira.mariadb.org/browse/MDEV-17541 are related. Ever since switching to MariaDB 10.3.11, things have seemed stable thus far. The descriptions in those other bugs seem quite similar to the possible underlying issues behind this one. For the time being, I'm marking this one as "FIXED" unless I see the issue pop up again in any of my Galera clusters.
| Comment by Vincent Milum Jr [ 2019-01-29 ] |

It has now been well over a month since switching to 10.3.11, and there have been zero issues with "query end" locking ever since. Previously my cluster would die at least once a week. I now fully believe this issue is directly related to the locking issues addressed in the other linked bugs. I apparently don't have rights to close my own issues here, though, so if someone from MariaDB with rights wants to close this, I'm pretty certain it is finally solved.
| Comment by Elena Stepanova [ 2019-01-29 ] |

Thanks for the update. Based on it, closing as fixed within the scope of the above-mentioned bugs.