Details

    Description

      I've recently upgraded a few MariaDB Galera clusters from 10.1 to 10.2. Since this upgrade, MariaDB / Galera has been extremely unstable on every single cluster.

      One of the issues I'm seeing quite frequently is queries stuck in the "query end" state, never finishing, which appears to lock the table metadata. The issue is compounded by the fact that simple SELECT statements apparently need a metadata lock on these tables in order to complete.
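
      As a side note, the processlist alone doesn't show which thread actually holds the lock. A rough way to inspect the metadata locks themselves (assuming the metadata_lock_info plugin, which ships with MariaDB but isn't loaded by default) would be something like:

      -- Load the plugin that exposes metadata locks (assumption: the plugin library is present on this build)
      INSTALL SONAME 'metadata_lock_info';

      -- List every metadata lock currently granted, with the owning thread id
      SELECT THREAD_ID, LOCK_MODE, LOCK_DURATION, LOCK_TYPE, TABLE_SCHEMA, TABLE_NAME
      FROM information_schema.METADATA_LOCK_INFO
      ORDER BY TABLE_SCHEMA, TABLE_NAME;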

      The processes stuck in "query end" cannot be killed. And because the SELECT statements are waiting on metadata locks, they cannot complete either. The only "work around", if you could call it that, is to literally take the entire cluster offline and start it up again with --wsrep-new-cluster.
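
      For reference, that "work around" looks roughly like this (a sketch only; the service name and bootstrap helper depend on the distribution and MariaDB version):

      # On every node: stop the server, since the hung "query end" threads ignore KILL
      systemctl stop mariadb

      # On the node with the most advanced state (check safe_to_bootstrap in grastate.dat),
      # bootstrap a new cluster; galera_new_cluster starts mysqld with --wsrep-new-cluster
      galera_new_cluster

      # On the remaining nodes: start normally so they rejoin via IST/SST
      systemctl start mariadb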

      This is happening anywhere from every couple of hours to every few days, on clusters that otherwise had months of uptime and only ever saw downtime during upgrades.

      Here is an example process list

      MariaDB [(none)]> show processlist;
      +--------+-------------+--------------------+-----------+---------+--------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
      | Id     | User        | Host               | db        | Command | Time   | State                           | Info                                                                                                 | Progress |
      +--------+-------------+--------------------+-----------+---------+--------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
      |      2 | system user |                    | NULL      | Sleep   |   2913 | Waiting for table metadata lock | UPDATE `prfix_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('36bae2de8f3399e891a30b |    0.000 |
      |      1 | system user |                    | NULL      | Sleep   | 888882 | wsrep aborter idle              | NULL                                                                                                 |    0.000 |
      |      3 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge coordinator        | NULL                                                                                                 |    0.000 |
      |      5 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
      |      4 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
      |      6 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
      |      7 | system user |                    | NULL      | Daemon  |   NULL | InnoDB shutdown handler         | NULL                                                                                                 |    0.000 |
      | 485377 | user000     | xxx.xx.xx.53:11779 | database1 | Query   |   2913 | query end                       | INSERT INTO `prfix_user` (`user_name`) VALUES ('new user name')                                      |    0.000 |
      | 485430 | user000     | xxx.xx.xx.52:46210 | database1 | Query   |   2788 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='36964')              |    0.000 |
      | 485933 | user000     | xxx.xx.xx.52:47277 | database1 | Query   |   1410 | query end                       | DELETE FROM `prfix_session` WHERE (`access`<1503723562)                                              |    0.000 |
      | 485993 | user000     | xxx.xx.xx.54:44107 | database1 | Query   |   1242 | query end                       | INSERT INTO `prfix_anchor` (`anchor_source`, `anchor_dest`) VALUES ('https:/www.example.com/some123' |    0.000 |
      | 486023 | user000     | xxx.xx.xx.52:47490 | database1 | Query   |   1163 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='29042')              |    0.000 |
      | 486150 | user000     | xxx.xx.xx.51:60057 | database1 | Query   |    761 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='5240')               |    0.000 |
      | 486223 | user000     | xxx.xx.xx.52:48073 | database1 | Query   |    497 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='25304')              |    0.000 |
      | 486254 | user000     | xxx.xx.xx.51:60409 | database1 | Query   |    395 | query end                       | UPDATE `prfix_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='114141')           |    0.000 |
      | 486285 | user000     | xxx.xx.xx.52:48281 | database1 | Query   |    269 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='13309')              |    0.000 |
      | 486360 | root        | localhost          | NULL      | Query   |      0 | init                            | show processlist                                                                                     |    0.000 |
      | 486363 | user000     | xxx.xx.xx.52:48581 | database1 | Query   |     47 | Waiting for table metadata lock | SELECT SQL_NO_CACHE ga.*, us.user_id, us.user_name, us.user_url, th.thumb_hash, th.file_hash, fl.fil |    0.000 |
      | 486364 | user000     | xxx.xx.xx.53:21831 | database1 | Query   |     46 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `prfix_gallery` AS `ga` LEFT JOIN `prfix_file_thumb` AS `th` O |    0.000 |
      | 486365 | user000     | xxx.xx.xx.51:60884 | database1 | Query   |     34 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `prfix_gallery` AS `ga` LEFT JOIN `prfix_file_thumb` AS `th` O |    0.000 |
      | 486366 | user000     | xxx.xx.xx.51:60887 | database1 | Query   |     31 | Waiting for table metadata lock | SELECT * FROM `prfix_user` AS `us` LEFT JOIN `prfix_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
      | 486373 | user000     | xxx.xx.xx.54:62339 | database1 | Query   |     26 | Waiting for table metadata lock | SELECT * FROM `prfix_user` AS `us` LEFT JOIN `prfix_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
      | 486375 | user000     | xxx.xx.xx.51:60907 | database1 | Query   |     17 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `prfix_gallery` AS `ga` LEFT JOIN `prfix_file_thumb` AS `th` O |    0.000 |
      | 486381 | user000     | xxx.xx.xx.51:60934 | database1 | Query   |      6 | Waiting for table metadata lock | SELECT * FROM `prfix_user` AS `us` LEFT JOIN `prfix_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
      +--------+-------------+--------------------+-----------+---------+--------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
      

          Activity

            darkain Vincent Milum Jr added a comment - edited

            Some new notes

            I've had the cluster not using gmcast segments for the past couple of months, and all has been stable in that time frame. But as mentioned above, this means I've dropped from multiple data centers for redundancy down to a single data center.

            In my application, there were calls to "SET SESSION wsrep_sync_wait = [MASK]", which have since been removed as they're no longer needed after refactoring other code. Since then, I've re-enabled the other gmcast segments, and so far things appear stable.

            Additionally, the cluster has been upgraded to MariaDB Server 10.2.14 (current stable) with Galera 25.3.23.

            There may have been a MariaDB Server or Galera patch that addressed this issue, or it may have been a bug with particular wsrep_sync_wait states. I'm currently betting on the latter, since that setting is directly responsible for locking the transaction processing queue and may have had issues unlocking it.
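
            For context, the removed calls looked roughly like the following (the actual bitmask is masked above; the value here is purely illustrative):

            -- Illustrative only: the real bitmask used by the application is masked in this report
            SET SESSION wsrep_sync_wait = 1;   -- bit 1: run a causality check before READ statements
            SELECT COUNT(*) FROM prfix_user;   -- this read now waits until the node has applied all earlier write-sets
            SET SESSION wsrep_sync_wait = 0;   -- restore the default (no sync wait)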


            darkain Vincent Milum Jr added a comment

            Just as an update, since removing all references to wsrep_sync_wait in my application code and MariaDB configurations, everything has remained stable and online for over a month now.

            darkain Vincent Milum Jr added a comment

            Unsure if https://jira.mariadb.org/browse/MDEV-17073 or https://jira.mariadb.org/browse/MDEV-17541 are related. Ever since switching to MariaDB 10.3.11, things seem stable thus far. The descriptions in those other bugs seem quite closely related to the possible underlying issues behind this one. For the time being, I'm marking this one as "FIXED" unless I see this issue pop up again in any of my Galera clusters.

            darkain Vincent Milum Jr added a comment

            It has now been well over a month since switching to 10.3.11, and there have been zero issues with "query end" locking ever since. Previously my cluster would die at least once a week. I now fully believe this issue is directly related to the locking issues addressed in the other linked bugs. I apparently don't have rights to close my own issues here, though? So if someone from MariaDB with rights wants to, I'm pretty certain this is finally solved.

            elenst Elena Stepanova added a comment

            darkain,

            Thanks for the update. Based on it, closing as fixed within the scope of above-mentioned bugs.

            People

              Assignee: jplindst Jan Lindström (Inactive)
              Reporter: darkain Vincent Milum Jr
              Votes: 2
              Watchers: 7

