Details

    Description

      I've recently upgraded a few MariaDB Galera clusters from 10.1 to 10.2. Since this upgrade, MariaDB / Galera has been extremely unstable on every single cluster.

      One of the issue I'm seeing quite frequently is the combination of "query end" never ending, which appears to lock the table metadata. This issue is compounded by the fact that simple SELECT statements apparently need a metadata lock on these tables in order to complete.

      The processes with "query end" hanging around cannot be killed. And because SELECT statements are waiting on metadata locks, they cannot complete either. The only "work around" if you could call it that is to literally take the entire cluster offline and start it up again with --wsrep-new-cluster

      This is happening anywhere from every couple of hours to every few days, on clusters that otherwise lasted months of uptime and only ever saw downtime during upgrades.

      Here is an example process list

      MariaDB [(none)]> show processlist;
      +--------+-------------+--------------------+-----------+---------+--------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
      | Id     | User        | Host               | db        | Command | Time   | State                           | Info                                                                                                 | Progress |
      +--------+-------------+--------------------+-----------+---------+--------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
      |      2 | system user |                    | NULL      | Sleep   |   2913 | Waiting for table metadata lock | UPDATE `prfix_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('36bae2de8f3399e891a30b |    0.000 |
      |      1 | system user |                    | NULL      | Sleep   | 888882 | wsrep aborter idle              | NULL                                                                                                 |    0.000 |
      |      3 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge coordinator        | NULL                                                                                                 |    0.000 |
      |      5 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
      |      4 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
      |      6 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
      |      7 | system user |                    | NULL      | Daemon  |   NULL | InnoDB shutdown handler         | NULL                                                                                                 |    0.000 |
      | 485377 | user000     | xxx.xx.xx.53:11779 | database1 | Query   |   2913 | query end                       | INSERT INTO `prfix_user` (`user_name`) VALUES ('new user name')                                      |    0.000 |
      | 485430 | user000     | xxx.xx.xx.52:46210 | database1 | Query   |   2788 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='36964')              |    0.000 |
      | 485933 | user000     | xxx.xx.xx.52:47277 | database1 | Query   |   1410 | query end                       | DELETE FROM `prfix_session` WHERE (`access`<1503723562)                                              |    0.000 |
      | 485993 | user000     | xxx.xx.xx.54:44107 | database1 | Query   |   1242 | query end                       | INSERT INTO `prfix_anchor` (`anchor_source`, `anchor_dest`) VALUES ('https:/www.example.com/some123' |    0.000 |
      | 486023 | user000     | xxx.xx.xx.52:47490 | database1 | Query   |   1163 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='29042')              |    0.000 |
      | 486150 | user000     | xxx.xx.xx.51:60057 | database1 | Query   |    761 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='5240')               |    0.000 |
      | 486223 | user000     | xxx.xx.xx.52:48073 | database1 | Query   |    497 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='25304')              |    0.000 |
      | 486254 | user000     | xxx.xx.xx.51:60409 | database1 | Query   |    395 | query end                       | UPDATE `prfix_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='114141')           |    0.000 |
      | 486285 | user000     | xxx.xx.xx.52:48281 | database1 | Query   |    269 | query end                       | UPDATE `prfix_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='13309')              |    0.000 |
      | 486360 | root        | localhost          | NULL      | Query   |      0 | init                            | show processlist                                                                                     |    0.000 |
      | 486363 | user000     | xxx.xx.xx.52:48581 | database1 | Query   |     47 | Waiting for table metadata lock | SELECT SQL_NO_CACHE ga.*, us.user_id, us.user_name, us.user_url, th.thumb_hash, th.file_hash, fl.fil |    0.000 |
      | 486364 | user000     | xxx.xx.xx.53:21831 | database1 | Query   |     46 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `prfix_gallery` AS `ga` LEFT JOIN `prfix_file_thumb` AS `th` O |    0.000 |
      | 486365 | user000     | xxx.xx.xx.51:60884 | database1 | Query   |     34 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `prfix_gallery` AS `ga` LEFT JOIN `prfix_file_thumb` AS `th` O |    0.000 |
      | 486366 | user000     | xxx.xx.xx.51:60887 | database1 | Query   |     31 | Waiting for table metadata lock | SELECT * FROM `prfix_user` AS `us` LEFT JOIN `prfix_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
      | 486373 | user000     | xxx.xx.xx.54:62339 | database1 | Query   |     26 | Waiting for table metadata lock | SELECT * FROM `prfix_user` AS `us` LEFT JOIN `prfix_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
      | 486375 | user000     | xxx.xx.xx.51:60907 | database1 | Query   |     17 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `prfix_gallery` AS `ga` LEFT JOIN `prfix_file_thumb` AS `th` O |    0.000 |
      | 486381 | user000     | xxx.xx.xx.51:60934 | database1 | Query   |      6 | Waiting for table metadata lock | SELECT * FROM `prfix_user` AS `us` LEFT JOIN `prfix_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
      +--------+-------------+--------------------+-----------+---------+--------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
      

      Attachments

        Issue Links

          Activity

            anikitin Andrii Nikitin (Inactive) added a comment - - edited

            This may be caused by MDEV-13534 : Is there chance that you restart all nodes with innodb_stats_persistent=0 or try 10.2.9 where the bug should be fixed ?

            Otherwise - if the problem still occurs - please provide outputs from each node when the problem occurs:

            show processlist;
            show global variables;
            show global status;
            

            Gdb stack traces also may help to troubleshoot the problem on node with the problem:

            gdb -ex "set pagination 0" -ex "thread apply all bt"  --batch -p $(pidof mysqld) >> stack.txt
            gdb -ex "set pagination 0" -ex "thread apply all bt"  --batch -p $(pidof mysqld) >> stack.txt
            gdb -ex "set pagination 0" -ex "thread apply all bt"  --batch -p $(pidof mysqld) >> stack.txt
            gdb -ex "set pagination 0" -ex "thread apply all bt"  --batch -p $(pidof mysqld) >> stack.txt
            

            anikitin Andrii Nikitin (Inactive) added a comment - - edited This may be caused by MDEV-13534 : Is there chance that you restart all nodes with innodb_stats_persistent=0 or try 10.2.9 where the bug should be fixed ? Otherwise - if the problem still occurs - please provide outputs from each node when the problem occurs: show processlist; show global variables; show global status; Gdb stack traces also may help to troubleshoot the problem on node with the problem: gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld) >> stack.txt gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld) >> stack.txt gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld) >> stack.txt gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld) >> stack.txt

            I'm currently upgrading two separate clusters from 10.2.8 to 10.2.9 that use Debian Linux. If issue happens again, I'll report back.

            However, I'm not able to upgrade the FreeBSD cluster yet as there are no packaged binaries available newer than 10.2.7. I may look into setting up a compile environment in FreeBSD when I get some more time to play around with this issue.

            darkain Vincent Milum Jr added a comment - I'm currently upgrading two separate clusters from 10.2.8 to 10.2.9 that use Debian Linux. If issue happens again, I'll report back. However, I'm not able to upgrade the FreeBSD cluster yet as there are no packaged binaries available newer than 10.2.7. I may look into setting up a compile environment in FreeBSD when I get some more time to play around with this issue.

            This issue still exists on 10.2.9. I just had the cluster die out on me again following a DELETE operation that got stuck in "query end" status, locking the entire database from updates. When this occurs, it seems to happen on ALL nodes in the cluster at the same time, not just one. The table in question has less than 1K rows total, yet locked for over 30 minutes before I killed the process and rebooted the cluster.

            MariaDB [(none)]> show processlist;
            +--------+-------------+--------------------+-----------+---------+--------+--------------------------+------------------------------------------------------------------------------------------------------+----------+
            | Id     | User        | Host               | db        | Command | Time   | State                    | Info                                                                                                 | Progress |
            +--------+-------------+--------------------+-----------+---------+--------+--------------------------+------------------------------------------------------------------------------------------------------+----------+
            |      1 | system user |                    | NULL      | Sleep   |   4235 | committed 102718         | NULL                                                                                                 |    0.000 |
            |      2 | system user |                    | NULL      | Sleep   | 889646 | wsrep aborter idle       | NULL                                                                                                 |    0.000 |
            |      4 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker      | NULL                                                                                                 |    0.000 |
            |      3 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge coordinator | NULL                                                                                                 |    0.000 |
            |      5 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker      | NULL                                                                                                 |    0.000 |
            |      6 | system user |                    | NULL      | Daemon  |   NULL | InnoDB purge worker      | NULL                                                                                                 |    0.000 |
            |      7 | system user |                    | NULL      | Daemon  |   NULL | InnoDB shutdown handler  | NULL                                                                                                 |    0.000 |
            | 732459 | DB_USER     | XXX.XX.XX.51:35971 | DATABASE  | Query   |   2079 | query end                | DELETE FROM `dbase_session` WHERE (`access`<1505529464)                                              |    0.000 |
            | 732544 | DB_USER     | XXX.XX.XX.51:36150 | DATABASE  | Query   |   1916 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 732565 | DB_USER     | XXX.XX.XX.52:50041 | DATABASE  | Query   |   1909 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 732675 | DB_USER     | XXX.XX.XX.51:36419 | DATABASE  | Query   |   1788 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 732759 | DB_USER     | XXX.XX.XX.52:50439 | DATABASE  | Query   |   1662 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 732762 | DB_USER     | XXX.XX.XX.51:36576 | DATABASE  | Query   |   1660 | query end                | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 732976 | DB_USER     | XXX.XX.XX.52:50899 | DATABASE  | Query   |   1367 | query end                | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX')           |    0.000 |
            | 733066 | DB_USER     | XXX.XX.XX.51:37310 | DATABASE  | Query   |   1249 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 733086 | DB_USER     | XXX.XX.XX.51:37345 | DATABASE  | Killed  |   1221 | init                     | COMMIT                                                                                               |    0.000 |
            | 733106 | DB_USER     | XXX.XX.XX.52:51145 | DATABASE  | Killed  |   1193 | init                     | COMMIT                                                                                               |    0.000 |
            | 733237 | DB_USER     | XXX.XX.XX.51:37693 | DATABASE  | Query   |   1021 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 733375 | DB_USER     | XXX.XX.XX.52:51750 | DATABASE  | Query   |    838 | query end                | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 733429 | DB_USER     | XXX.XX.XX.51:38065 | DATABASE  | Query   |    753 | query end                | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX')           |    0.000 |
            | 733494 | DB_USER     | XXX.XX.XX.52:52020 | DATABASE  | Query   |    668 | query end                | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX')           |    0.000 |
            | 733559 | DB_USER     | XXX.XX.XX.51:38308 | DATABASE  | Query   |    539 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 733602 | DB_USER     | XXX.XX.XX.51:38437 | DATABASE  | Query   |    500 | query end                | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX')           |    0.000 |
            | 733740 | DB_USER     | XXX.XX.XX.52:52561 | DATABASE  | Query   |    350 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 733753 | DB_USER     | XXX.XX.XX.52:52601 | DATABASE  | Query   |    346 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 733892 | DB_USER     | XXX.XX.XX.52:52916 | DATABASE  | Query   |    187 | query end                | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX')           |    0.000 |
            | 733908 | DB_USER     | XXX.XX.XX.52:52925 | DATABASE  | Query   |    181 | query end                | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXX')               |    0.000 |
            | 733914 | DB_USER     | XXX.XX.XX.52:52955 | DATABASE  | Query   |    166 | query end                | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 733941 | DB_USER     | XXX.XX.XX.52:52970 | DATABASE  | Query   |     33 | updating                 | DELETE FROM `dbase_session` WHERE (`access`<1505531407)                                              |    0.000 |
            | 734004 | root        | localhost          | NULL      | Query   |      0 | init                     | show processlist                                                                                     |    0.000 |
            +--------+-------------+--------------------+-----------+---------+--------+--------------------------+------------------------------------------------------------------------------------------------------+----------+
            30 rows in set (0.00 sec)
             
             
            MariaDB [DATABASE]> SELECT COUNT(*) FROM dbase_session;
            +----------+
            | count(*) |
            +----------+
            |      799 |
            +----------+
             
             
            MariaDB [DATABASE]> SHOW VARIABLES LIKE "%version%";
            +-------------------------+-----------------------------------------+
            | Variable_name           | Value                                   |
            +-------------------------+-----------------------------------------+
            | innodb_version          | 5.7.19                                  |
            | protocol_version        | 10                                      |
            | slave_type_conversions  |                                         |
            | version                 | 10.2.9-MariaDB-10.2.9+maria~stretch-log |
            | version_comment         | mariadb.org binary distribution         |
            | version_compile_machine | x86_64                                  |
            | version_compile_os      | debian-linux-gnu                        |
            | version_malloc_library  | system                                  |
            | version_ssl_library     | OpenSSL 1.1.0f  25 May 2017             |
            | wsrep_patch_version     | wsrep_25.20                             |
            +-------------------------+-----------------------------------------+
            

            darkain Vincent Milum Jr added a comment - This issue still exists on 10.2.9. I just had the cluster die out on me again following a DELETE operation that got stuck in "query end" status, locking the entire database from updates. When this occurs, it seems to happen on ALL nodes in the cluster at the same time, not just one. The table in question has less than 1K rows total, yet locked for over 30 minutes before I killed the process and rebooted the cluster. MariaDB [(none)]> show processlist; +--------+-------------+--------------------+-----------+---------+--------+--------------------------+------------------------------------------------------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +--------+-------------+--------------------+-----------+---------+--------+--------------------------+------------------------------------------------------------------------------------------------------+----------+ | 1 | system user | | NULL | Sleep | 4235 | committed 102718 | NULL | 0.000 | | 2 | system user | | NULL | Sleep | 889646 | wsrep aborter idle | NULL | 0.000 | | 4 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | | 3 | system user | | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 | | 5 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | | 6 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | | 7 | system user | | NULL | Daemon | NULL | InnoDB shutdown handler | NULL | 0.000 | | 732459 | DB_USER | XXX.XX.XX.51:35971 | DATABASE | Query | 2079 | query end | DELETE FROM `dbase_session` WHERE (`access`<1505529464) | 0.000 | | 732544 | DB_USER | XXX.XX.XX.51:36150 | DATABASE | Query | 1916 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 732565 | DB_USER | XXX.XX.XX.52:50041 | DATABASE | Query | 1909 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 732675 | DB_USER | XXX.XX.XX.51:36419 | DATABASE | Query | 1788 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 732759 | DB_USER | XXX.XX.XX.52:50439 | DATABASE | Query | 1662 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 732762 | DB_USER | XXX.XX.XX.51:36576 | DATABASE | Query | 1660 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 732976 | DB_USER | XXX.XX.XX.52:50899 | DATABASE | Query | 1367 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX') | 0.000 | | 733066 | DB_USER | XXX.XX.XX.51:37310 | DATABASE | Query | 1249 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 733086 | DB_USER | XXX.XX.XX.51:37345 | DATABASE | Killed | 1221 | init | COMMIT | 0.000 | | 733106 | DB_USER | XXX.XX.XX.52:51145 | DATABASE | Killed | 1193 | init | COMMIT | 0.000 | | 733237 | DB_USER | XXX.XX.XX.51:37693 | DATABASE | Query | 1021 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 733375 | DB_USER | XXX.XX.XX.52:51750 | DATABASE | Query | 838 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 733429 | DB_USER | XXX.XX.XX.51:38065 | DATABASE | Query | 753 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX') | 0.000 | | 733494 | DB_USER | XXX.XX.XX.52:52020 | DATABASE | Query | 668 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX') | 0.000 | | 733559 | DB_USER | XXX.XX.XX.51:38308 | DATABASE | Query | 539 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 733602 | DB_USER | XXX.XX.XX.51:38437 | DATABASE | Query | 500 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX') | 0.000 | | 733740 | DB_USER | XXX.XX.XX.52:52561 | DATABASE | Query | 350 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 733753 | DB_USER | XXX.XX.XX.52:52601 | DATABASE | Query | 346 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 733892 | DB_USER | XXX.XX.XX.52:52916 | DATABASE | Query | 187 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXXX') | 0.000 | | 733908 | DB_USER | XXX.XX.XX.52:52925 | DATABASE | Query | 181 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXX') | 0.000 | | 733914 | DB_USER | XXX.XX.XX.52:52955 | DATABASE | Query | 166 | query end | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 733941 | DB_USER | XXX.XX.XX.52:52970 | DATABASE | Query | 33 | updating | DELETE FROM `dbase_session` WHERE (`access`<1505531407) | 0.000 | | 734004 | root | localhost | NULL | Query | 0 | init | show processlist | 0.000 | +--------+-------------+--------------------+-----------+---------+--------+--------------------------+------------------------------------------------------------------------------------------------------+----------+ 30 rows in set (0.00 sec)     MariaDB [DATABASE]> SELECT COUNT(*) FROM dbase_session; +----------+ | count(*) | +----------+ | 799 | +----------+     MariaDB [DATABASE]> SHOW VARIABLES LIKE "%version%"; +-------------------------+-----------------------------------------+ | Variable_name | Value | +-------------------------+-----------------------------------------+ | innodb_version | 5.7.19 | | protocol_version | 10 | | slave_type_conversions | | | version | 10.2.9-MariaDB-10.2.9+maria~stretch-log | | version_comment | mariadb.org binary distribution | | version_compile_machine | x86_64 | | version_compile_os | debian-linux-gnu | | version_malloc_library | system | | version_ssl_library | OpenSSL 1.1.0f 25 May 2017 | | wsrep_patch_version | wsrep_25.20 | +-------------------------+-----------------------------------------+

            I had a little bit more time to investigate this today. On server A in the cluster, it happened with the same DELETE query mentioned above. However, on server B, it happened from an INSERT query. These didn't actually happen at the same time, but actually several minutes apart. I've also got the gdb dumps attached from each of these two servers instances. stack-1.txt stack-2.txt

            +-------+-------------+--------------------+-----------+---------+-------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
            | Id    | User        | Host               | db        | Command | Time  | State                           | Info                                                                                                 | Progress |
            +-------+-------------+--------------------+-----------+---------+-------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
            |     1 | system user |                    | NULL      | Sleep   | 52535 | wsrep aborter idle              | NULL                                                                                                 |    0.000 |
            |     2 | system user |                    | NULL      | Sleep   | 14361 | Waiting for table metadata lock | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            |     3 | system user |                    | NULL      | Daemon  |  NULL | InnoDB purge coordinator        | NULL                                                                                                 |    0.000 |
            |     4 | system user |                    | NULL      | Daemon  |  NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
            |     6 | system user |                    | NULL      | Daemon  |  NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
            |     5 | system user |                    | NULL      | Daemon  |  NULL | InnoDB purge worker             | NULL                                                                                                 |    0.000 |
            |     7 | system user |                    | NULL      | Daemon  |  NULL | InnoDB shutdown handler         | NULL                                                                                                 |    0.000 |
            | 35840 | DB_USER     | 172.16.12.51:57540 | DATABASE  | Query   | 14361 | query end                       | INSERT INTO `dbase_user_profile` (`user_id`, `user_email`, `user_attributes`) VALUES (XXXXX, 'XXXXXX |    0.000 |
            | 36053 | DB_USER     | 172.16.12.52:38395 | DATABASE  | Query   | 14068 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 36892 | DB_USER     | 172.16.12.52:40433 | DATABASE  | Query   | 12839 | query end                       | DELETE FROM `dbase_session` WHERE (`access`<1505571796)                                              |    0.000 |
            | 37353 | DB_USER     | 172.16.12.51:32926 | DATABASE  | Query   | 12156 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 39063 | DB_USER     | 172.16.12.51:36942 | DATABASE  | Query   |  9849 | query end                       | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXX')             |    0.000 |
            | 39092 | DB_USER     | 172.16.12.51:37062 | DATABASE  | Query   |  9796 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 39910 | DB_USER     | 172.16.12.52:47571 | DATABASE  | Query   |  8688 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 41067 | DB_USER     | 172.16.12.51:41929 | DATABASE  | Query   |  7084 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 41170 | DB_USER     | 172.16.12.51:42169 | DATABASE  | Query   |  6967 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 41703 | DB_USER     | 172.16.12.51:43391 | DATABASE  | Query   |  6193 | query end                       | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXX')            |    0.000 |
            | 42287 | DB_USER     | 172.16.12.51:44804 | DATABASE  | Query   |  5389 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 42462 | DB_USER     | 172.16.12.52:53920 | DATABASE  | Query   |  5178 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 44490 | DB_USER     | 172.16.12.51:50059 | DATABASE  | Query   |  2524 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 44581 | DB_USER     | 172.16.12.51:50281 | DATABASE  | Query   |  2407 | query end                       | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXX')            |    0.000 |
            | 45457 | DB_USER     | 172.16.12.51:52416 | DATABASE  | Query   |  1179 | query end                       | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX')              |    0.000 |
            | 46237 | DB_USER     | 172.16.12.51:54332 | DATABASE  | Query   |    55 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46242 | DB_USER     | 172.16.12.51:54356 | DATABASE  | Query   |    53 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46243 | DB_USER     | 172.16.12.52:35146 | DATABASE  | Query   |    50 | Waiting for table metadata lock | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX |    0.000 |
            | 46246 | DB_USER     | 172.16.12.51:54371 | DATABASE  | Query   |    46 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46249 | DB_USER     | 172.16.12.52:35159 | DATABASE  | Query   |    41 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `dbase_gallery` AS `ga` LEFT JOIN `dbase_file_thumb` AS `th` O |    0.000 |
            | 46254 | DB_USER     | 172.16.12.51:54406 | DATABASE  | Query   |    36 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46260 | DB_USER     | 172.16.12.52:35184 | DATABASE  | Query   |    36 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46264 | root        | localhost          | NULL      | Query   |     0 | init                            | show processlist                                                                                     |    0.000 |
            | 46269 | DB_USER     | 172.16.12.52:35223 | DATABASE  | Query   |    23 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46276 | DB_USER     | 172.16.12.52:35268 | DATABASE  | Query   |    19 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46278 | DB_USER     | 172.16.12.51:54411 | DATABASE  | Query   |    18 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46283 | DB_USER     | 172.16.12.51:54415 | DATABASE  | Query   |    15 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46288 | DB_USER     | 172.16.12.52:35317 | DATABASE  | Query   |     5 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            | 46289 | DB_USER     | 172.16.12.51:54419 | DATABASE  | Query   |     3 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u |    0.000 |
            +-------+-------------+--------------------+-----------+---------+-------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+
            

            darkain Vincent Milum Jr added a comment - I had a little bit more time to investigate this today. On server A in the cluster, it happened with the same DELETE query mentioned above. However, on server B, it happened from an INSERT query. These didn't actually happen at the same time, but actually several minutes apart. I've also got the gdb dumps attached from each of these two servers instances. stack-1.txt stack-2.txt +-------+-------------+--------------------+-----------+---------+-------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +-------+-------------+--------------------+-----------+---------+-------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+ | 1 | system user | | NULL | Sleep | 52535 | wsrep aborter idle | NULL | 0.000 | | 2 | system user | | NULL | Sleep | 14361 | Waiting for table metadata lock | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 3 | system user | | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 | | 4 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | | 6 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | | 5 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | | 7 | system user | | NULL | Daemon | NULL | InnoDB shutdown handler | NULL | 0.000 | | 35840 | DB_USER | 172.16.12.51:57540 | DATABASE | Query | 14361 | query end | INSERT INTO `dbase_user_profile` (`user_id`, `user_email`, `user_attributes`) VALUES (XXXXX, 'XXXXXX | 0.000 | | 36053 | DB_USER | 172.16.12.52:38395 | DATABASE | Query | 14068 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 36892 | DB_USER | 172.16.12.52:40433 | DATABASE | Query | 12839 | query end | DELETE FROM `dbase_session` WHERE (`access`<1505571796) | 0.000 | | 37353 | DB_USER | 172.16.12.51:32926 | DATABASE | Query | 12156 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 39063 | DB_USER | 172.16.12.51:36942 | DATABASE | Query | 9849 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXX') | 0.000 | | 39092 | DB_USER | 172.16.12.51:37062 | DATABASE | Query | 9796 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 39910 | DB_USER | 172.16.12.52:47571 | DATABASE | Query | 8688 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 41067 | DB_USER | 172.16.12.51:41929 | DATABASE | Query | 7084 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 41170 | DB_USER | 172.16.12.51:42169 | DATABASE | Query | 6967 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 41703 | DB_USER | 172.16.12.51:43391 | DATABASE | Query | 6193 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXX') | 0.000 | | 42287 | DB_USER | 172.16.12.51:44804 | DATABASE | Query | 5389 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 42462 | DB_USER | 172.16.12.52:53920 | DATABASE | Query | 5178 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 44490 | DB_USER | 172.16.12.51:50059 | DATABASE | Query | 2524 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 44581 | DB_USER | 172.16.12.51:50281 | DATABASE | Query | 2407 | query end | UPDATE `dbase_gallery` SET `gallery_views`=`gallery_views`+1 WHERE (`gallery_id`='XXXXX') | 0.000 | | 45457 | DB_USER | 172.16.12.51:52416 | DATABASE | Query | 1179 | query end | UPDATE `dbase_anchor` SET `anchor_clicks`=`anchor_clicks`+1 WHERE (`anchor_id`='XXXXX') | 0.000 | | 46237 | DB_USER | 172.16.12.51:54332 | DATABASE | Query | 55 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46242 | DB_USER | 172.16.12.51:54356 | DATABASE | Query | 53 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46243 | DB_USER | 172.16.12.52:35146 | DATABASE | Query | 50 | Waiting for table metadata lock | UPDATE `dbase_file` SET `file_views`=`file_views`+1 WHERE (`file_hash`=UNHEX('XXXXXXXXXXXXXXXXXXXXXX | 0.000 | | 46246 | DB_USER | 172.16.12.51:54371 | DATABASE | Query | 46 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46249 | DB_USER | 172.16.12.52:35159 | DATABASE | Query | 41 | Waiting for table metadata lock | SELECT us.*, ga.*, th.thumb_hash FROM `dbase_gallery` AS `ga` LEFT JOIN `dbase_file_thumb` AS `th` O | 0.000 | | 46254 | DB_USER | 172.16.12.51:54406 | DATABASE | Query | 36 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46260 | DB_USER | 172.16.12.52:35184 | DATABASE | Query | 36 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46264 | root | localhost | NULL | Query | 0 | init | show processlist | 0.000 | | 46269 | DB_USER | 172.16.12.52:35223 | DATABASE | Query | 23 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46276 | DB_USER | 172.16.12.52:35268 | DATABASE | Query | 19 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46278 | DB_USER | 172.16.12.51:54411 | DATABASE | Query | 18 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46283 | DB_USER | 172.16.12.51:54415 | DATABASE | Query | 15 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46288 | DB_USER | 172.16.12.52:35317 | DATABASE | Query | 5 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | | 46289 | DB_USER | 172.16.12.51:54419 | DATABASE | Query | 3 | Waiting for table metadata lock | SELECT * FROM `dbase_user` AS `us` LEFT JOIN `dbase_file_thumb` AS `th` ON (`th`.`file_hash`=`us`.`u | 0.000 | +-------+-------------+--------------------+-----------+---------+-------+---------------------------------+------------------------------------------------------------------------------------------------------+----------+

            Could you also upload Error log for the problem servers?

            anikitin Andrii Nikitin (Inactive) added a comment - Could you also upload Error log for the problem servers?

            Not a whole lot went into the actual logs, sadly. There were no other MariaDB related syslog messages within a few hours before this occurred. This particular time, it happened on all nodes within the cluster within a few minutes. Last time it occurred earlier this week, only 1 node was effected, caught, and restarted before any other nodes had any issues. There are six nodes total in the cluster when this is happening.

            Node A:

            2017-11-15 15:48:20 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4209ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:50:29 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5043ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:50:34 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4341ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:50:55 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5168ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:51:04 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5920ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:51:12 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4733ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:52:06 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8209ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:52:15 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8409ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:52:41 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6318ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 15:53:17 [Note] InnoDB: page_cleaner: 1000ms intended loop took 7551ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
            2017-11-15 17:21:22 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') connection to peer a044f249 with addr tcp://SERVER_2:4567 timed out, no messages seen in PT3S
            2017-11-15 17:21:22 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://SERVER_2:4567
            2017-11-15 17:21:23 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') connection established to a044f249 tcp://SERVER_2:4567
            2017-11-15 17:21:23 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') connection established to 4c53efd7 tcp://SERVER_1:4567
            2017-11-15 17:21:26 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') turning message relay requesting off
            2017-11-15 17:51:42 [Warning] Aborted connection 130440 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets)
            2017-11-15 17:57:24 [Warning] Aborted connection 130593 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets)
            2017-11-15 17:57:47 [Warning] Aborted connection 130554 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets)
            

            Node B:

            2017-11-15 17:13:56 [Warning] Aborted connection 264087 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets)
            2017-11-15 17:21:21 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection to peer a044f249 with addr tcp://SERVER_2:4567 timed out, no messages seen in PT3S
            2017-11-15 17:21:21 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://SERVER_2:4567
            2017-11-15 17:21:22 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') reconnecting to a044f249 (tcp://SERVER_2:4567), attempt 0
            2017-11-15 17:21:23 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection established to a044f249 tcp://SERVER_2:4567
            2017-11-15 17:21:23 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection established to a044f249 tcp://SERVER_2:4567
            2017-11-15 17:21:23 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection established to 4c53efd7 tcp://SERVER_1:4567
            2017-11-15 17:21:26 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') turning message relay requesting off
            2017-11-15 17:25:32 [Warning] Aborted connection 264371 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got timeout reading communication packets)
            

            darkain Vincent Milum Jr added a comment - Not a whole lot went into the actual logs, sadly. There were no other MariaDB related syslog messages within a few hours before this occurred. This particular time, it happened on all nodes within the cluster within a few minutes. Last time it occurred earlier this week, only 1 node was effected, caught, and restarted before any other nodes had any issues. There are six nodes total in the cluster when this is happening. Node A: 2017-11-15 15:48:20 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4209ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:50:29 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5043ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:50:34 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4341ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:50:55 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5168ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:51:04 [Note] InnoDB: page_cleaner: 1000ms intended loop took 5920ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:51:12 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4733ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:52:06 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8209ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:52:15 [Note] InnoDB: page_cleaner: 1000ms intended loop took 8409ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:52:41 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6318ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 15:53:17 [Note] InnoDB: page_cleaner: 1000ms intended loop took 7551ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.) 2017-11-15 17:21:22 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') connection to peer a044f249 with addr tcp://SERVER_2:4567 timed out, no messages seen in PT3S 2017-11-15 17:21:22 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://SERVER_2:4567 2017-11-15 17:21:23 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') connection established to a044f249 tcp://SERVER_2:4567 2017-11-15 17:21:23 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') connection established to 4c53efd7 tcp://SERVER_1:4567 2017-11-15 17:21:26 [Note] WSREP: (4e871fe5, 'tcp://0.0.0.0:4567') turning message relay requesting off 2017-11-15 17:51:42 [Warning] Aborted connection 130440 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets) 2017-11-15 17:57:24 [Warning] Aborted connection 130593 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets) 2017-11-15 17:57:47 [Warning] Aborted connection 130554 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets) Node B: 2017-11-15 17:13:56 [Warning] Aborted connection 264087 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got an error reading communication packets) 2017-11-15 17:21:21 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection to peer a044f249 with addr tcp://SERVER_2:4567 timed out, no messages seen in PT3S 2017-11-15 17:21:21 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://SERVER_2:4567 2017-11-15 17:21:22 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') reconnecting to a044f249 (tcp://SERVER_2:4567), attempt 0 2017-11-15 17:21:23 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection established to a044f249 tcp://SERVER_2:4567 2017-11-15 17:21:23 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection established to a044f249 tcp://SERVER_2:4567 2017-11-15 17:21:23 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') connection established to 4c53efd7 tcp://SERVER_1:4567 2017-11-15 17:21:26 [Note] WSREP: (7f23484c, 'tcp://0.0.0.0:4567') turning message relay requesting off 2017-11-15 17:25:32 [Warning] Aborted connection 264371 to db: 'DB_NAME' user: 'USER_NAME' host: 'THIS_NODE_IP_ADDRESS' (Got timeout reading communication packets)

            I've not fully confirmed this yet, but I believe the issue may be related to using multiple gmcast segments. So far, I've been running stable for a few days now with disabling all but the primary data center cluster. This, however, means I currently do not have my real-time off-site copies of the data (the whole reason we use gmcast segments across multiple datacenters to begin with)

            darkain Vincent Milum Jr added a comment - I've not fully confirmed this yet, but I believe the issue may be related to using multiple gmcast segments. So far, I've been running stable for a few days now with disabling all but the primary data center cluster. This, however, means I currently do not have my real-time off-site copies of the data (the whole reason we use gmcast segments across multiple datacenters to begin with)
            darkain Vincent Milum Jr added a comment - - edited

            Some new notes

            I've had the cluster not using gmcast segments for the past couple months, and all has been stable in that time frame. But like mentioned above, this means I've dropped from multiple data centers for redundancy down to a single data center.

            In my application, there were calls to "SET SESSION wsrep_sync_wait = [MASK]" which have since been removed as they're no longer needed after refactoring other code. Since then, I've re-enabled other gmcast segments, and so far things appear stable.

            Additionally, the cluster has been upgraded to MariaDB Server 10.2.14 (current stable) with Galera 25.3.23.

            There may have been a MariaDB Server or Galera patch that addressed this issue, or it may have been a bug with particular wsrep_sync_wait states. I'm currently betting on the latter, since that is directly responsible for locking the transaction processing queue, and may have had issues unlocking it.

            darkain Vincent Milum Jr added a comment - - edited Some new notes I've had the cluster not using gmcast segments for the past couple months, and all has been stable in that time frame. But like mentioned above, this means I've dropped from multiple data centers for redundancy down to a single data center. In my application, there were calls to "SET SESSION wsrep_sync_wait = [MASK] " which have since been removed as they're no longer needed after refactoring other code. Since then, I've re-enabled other gmcast segments, and so far things appear stable. Additionally, the cluster has been upgraded to MariaDB Server 10.2.14 (current stable) with Galera 25.3.23. There may have been a MariaDB Server or Galera patch that addressed this issue, or it may have been a bug with particular wsrep_sync_wait states. I'm currently betting on the latter, since that is directly responsible for locking the transaction processing queue, and may have had issues unlocking it.

            Just as an update, since removing all references to wsrep_sync_wait in my application code and MariaDB configurations, everything has remained stable and online for over a month now.

            darkain Vincent Milum Jr added a comment - Just as an update, since removing all references to wsrep_sync_wait in my application code and MariaDB configurations, everything has remained stable and online for over a month now.

            Unsure if https://jira.mariadb.org/browse/MDEV-17073 or https://jira.mariadb.org/browse/MDEV-17541 are related. Ever since switching to MariaDB 10.3.11, things seem stable thus far. The descriptions in those other bugs seems quite similarly related to possible underlying issues with this bug. For the time being, I'm making this one as "FIXED" unless I see this issue pop up again in any of my Galera clusters.

            darkain Vincent Milum Jr added a comment - Unsure if https://jira.mariadb.org/browse/MDEV-17073 or https://jira.mariadb.org/browse/MDEV-17541 are related. Ever since switching to MariaDB 10.3.11, things seem stable thus far. The descriptions in those other bugs seems quite similarly related to possible underlying issues with this bug. For the time being, I'm making this one as "FIXED" unless I see this issue pop up again in any of my Galera clusters.

            It has now been well over a month since switching to 10.3.11 and there has been zero issues with "query end" locking ever since. Previously my cluster would die at least once a week. I now fully believe this issue is directly related to the locking issues addressed in the other linked bugs. I apparently don't have rights to close my own issues here though? So if someone from MariaDB with rights wants to, I'm pretty certain this is finally solved.

            darkain Vincent Milum Jr added a comment - It has now been well over a month since switching to 10.3.11 and there has been zero issues with "query end" locking ever since. Previously my cluster would die at least once a week. I now fully believe this issue is directly related to the locking issues addressed in the other linked bugs. I apparently don't have rights to close my own issues here though? So if someone from MariaDB with rights wants to, I'm pretty certain this is finally solved.

            darkain,

            Thanks for the update. Based on it, closing as fixed within the scope of above-mentioned bugs.

            elenst Elena Stepanova added a comment - darkain , Thanks for the update. Based on it, closing as fixed within the scope of above-mentioned bugs.

            People

              jplindst Jan Lindström (Inactive)
              darkain Vincent Milum Jr
              Votes:
              2 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.