Details

    Description

      Hi there,

      Every few days our cluster stalls. You always know doomsday has arrived, when this message appears in the log:

      May 17 06:46:25 uhu1 mysqld[27227]: WSREP: BF lock wait long
      

      This message then repeats a lot and other weird stuff finally appears (see below). The cluster never recovers. We always need to restart all nodes. I find this especially surprising, since there are various timeouts defined. If there is a deadlock, I wonder why it doesnt resolve automatically after some time by aborting with some error message. Googling for this message to find out whether it might be an application problem didnt yield any results either.

      May 17 06:46:26 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:28 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:29 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:30 uhu1 mysqld[27227]: message repeated 6 times: [ WSREP: BF lock wait long]
      May 17 06:46:31 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:32 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:33 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:34 uhu1 mysqld[27227]: message repeated 8 times: [ WSREP: BF lock wait long]
      May 17 06:46:35 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:36 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
      May 17 06:46:37 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:38 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:39 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
      May 17 06:46:40 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:40 uhu1 mysqld[27227]: =====================================
      May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 06:46:40 0x7f52c3cdd700 INNODB MONITOR OUTPUT
      May 17 06:46:40 uhu1 mysqld[27227]: =====================================
      May 17 06:46:40 uhu1 mysqld[27227]: Per second averages calculated from the last 16 seconds
      May 17 06:46:40 uhu1 mysqld[27227]: -----------------
      May 17 06:46:40 uhu1 mysqld[27227]: BACKGROUND THREAD
      May 17 06:46:40 uhu1 mysqld[27227]: -----------------
      May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread loops: 570080 srv_active, 0 srv_shutdown, 120059 srv_idle
      May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread log flush and writes: 690137
      May 17 06:46:40 uhu1 mysqld[27227]: ----------
      May 17 06:46:40 uhu1 mysqld[27227]: SEMAPHORES
      May 17 06:46:40 uhu1 mysqld[27227]: ----------
      May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: reservation count 2567993
      May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: signal count 1741581
      May 17 06:46:40 uhu1 mysqld[27227]: RW-shared spins 0, rounds 1276685, OS waits 365942
      May 17 06:46:40 uhu1 mysqld[27227]: RW-excl spins 0, rounds 16000314, OS waits 198061
      May 17 06:46:40 uhu1 mysqld[27227]: RW-sx spins 210079, rounds 4201641, OS waits 50893
      May 17 06:46:40 uhu1 mysqld[27227]: Spin rounds per wait: 1276685.00 RW-shared, 16000314.00 RW-excl, 20.00 RW-sx
      May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
      May 17 06:46:40 uhu1 mysqld[27227]: LATEST DETECTED DEADLOCK
      May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
      May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 03:24:35 0x7f52add94700
      May 17 06:46:40 uhu1 mysqld[27227]: *** (1) TRANSACTION:
      May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857376, ACTIVE 0 sec inserting
      May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
      May 17 06:46:40 uhu1 mysqld[27227]: LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
      May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704966, OS thread handle 0x7f52ae768700, query id 400303172 localhost 127.0.0.1 eposservicescom update
      May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1193996601591ba65393eb33.09632175', '1494984305.6058')
      May 17 06:46:40 uhu1 mysqld[27227]: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857376 lock_mode X locks rec but not gap waiting
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) TRANSACTION:
      May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857375, ACTIVE 0 sec inserting
      May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
      May 17 06:46:40 uhu1 mysqld[27227]: 3 lock struct(s), heap size 1136, 2 row lock(s)
      May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704967, OS thread handle 0x7f52add94700, query id 400303173 localhost 127.0.0.1 eposservicescom update
      May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1474565599591ba65393ebe9.44223691', '1494984305.6058')
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) HOLDS THE LOCK(S):
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock mode S locks rec but not gap
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock_mode X locks rec but not gap waiting
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** WE ROLL BACK TRANSACTION (2)
      

      Attachments

        Issue Links

          Activity

            adrian.tarau Adrian Tarau added a comment -

            I have 500MB of logs (uncompressed) ... I'll send you an email with a link to download the logs ....

            Forgot to mention ... we are using a 3 node cluster (development database) with HA Proxy, with one node dedicated for writes...nothing interesting is logged in (mostly) read-only nodes .... A stress test is started with the database in the same state ... after a while it locks and it is always triggered by a DDL statement (truncate in this case, in the past I've seen these dead-locks with ALTER TABLE ... ADD PARTITION ...). Once the DDL enters in dead-lock, no other query can be executed (and there are plenty of threads/connections available), even queries not related to that table lock. Same with ALTER TABLE ... ADD PARTITION

            First reference of a dead-lock starts at 2018-01-24 15:39:09 139784560858880 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1e9eb4f query

            adrian.tarau Adrian Tarau added a comment - I have 500MB of logs (uncompressed) ... I'll send you an email with a link to download the logs .... Forgot to mention ... we are using a 3 node cluster (development database) with HA Proxy, with one node dedicated for writes...nothing interesting is logged in (mostly) read-only nodes .... A stress test is started with the database in the same state ... after a while it locks and it is always triggered by a DDL statement (truncate in this case, in the past I've seen these dead-locks with ALTER TABLE ... ADD PARTITION ...). Once the DDL enters in dead-lock, no other query can be executed (and there are plenty of threads/connections available), even queries not related to that table lock. Same with ALTER TABLE ... ADD PARTITION First reference of a dead-lock starts at 2018-01-24 15:39:09 139784560858880 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1e9eb4f query
            neiltembo Neil Skrypuch added a comment -

            I am also still seeing this issue with 10.2.12, although the behaviour has changed somewhat compared to 10.2.9 (the previous version we were running).

            Now with 10.2.12, it seems to only happen on one master at a time (not multiple masters simultaneously). Also, the query that gets regularly printed to the logs is always a DDL statement now, which it was not before.

            The other noteworthy change with 10.2.12 is that the log output is much terser. Instead of a complete dump of SHOW ENGINE INNODB STATUS; every second, we now just see 5 lines per second, like so:

            Jan 18 02:53:12 tm2 mysqld[16437]: 2018-01-18 2:53:12 140246917351168 [Note] InnoDB: WSREP: BF lock wait long for trx:0x76a45d4 query: CREATE TABLE IF NOT EXISTS ...
            Jan 18 02:53:12 tm2 mysqld[16437]: TRANSACTION 124405204, ACTIVE 269206 sec
            Jan 18 02:53:12 tm2 mysqld[16437]: LOCK WAIT
            Jan 18 02:53:12 tm2 mysqld[16437]: MySQL thread id 16, OS thread handle 140246857389824, query id 27596013 creating table
            Jan 18 02:53:12 tm2 mysqld[16437]: CREATE TABLE IF NOT EXISTS ...

            neiltembo Neil Skrypuch added a comment - I am also still seeing this issue with 10.2.12, although the behaviour has changed somewhat compared to 10.2.9 (the previous version we were running). Now with 10.2.12, it seems to only happen on one master at a time (not multiple masters simultaneously). Also, the query that gets regularly printed to the logs is always a DDL statement now, which it was not before. The other noteworthy change with 10.2.12 is that the log output is much terser. Instead of a complete dump of SHOW ENGINE INNODB STATUS; every second, we now just see 5 lines per second, like so: Jan 18 02:53:12 tm2 mysqld [16437] : 2018-01-18 2:53:12 140246917351168 [Note] InnoDB: WSREP: BF lock wait long for trx:0x76a45d4 query: CREATE TABLE IF NOT EXISTS ... Jan 18 02:53:12 tm2 mysqld [16437] : TRANSACTION 124405204, ACTIVE 269206 sec Jan 18 02:53:12 tm2 mysqld [16437] : LOCK WAIT Jan 18 02:53:12 tm2 mysqld [16437] : MySQL thread id 16, OS thread handle 140246857389824, query id 27596013 creating table Jan 18 02:53:12 tm2 mysqld [16437] : CREATE TABLE IF NOT EXISTS ...
            adrian.tarau Adrian Tarau added a comment -

            I do not see a way to re-open the defect ....

            adrian.tarau Adrian Tarau added a comment - I do not see a way to re-open the defect ....

            adrian.tarau, please create a new one. You can refer to this one in the description if you'd like or copy-paste from it.

            serg Sergei Golubchik added a comment - adrian.tarau , please create a new one. You can refer to this one in the description if you'd like or copy-paste from it.

            Updated correct releases where actual fix is:

            commit 57cde8ccd19675dc98e3cbacf0ef5c72cb188e49 (HEAD -> 10.1, origin/10.1)
            Author: Jan Lindström <jan.lindstrom@mariadb.com>
            Date: Wed Jul 25 08:21:25 2018 +0300

            MDEV-15822: WSREP: BF lock wait long for trx

            In Galera BF (brute force) transactions may not wait for lock requests
            and normally BF-transaction would select transaction holding conflicting
            locks as a victim for rollback. However, background statistic calculation
            transaction is InnoDB internal transaction and it has no thd i.e. it can't be
            selected as a victim. If background statistics calculation transaction holds
            conflicting locks to statistics tables it will cause BF lock wait long
            error message. Correct way to handle background statistics calculation is to
            acquire thd for transaction but that change is too big for GA-releases and
            there are other reported problems on background statistics calculation.

            This fix avoids adding a table to background statistics calculation if

            jplindst Jan Lindström (Inactive) added a comment - Updated correct releases where actual fix is: commit 57cde8ccd19675dc98e3cbacf0ef5c72cb188e49 (HEAD -> 10.1, origin/10.1) Author: Jan Lindström <jan.lindstrom@mariadb.com> Date: Wed Jul 25 08:21:25 2018 +0300 MDEV-15822 : WSREP: BF lock wait long for trx In Galera BF (brute force) transactions may not wait for lock requests and normally BF-transaction would select transaction holding conflicting locks as a victim for rollback. However, background statistic calculation transaction is InnoDB internal transaction and it has no thd i.e. it can't be selected as a victim. If background statistics calculation transaction holds conflicting locks to statistics tables it will cause BF lock wait long error message. Correct way to handle background statistics calculation is to acquire thd for transaction but that change is too big for GA-releases and there are other reported problems on background statistics calculation. This fix avoids adding a table to background statistics calculation if

            People

              jplindst Jan Lindström (Inactive)
              rgpublic Ranjan Ghosh
              Votes:
              19 Vote for this issue
              Watchers:
              38 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.