Details

    Description

      Hi there,

      Every few days our cluster stalls. You always know doomsday has arrived when this message appears in the log:

      May 17 06:46:25 uhu1 mysqld[27227]: WSREP: BF lock wait long
      

      This message then repeats many times, and eventually other odd output appears (see below). The cluster never recovers; we always need to restart all nodes. I find this especially surprising, since there are various timeouts defined. If there is a deadlock, I wonder why it doesn't resolve automatically after some time by aborting with an error message. Googling for this message to find out whether it might be an application problem didn't yield any results either.

      May 17 06:46:26 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:28 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:29 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:30 uhu1 mysqld[27227]: message repeated 6 times: [ WSREP: BF lock wait long]
      May 17 06:46:31 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:32 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:33 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:34 uhu1 mysqld[27227]: message repeated 8 times: [ WSREP: BF lock wait long]
      May 17 06:46:35 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:36 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
      May 17 06:46:37 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:38 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:39 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
      May 17 06:46:40 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:40 uhu1 mysqld[27227]: =====================================
      May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 06:46:40 0x7f52c3cdd700 INNODB MONITOR OUTPUT
      May 17 06:46:40 uhu1 mysqld[27227]: =====================================
      May 17 06:46:40 uhu1 mysqld[27227]: Per second averages calculated from the last 16 seconds
      May 17 06:46:40 uhu1 mysqld[27227]: -----------------
      May 17 06:46:40 uhu1 mysqld[27227]: BACKGROUND THREAD
      May 17 06:46:40 uhu1 mysqld[27227]: -----------------
      May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread loops: 570080 srv_active, 0 srv_shutdown, 120059 srv_idle
      May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread log flush and writes: 690137
      May 17 06:46:40 uhu1 mysqld[27227]: ----------
      May 17 06:46:40 uhu1 mysqld[27227]: SEMAPHORES
      May 17 06:46:40 uhu1 mysqld[27227]: ----------
      May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: reservation count 2567993
      May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: signal count 1741581
      May 17 06:46:40 uhu1 mysqld[27227]: RW-shared spins 0, rounds 1276685, OS waits 365942
      May 17 06:46:40 uhu1 mysqld[27227]: RW-excl spins 0, rounds 16000314, OS waits 198061
      May 17 06:46:40 uhu1 mysqld[27227]: RW-sx spins 210079, rounds 4201641, OS waits 50893
      May 17 06:46:40 uhu1 mysqld[27227]: Spin rounds per wait: 1276685.00 RW-shared, 16000314.00 RW-excl, 20.00 RW-sx
      May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
      May 17 06:46:40 uhu1 mysqld[27227]: LATEST DETECTED DEADLOCK
      May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
      May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 03:24:35 0x7f52add94700
      May 17 06:46:40 uhu1 mysqld[27227]: *** (1) TRANSACTION:
      May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857376, ACTIVE 0 sec inserting
      May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
      May 17 06:46:40 uhu1 mysqld[27227]: LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
      May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704966, OS thread handle 0x7f52ae768700, query id 400303172 localhost 127.0.0.1 eposservicescom update
      May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1193996601591ba65393eb33.09632175', '1494984305.6058')
      May 17 06:46:40 uhu1 mysqld[27227]: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857376 lock_mode X locks rec but not gap waiting
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) TRANSACTION:
      May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857375, ACTIVE 0 sec inserting
      May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
      May 17 06:46:40 uhu1 mysqld[27227]: 3 lock struct(s), heap size 1136, 2 row lock(s)
      May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704967, OS thread handle 0x7f52add94700, query id 400303173 localhost 127.0.0.1 eposservicescom update
      May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1474565599591ba65393ebe9.44223691', '1494984305.6058')
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) HOLDS THE LOCK(S):
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock mode S locks rec but not gap
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock_mode X locks rec but not gap waiting
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** WE ROLL BACK TRANSACTION (2)
      

          Activity

            chises Michael Weber added a comment - - edited

            Hello all.
            Since the update from 10.1 to 10.2.6 yesterday, we have been facing the same issue: "WSREP: BF lock wait long".
            The whole cluster is stuck after that. We can only bring it back by restarting the node that logs "WSREP: BF lock wait long".
            Note: we are using Debian (kernel 3.16.39-1+deb8u2).

            May 23 17:39:00 he-rz22-sql01 mysqld[14930]: WSREP: BF lock wait long
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: WSREP: BF lock wait long
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: =====================================
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: 2017-05-23 17:39:01 0x7f7365ba3700 INNODB MONITOR OUTPUT
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: =====================================
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Per second averages calculated from the last 16 seconds
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: -----------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: BACKGROUND THREAD
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: -----------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: srv_master_thread loops: 632 srv_active, 0 srv_shutdown, 788 srv_idle
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: srv_master_thread log flush and writes: 1420
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ----------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: SEMAPHORES
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ----------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: OS WAIT ARRAY INFO: reservation count 11612
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: OS WAIT ARRAY INFO: signal count 7425
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RW-shared spins 0, rounds 1553, OS waits 655
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RW-excl spins 0, rounds 4065, OS waits 84
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RW-sx spins 392, rounds 3750, OS waits 34
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Spin rounds per wait: 1553.00 RW-shared, 4065.00 RW-excl, 9.57 RW-sx
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: TRANSACTIONS
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Trx id counter 246763323
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Purge done for trx's n:o < 246762955 undo n:o < 0 state: running but idle
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: History list length 35
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: LIST OF TRANSACTIONS FOR EACH SESSION:
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ---TRANSACTION 421620296599400, not started
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: 0 lock struct(s), heap size 1136, 0 row lock(s)
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ---TRANSACTION 246763322, ACTIVE 3 sec starting index read
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: mysql tables in use 1, locked 1
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: MySQL thread id 2113, OS thread handle 0x7f735eac8700, query id 47052 172.16.0.12 USER updating
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: UPDATE `devices` SET `updated_at`=1495553938, `last_contact`=1495553938 WHERE `id`=145
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ------- TRX HAS BEEN WAITING 3 SEC FOR THIS LOCK TO BE GRANTED:
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RECORD LOCKS space id 851 page no 5 n bits 128 index PRIMARY of table `DATABASE`.`devices` trx id 246763322 lock_mode X locks rec but not gap waiting
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Record lock, heap no 18 PHYSICAL RECORD: n_fields 54; compact format; info bits 0
            
            

            rgpublic Ranjan Ghosh added a comment -

            Does anybody at least have an idea what causes this? Our cluster sometimes stalls multiple times a day and needs to be restarted, which is very, very ugly. I can't say for sure, but I have the feeling this bug was introduced in recent versions (10.2.3, 10.2.4 ...?). If only I knew a workaround or some setting that made this error at least less likely to appear. Right now, it's not really usable. I'm even considering turning off Galera and letting MariaDB run as a single node for the time being. My current config, in case that's of any help:

            [mysqld]
             
            # Enough buffer for data, indexes etc
            innodb_buffer_pool_size=4G
             
            # Do not wait too long for locks - connections might build up (Especially important for Drupal semaphore table)
            innodb_lock_wait_timeout=10
             
            #Recommended by Drupal
            max_allowed_packet=512M
             
            #Try to prevent crashes
            skip-log-bin
             
            # Purge faster :-)
            innodb_purge_threads = 16
             
            # Replicate faster :-)
            wsrep_slave_threads = 32
             
            # Reduce dead-locks
            transaction-isolation = READ-COMMITTED
             
            #Query Cache for faster performance
            query_cache_type = 1
            query_cache_size = 128M # Should not be too large to avoid cache contention
             
            #mysql settings mandatory for galera
            binlog_format=ROW
            binlog_row_image=MINIMAL
            default_storage_engine=InnoDB
            innodb_autoinc_lock_mode=2
            bind-address=0.0.0.0
            innodb_flush_log_at_trx_commit=0
             
            # Accept up to 300 incoming connections but use the thread pool to prevent too many simultaneous connections actually working at the same time
            max_connections=300
            thread-handling=pool-of-threads
            thread_pool_size=32
            extra-port=3307
            extra-max-connections=1
             
            #galera settings
            wsrep_on=ON
            wsrep_provider=/usr/lib/galera/libgalera_smm.so
            wsrep_cluster_name="uhu_cluster"
            wsrep_cluster_address="gcomm://node1,node2"
            wsrep_node_address=192.168.0.1
            wsrep_provider_options='gcache.size=2G'   
             
            enforce_storage_engine=InnoDB
            sql_mode=NO_AUTO_CREATE_USER
            

            chises Michael Weber added a comment -

            Bad news: we still have this issue.
            For the moment I have configured the load balancer to send requests to only one node. That seems to work for a while; we have had no crash for three days.
            Waiting for an update/patch.

            rgpublic Ranjan Ghosh added a comment -

            Unfortunately, even if I send requests to only one node, our cluster still crashes. Currently, I have only a single node running. This defeats the whole purpose of why we are using MariaDB/Galera in the first place, though. It is, however, stable.

            When researching this bug, I find many different anecdotal reports over the years:

            https://groups.google.com/forum/#!topic/codership-team/6wsv_jUmc5w
            https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1671542
            https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1197771
            https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/43848-cluster-locks-insert-update-issues
            https://www.mail-archive.com/maria-discuss@lists.launchpad.net/msg02441.html

            None with a real solution or any explanation of why (i.e. under which circumstances) this happens. The error seems very random to me. Sometimes it happens at night when there is hardly any load on the server at all. All reports have in common that the database never seems to recover by itself; in all reported cases a restart appeared to be required. Therefore, at the very least, I guess the database should just shut down (signal 11) in case this error happens. This would be far better than the current situation, where the database happily accepts connections which then just hang/freeze forever. I think this bug is particularly evil because of exactly that property: we use Galera specifically to have an automatic failover if something goes wrong. This bug stalls the database, and because it isn't properly shut down, the second node cannot take over and hangs as well.

            danblack Daniel Black added a comment - - edited

            Please try to get a backtrace.

            Also, what Galera version are you running? Have you tried 25.3.25?

            edit: camera -> galera (groan - phone autospell)

            rgpublic Ranjan Ghosh added a comment -

            You meant Galera version, I assume? Well, I use the one that comes with the OS (Ubuntu) AFAICT:

            ~ # dpkg -l | grep galera
            ii  galera-3                             25.3.19-xenial                       amd64        Replication framework for transactional applications
            

            I've also added the MariaDB PPAs but the most recent version I can find there is 25.3.20. How can I upgrade to a newer version? Is there a separate PPA?

            Furthermore, I understand why you'd like to have a backtrace, although it's obviously very, very difficult to get in this case. I'd first need to enable the second node again. Then I'd need to compile a separate MariaDB version. Then wait until it hangs. It doesn't crash; it only hangs. Sometimes this happens only once every 1-2 days, sometimes in the very early morning hours. Then I would need to send it a "kill -6", if I understand correctly, and grab the backtrace. As this is a live server with important websites on it, that's not an easy feat. I'll need some time to make that possible. In the meantime: any insights into why this might happen would be much appreciated. Right now, I'm totally in the dark about what I should avoid or watch out for...

            rgpublic Ranjan Ghosh added a comment -

            OK, I wrote a small Bash script as an intermediate step (see below, systemd required). Perhaps this is also useful to others. I will add it to the crontab to run every minute. It watches the server and shuts down the node if the problem occurs. This hopefully makes it possible again to run with two nodes unsupervised, because the system should recover automatically. Now, next step: compile MariaDB with debug symbols. Let's see how this works out...

            #!/bin/bash
             
            # Don't do anything if MariaDB isn't running or is already in the process of shutting down
            if ! (systemctl status mariadb | grep --quiet 'active (running)'); then
                    echo "MariaDB not running";
                    exit;
            fi
             
            # Get the start date/time of MariaDB and the date/time of 10 minutes ago
            ts_start=$(systemctl show -p ExecMainStartTimestamp mariadb | cut -d "=" -f 2 | awk '{print $2 " " $3}')
            ts_ago=$(date --date="10 minutes ago" +"%Y-%m-%d %H:%M:%S")
             
            # Use whichever is greater as the log start date (i.e. get the logs only for the current run, but only the most recent 10 minutes)
            if [[ "$ts_start" > "$ts_ago" ]]; then
                    ts=$ts_start;
            else
                    ts=$ts_ago;
            fi
             
            # If the log contains "BF lock wait long", the database is frozen and needs to be shut down.
            if (journalctl -u mariadb --since "$ts" | grep --quiet "BF lock wait long"); then
                    echo "BF lock wait detected! Shutting down MariaDB.";
                    systemctl --no-block stop mariadb
            fi
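
            For completeness, here is a minimal sketch of how such a watchdog could be scheduled to run every minute; the script path and log file below are assumptions, not part of the original comment:

            # Hypothetical: append an entry to root's crontab so the watchdog runs once a minute.
            ( crontab -l 2>/dev/null; echo '* * * * * /usr/local/sbin/galera_watchdog.sh >> /var/log/galera_watchdog.log 2>&1' ) | crontab -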
            

            rgpublic Ranjan Ghosh added a comment -

            @danblack: I hope it will be sufficient to build MariaDB with just the additional debug symbols to get the backtrace, right? I won't need the 30% slower full-debug version? Obviously, I'd like to avoid that, as it is a production server. I'm referring to this page here:

            https://mariadb.com/kb/en/mariadb/compiling-mariadb-for-debugging/

            danblack Daniel Black added a comment - - edited

            There are debug symbol packages for the existing install in the MariaDB repos that should be installable, e.g.:
            http://mirror.aarnet.edu.au/pub/MariaDB/repo/10.2/ubuntu/pool/main/m/mariadb-10.2/mariadb-server-core-10.2-dbgsym_10.2.6+maria~xenial_amd64.deb

            Don't forget the Galera symbols too:
            http://mirror.aarnet.edu.au/pub/MariaDB/repo/10.2/ubuntu/pool/main/g/galera-3/galera-3-dbg_25.3.20-xenial_amd64.deb

            Otherwise use -DCMAKE_BUILD_TYPE=RelWithDebInfo and it won't be any slower.

            Nice script. An alternate way:

            journalctl -u mariadb -f | while read line
            do
              if [[ $line =~ "BF lock wait long" ]] ....
            done
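
            One way that loop could be completed, as a sketch (the shutdown action mirrors the cron-based script above; it is an assumption, not part of the original suggestion):

            #!/bin/bash
            # Follow the MariaDB journal and shut this node down as soon as the message appears.
            journalctl -u mariadb -f | while read -r line
            do
              if [[ $line =~ "BF lock wait long" ]]; then
                echo "BF lock wait detected! Shutting down MariaDB."
                systemctl --no-block stop mariadb
                break
              fi
            done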
            

            rgpublic Ranjan Ghosh added a comment -

            OK, I've now installed the dbgsym packages. I wondered if I needed to restart the DB to actually "enable" the symbols. Well, I did it anyway just to be sure.
            D'oh - your script is of course waaaay shorter than mine. Thanks for sharing. I've added the line to create the backtrace, which I got from here:

            https://mariadb.com/kb/en/mariadb/how-to-produce-a-full-stack-trace-for-mysqld/#getting-a-stack-trace-for-a-running-mysqld-server

            Although, I think the docs were a bit wrong/misleading (didn't work for me) and "program" should really be "-p". I now have this line to get the backtrace:

            gdb --batch --eval-command="thread apply all bt" -p $(pgrep -xn mysqld) > /root/mysql_backtrace.log
            

            When I ran it, it created a backtrace most of the time. Sometimes, though, it just hung at the end and I had to kill gdb. Perhaps I was impatient. Hope this won't happen during the BF lock wait.
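
            If gdb hanging during an unattended capture is a concern, one option is to wrap the call in coreutils timeout; a sketch, with an arbitrary two-minute limit:

            # Give gdb at most 120 seconds; send SIGKILL 10 seconds later if it still hasn't exited.
            timeout --kill-after=10 120 \
                gdb --batch --eval-command="thread apply all bt" -p "$(pgrep -xn mysqld)" \
                > /root/mysql_backtrace.log 2>&1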

            OK, now that everything is set up, I guess we'll just have to wait until the rare striped tiger runs in front of the camera :-/
            Thanks again, @danblack, for all the helpful info. It's much appreciated.

            rgpublic Ranjan Ghosh added a comment -

            Ha! Finally the crash happened. The script worked perfectly. The DB was shut down. The backtrace was automatically created. Here it is:

            mysql_backtrace.log

            RaffiOl Ralf Fischer added a comment -

            Exactly the same here: I just upgraded a three-node cluster from 10.1.25 to 10.2.7. After 4 minutes I got "WSREP: BF lock wait long", and the cluster hangs for all updates/inserts until the hanging node is restarted. After less than 5 minutes, I got the same message again. Any good hints for downgrading back to 10.1?

            Debian 7.1,
            Linux mariadb-node1 3.2.0-4-amd64 #1 SMP Debian 3.2.81-2 x86_64 GNU/Linux

            Thanks in advance


            tmihaldinec Tomislav Mihaldinec added a comment -

            Same issue here, but on CentOS 7.3:

            Linux taskdb02 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
            KVM

            Jul 14 14:50:22 vmtaskdb03 mysqld: mysql tables in use 1, locked 1
            Jul 14 14:50:22 vmtaskdb03 mysqld: 5 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
            Jul 14 14:50:22 vmtaskdb03 mysqld: MySQL thread id 2027, OS thread handle 0x7f5196e33700, query id 30214 10.232.10.31 taskitud_ianis query end
            Jul 14 14:50:22 vmtaskdb03 mysqld: INSERT INTO `mw_campaign_track_open` (`campaign_id`, `subscriber_id`, `ip_address`, `user_agent`, `date_added`) VALUES (6351, 2314769, '90.9.29.141', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0', NOW())
            Jul 14 14:50:22 vmtaskdb03 mysqld: TABLE LOCK table `taskitud_database`.`mw_campaign_track_open` trx id 16643769 lock mode IX
            Jul 14 14:50:22 vmtaskdb03 mysqld: TABLE LOCK table `taskitud_database`.`mw_campaign` trx id 16643769 lock mode IS
            Jul 14 14:50:22 vmtaskdb03 mysqld: RECORD LOCKS space id 8 page no 76 n bits 120 index PRIMARY of table `taskitud_database`.`mw_campaign` trx id 16643769 lock mode S locks rec but not gap
            Jul 14 14:50:45 vmtaskdb03 journal: Suppressed 5655 messages from /system.slice/mariadb.service
            Jul 14 14:50:45 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:46 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:47 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:48 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:49 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:50 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:51 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:52 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:53 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:54 vmtaskdb03 mysqld: WSREP: BF lock wait long

            thanks


            canderson Chris Anderson added a comment -

            We have also received this error running on RedHat 7.3, MariaDB-server-10.2.7-1.el7, and galera-25.3.20-1.rhel7. The error only occurred on a system with high load and did not occur in the testing environments, which have a much lower load. Does this bug affect all of 10.2? Does it affect any 10.1.x release?

            rgpublic Ranjan Ghosh added a comment -

            Well, I'm only a poor user - no MariaDB guru, but what I can say is this:

            1) There are some anecdotal reports of people upgrading from 10.1.x to 10.2.x who suddenly encountered this bug. This suggests that the error probably didn't happen on 10.1.

            2) I myself never used 10.1 with Galera, so I can't tell for sure. But I've been using Galera with many earlier 10.2.x versions and always had this bug as far as I can remember. This suggests that all of 10.2 is probably affected, or at least that it won't help to downgrade, say, from 10.2.7 to 10.2.5 or something like that as a workaround.

            3) As stated in one of my previous comments, when researching this message on the web it seems that it does indeed show up in many varieties/versions of MySQL over the years. This suggests that the message might only be a symptom that has/had multiple causes. I assume, though, that the current influx of error reports all have the same underlying reason.

            Up until now, I don't have the slightest idea why it happens. The word "lock" in the error message, however, seems to suggest that it has something to do with locking problems. That is, the more InnoDB locking is going on, the more conflicts/deadlocks occur, and the more likely it could be that the error shows up. This is perhaps why you're experiencing the bug under high load. Just wildly guessing here, of course. On our server, the bug happens at arbitrary times, sometimes in the middle of the night, and not necessarily under exceptionally high load.

            In a previous comment, I posted my MySQL configuration. I'd like to know if someone sees something striking when comparing it to their config. Some correlation - anything.

            I find it strange that there seems to be not much concern about this bug and that the current MariaDB version is even considered "stable". For us, it renders the whole Galera feature totally useless - it's a huge problem. We are currently running without proper failover. The reason for this calm is probably that only very few people are affected by it, even though I assume there is a sizable population out there using a current MariaDB version with Galera - especially now that it's out of beta. This leads to the question: what are they doing differently than us? I'd really like to find out, because otherwise there doesn't seem to be much progress on this.


            canderson Chris Anderson added a comment -

            Thanks, Ranjan, for your response. We also upgraded from an older version of 10.1.x. We were trying to fix another bug and then ran into this much bigger bug; sorry, I didn't include that info. We are trying to assess a downgrade and want to stay above 10.1.23 if possible, but we would like to know beforehand whether the bug exists there, instead of going through all the work of a downgrade and then being surprised by the same bug we are trying to get away from. Hopefully we can get a response from MariaDB to confirm this.

            RaffiOl Ralf Fischer added a comment - - edited

            We used a lot of 10.1 versions up to 10.1.25 in a three-node cluster and never had that error. After upgrading to 10.1.7 we got that error every 5-15 minutes.


            canderson Chris Anderson added a comment -

            Thanks, Ralf, that is very helpful.


            canderson Chris Anderson added a comment -

            Also, Ralf, just to clarify: you got this bug after you upgraded to 10.2.7, correct? You said 10.1.7, and it may just be a typo.

            RaffiOl Ralf Fischer added a comment -

            Yes, sorry. Upgraded from 10.1.25 to 10.2.7...

            chises Michael Weber added a comment -

            Note: we used several versions of MariaDB in the 10.1 major release; that issue never occurred.
            After upgrading to 10.2 we are facing this issue.
            So I think a downgrade to 10.1 can be a workaround.

            dhawton Daniel Hawton added a comment -

            I too am getting this issue: fresh installs of 10.2, no upgrade.

            mrvovanness Vladimir Titov added a comment - - edited

            Got the same issue with a fresh install of mariadb-server v. 10.2.7 from the MariaDB repo. The server OS is Ubuntu 16.04.
            Some related logs here:

            Aug 15 22:38:22 mariadb mysqld[13566]: WSREP: BF lock wait long
            Aug 15 22:38:27 mariadb mysqld[13566]: message repeated 5 times: [ WSREP: BF lock wait long]
            Aug 15 22:38:27 mariadb mysqld[13566]: =====================================
            Aug 15 22:38:27 mariadb mysqld[13566]: 2017-08-15 22:38:27 0x7fbfdb84d700 INNODB MONITOR OUTPUT
            Aug 15 22:38:27 mariadb mysqld[13566]: =====================================
            

            Basically the same log as @Michael Weber posted.

            natrinicle Nathan Bohman added a comment -

            Could this perhaps be related to MDEV-13331? I am experiencing the same issue and noticed that MDEV-13331 was introduced in 10.2.2 and fixed in 10.2.8. I find that I hit the bug if the application tied to our DB (JIRA) attempts to make a ton of updates to the same row at the same time, resulting in a row lock. It would make sense if UPDATE got a similar patch as DROP from upstream and was ignoring innodb_lock_wait_timeout.
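
            For anyone trying to reproduce that scenario, a rough sketch of the kind of contention described above, i.e. concurrent updates to one row from two nodes at once (the table, column and node names are loosely borrowed from the log excerpts in this report, the database name app_db is a placeholder, and client credentials are assumed to come from defaults):

            # Fire concurrent UPDATEs against the same primary key from two cluster nodes.
            for i in $(seq 1 1000); do
                mysql -h node1 -e "UPDATE app_db.devices SET last_contact = UNIX_TIMESTAMP() WHERE id = 145" &
                mysql -h node2 -e "UPDATE app_db.devices SET last_contact = UNIX_TIMESTAMP() WHERE id = 145" &
            done
            wait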

            rgpublic Ranjan Ghosh added a comment -

            Very interesting idea. Makes a whole lot of sense to me. There is a DELETE CASCADE. It locks. It never times out due to the bug. The lock is never released. More and more stuff that's supposed to be replicated is piling up due to the lock. Boom!

            Unfortunately I cannot upgrade our cluster to 10.2.8 at the moment - perhaps in the forthcoming weeks. It would be really helpful, though, if someone else could verify this earlier, especially those whose cluster shuts down after a few minutes. With our cluster, I'd probably have to wait for days to see whether the error is really resolved or not.

            rgpublic Ranjan Ghosh added a comment -

            Ah. Bummer! I used this call on my database:

            SELECT concat(table_name, '.', column_name) AS 'foreign key',
                   concat(referenced_table_name, '.', referenced_column_name) AS 'references'
            FROM information_schema.key_column_usage
            WHERE referenced_table_name IS NOT NULL;
            

            This is supposed to show all tables with foreign keys. Unfortunately, there are none. So MDEV-13331 cannot be the only reason, at least AFAICT, because my DB keeps crashing anyway.

            RaffiOl Ralf Fischer added a comment - - edited

            Just checked 10.2.8 with our three-node cluster: as soon as we write to two or more nodes instead of only one, we get the same error as with the other 10.2 versions.

            Aug 24 07:22:46 mysql-node2 mysqld: WSREP: BF lock wait long
            Aug 24 07:22:46 mysql-node2 mysqld:
            Aug 24 07:22:46 mysql-node2 mysqld: =====================================
            Aug 24 07:22:46 mysql-node2 mysqld: 2017-08-24 07:22:46 0x7f4e89a4b700 INNODB MONITOR OUTPUT
            Aug 24 07:22:46 mysql-node2 mysqld: =====================================
            Aug 24 07:22:46 mysql-node2 mysqld: Per second averages calculated from the last 15 seconds
            Aug 24 07:22:46 mysql-node2 mysqld: -----------------
            Aug 24 07:22:46 mysql-node2 mysqld: BACKGROUND THREAD
            Aug 24 07:22:46 mysql-node2 mysqld: -----------------
            Aug 24 07:22:46 mysql-node2 mysqld: srv_master_thread loops: 40245 srv_active, 0 srv_shutdown, 25086 srv_idle
            Aug 24 07:22:46 mysql-node2 mysqld: srv_master_thread log flush and writes: 65331
            Aug 24 07:22:46 mysql-node2 mysqld: ----------
            Aug 24 07:22:46 mysql-node2 mysqld: SEMAPHORES
            Aug 24 07:22:46 mysql-node2 mysqld: ----------
            Aug 24 07:22:46 mysql-node2 mysqld: OS WAIT ARRAY INFO: reservation count 520504
            Aug 24 07:22:46 mysql-node2 mysqld: OS WAIT ARRAY INFO: signal count 1030299
            Aug 24 07:22:46 mysql-node2 mysqld: RW-shared spins 0, rounds 1439898, OS waits 149766
            Aug 24 07:22:46 mysql-node2 mysqld: RW-excl spins 0, rounds 14688668, OS waits 112658
            Aug 24 07:22:46 mysql-node2 mysqld: RW-sx spins 410293, rounds 5591848, OS waits 49179
            Aug 24 07:22:46 mysql-node2 mysqld: Spin rounds per wait: 1439898.00 RW-shared, 14688668.00 RW-excl, 13.63 RW-sx
            Aug 24 07:22:46 mysql-node2 mysqld: ------------------------
            Aug 24 07:22:46 mysql-node2 mysqld: LATEST FOREIGN KEY ERROR
            Aug 24 07:22:46 mysql-node2 mysqld: ------------------------
            Aug 24 07:22:46 mysql-node2 mysqld: 2017-08-24 06:50:26 0x7f53fe4e9700 Transaction:
            Aug 24 07:22:46 mysql-node2 mysqld: TRANSACTION 3020224898, ACTIVE 0 sec inserting
            Aug 24 07:22:46 mysql-node2 mysqld: mysql tables in use 2, locked 2
            Aug 24 07:22:46 mysql-node2 mysqld: 5 lock struct(s), heap size 1136, 2 row lock(s)
            Aug 24 07:22:46 mysql-node2 mysqld: MySQL thread id 2, OS thread handle 0x7f53fe4e9700, query id 5813680 copy to tmp table
            Aug 24 07:22:46 mysql-node2 mysqld: ALTER TABLE `catalog_product_flat_1` ADD CONSTRAINT `FK_CAT_PRD_FLAT_1_ENTT_ID_CAT_PRD_ENTT_ENTT_ID` FOREIGN KEY (`entity_id`) REFERENCES `catalog_product_entity` (`entity_id`) ON DELETE CASCADE ON UPDATE CASCADE
            Aug 24 07:22:46 mysql-node2 mysqld: Foreign key constraint fails for table `magento_felt_b2b`.`#sql-4c6d_2`:
            Aug 24 07:22:46 mysql-node2 mysqld: ,
            Aug 24 07:22:46 mysql-node2 mysqld: CONSTRAINT `FK_CAT_PRD_FLAT_1_ENTT_ID_CAT_PRD_ENTT_ENTT_ID` FOREIGN KEY (`entity_id`) REFERENCES `catalog_product_entity` (`entity_id`) ON DELETE CASCADE ON UPDATE CASCADE in parent table, in index PRIMARY tuple:

            natrinicle Nathan Bohman added a comment -

            Sorry, I should have been more clear. 10.2.8 fixed a bug that was pulled in from upstream MySQL that caused MariaDB to ignore timeouts for DROP operations. I'm wondering if a similar piece of code was pulled in at the same time for other operations like UPDATE and SELECT?
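            (Purely as a side note, not from Nathan's comment: when comparing behaviour between versions or nodes, it can help to dump the relevant lock-wait settings on each node and diff the results. These are standard server variables; the snippet is only an illustration.)

            # run on each node; the timeouts should be identical across the cluster
            mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
                      ('innodb_lock_wait_timeout', 'lock_wait_timeout',
                       'wsrep_retry_autocommit', 'wsrep_provider_options');"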

            rgpublic Ranjan Ghosh added a comment -

            Well, the issue you mentioned earlier, MDEV-13331, is only about DELETE CASCADE with FOREIGN KEYS (FK). At least this is what it says in the title. And this is, indeed, the bug Ralf sent the logs for. Although it seems to happen to him with 10.2.8 as well (@Ralf: perhaps you'd like to mention over there as well that it's still not fixed for you). For my part, I'm not aware of using any foreign keys at all (see my previous comment on how I figured this out). So this bug here is at least not exclusively about foreign keys. I now understand your idea is: try to find similar pulled-in code for other operations... It's difficult to find it, though, without knowing specifically what exactly we're searching for...


            cio@zakupka.com Dutchak Vitalij added a comment -

            Same bug after upgrading to version 10.2.7.

            cio@zakupka.com Dutchak Vitalij added a comment -

            Does 10.2.8 have the same issue?
            mneumann Michael Neumann added a comment - - edited

            I can confirm that the problem still exists in 10.2.8:

            Sep 19 08:34:50 dbm-ace-live-data-03 mysqld: 2017-09-19  8:34:50 140484532815616 [Warning] WSREP: Failed to report last committed 10768816, -4 (Interrupted system call)
            Sep 19 08:34:54 dbm-ace-live-data-03 mysqld: 2017-09-19  8:34:54 140484532815616 [Warning] WSREP: Failed to report last committed 10768818, -4 (Interrupted system call)
            Sep 19 08:35:01 dbm-ace-live-data-03 CRON[1269]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
            Sep 19 08:35:02 dbm-ace-live-data-03 mysqld: 2017-09-19  8:35:02 140484532815616 [Warning] WSREP: Failed to report last committed 10768925, -4 (Interrupted system call)
            Sep 19 08:35:51 dbm-ace-live-data-03 mysqld: WSREP: BF lock wait long
            Sep 19 08:36:00 dbm-ace-live-data-03 mysqld: message repeated 9 times: [ WSREP: BF lock wait long]
            Sep 19 08:36:01 dbm-ace-live-data-03 CRON[1361]: (root) CMD (/bin/chmod -R a+r /var/log >/dev/null 2>&1)
            Sep 19 08:36:01 dbm-ace-live-data-03 mysqld: WSREP: BF lock wait long
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: message repeated 5 times: [ WSREP: BF lock wait long]
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld:
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: =====================================
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: 2017-09-19 08:36:06 0x7fc3dc748700 INNODB MONITOR OUTPUT
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: =====================================
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: Per second averages calculated from the last 50 seconds
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: -----------------
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: BACKGROUND THREAD
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: -----------------
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: srv_master_thread loops: 333435 srv_active, 0 srv_shutdown, 144764 srv_idle
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: srv_master_thread log flush and writes: 478163
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: ----------
            

            So, is it an option to downgrade to 10.1?

            chises Michael Weber added a comment -

            So "good news" for me..
            since version 10.2.9 we have the cluster running without any issues for >5 days.

            mysql --version
            mysql  Ver 15.1 Distrib 10.2.9-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
             
            systemctl status mysql
            ● mariadb.service - MariaDB database server
               Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
              Drop-In: /etc/systemd/system/mariadb.service.d
                       └─migrated-from-my.cnf-settings.conf
               Active: active (running) since Thu 2017-09-28 16:27:02 CEST; 5 days ago
            

            I don't quite trust it yet, but I hope that this version fixed it for me/us.

            Any experience with this version from other users?

            mneumann Michael Neumann added a comment - - edited

            No good news for me, the problem still exists:

            dbm-ace-pre-02:~ # mysql --version
            mysql  Ver 15.1 Distrib 10.2.9-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
            ....
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: WSREP: BF lock wait long
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld:
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: =====================================
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: 2017-10-05 09:45:58 0x7fb33bcf5700 INNODB MONITOR OUTPUT
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: =====================================
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: Per second averages calculated from the last 30 seconds
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: -----------------
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: BACKGROUND THREAD
            

            My setup consists of three data nodes with two MaxScale proxies in front of them and one hardware load balancer as the entry point.

            I can very easily reproduce the problem when I switch the two MaxScale proxies to "Galera mode", i.e. writes to all nodes allowed.
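            (To make the "Galera mode" wording concrete, here is an illustrative sketch only, not Michael's actual configuration: with MaxScale, multi-writer mode is typically a readconnroute service restricted to synced Galera nodes, whereas a readwritesplit service keeps all writes on a single node, which matches the reports above that single-node writes do not trigger the stall.)

            # maxscale.cnf (sketch; server names and credentials are placeholders)
            [Galera-Monitor]
            type=monitor
            module=galeramon
            servers=node1,node2,node3
            user=maxmon
            password=maxmon_pw

            # multi-writer: connections (and therefore writes) can land on any synced node
            [Galera-RW-Service]
            type=service
            router=readconnroute
            router_options=synced
            servers=node1,node2,node3
            user=maxuser
            password=maxuser_pw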

            rgpublic Ranjan Ghosh added a comment -

            Hm. Right, still unsolved with the most recent version. Sigh. I have to admit I had faintly hoped I might return from my vacation and find this bug finally solved. What I don't understand is that nobody on the core dev team seems to even be interested in this. Don't get me wrong - as a developer, I know that there are difficult bugs that take a lot of time to figure out etc. But there is not even a reaction, nothing. I think this is a huge bug that in fact renders Galera completely useless. If the database locks up once per day (some even report it happening every few minutes) and requires manual intervention, it is - let's be honest - simply not usable on a production server. And I don't need Galera on a dev system. So you cannot use it at all. But the integrated Galera was one of the main selling points of using MariaDB over MySQL. At least it was for us. For a long time I thought: "OK, it's a beta version. Just wait. They'll certainly fix this in the future." Now MariaDB 10.2.x is even considered "stable", AFAIK. Nothing could be further from the truth.

            mneumann Michael Neumann added a comment - - edited

            You took the words right out of my mouth. The reason we gave MariaDB (Galera) a shot is the write scalability. We tested it with one of our key systems and ran into this bug. The combination with MaxScale seemed so promising. Now we are moving to our Oracle RAC system.


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi Everyone!

            I am assigned to this bug, but I have been very busy with other bugs. I will look into this bug this week and hopefully come up with a workaround / patch.

            • Sachin
            RaffiOl Ralf Fischer added a comment -

            Hi Sachin, that's great! Thanks in advance for your support!

            anikitin Andrii Nikitin (Inactive) added a comment - - edited

            I was able to reproduce the problem with the script below, which downloads and unpacks 10.2.9, configures 3 nodes on ports 3307, 3308 and 3309, and applies some concurrent write load:

            set -e
            ver=${1:-10.2.9}
            # just use current directory if called from framework
            if [ ! -f common.sh ] ; then
              [ -d mariadb-environs ] || git clone http://github.com/AndriiNikitin/mariadb-environs
              cd mariadb-environs
              ./get_plugin.sh galera
            fi
             
            _template/plant_cluster.sh cluster1
            echo m1 > cluster1/nodes.lst
            echo m2 >> cluster1/nodes.lst
            echo m3 >> cluster1/nodes.lst
            cluster1/replant.sh ${ver}
             
            ./build_or_download.sh m1
             
            cluster1/gen_cnf.sh innodb_buffer_pool_size=1G \
                    innodb_log_file_size=256M \
                    innodb_flush_log_at_trx_commit=2 \
                    innodb_lock_wait_timeout=10 \
                    transaction-isolation=READ-COMMITTED \
                    innodb_stats_persistent=0
            cluster1/install_db.sh
            # cluster1/galera_setup_acl.sh # this needed for mysqldump
            cluster1/galera_start_new.sh wsrep_sst_method=rsync
             
            sleep 10
            cluster1/galera_cluster_size.sh
             
            for i in {1..5}; do
              m1*/sql.sh create table t$i select seq as a, seq as b from seq_1_to_10000
              m1*/sql.sh "alter table t$i add unique index(a)"
            done
             
            trap 'kill $(jobs -p)' EXIT
             
            for m in {1..3}; do
              for i in {1..5}; do
                export I=$i
                export M=$m
                ( while : ; do m$M*/sql.sh "start transaction; update t$I set b=b+1 where a=$I; delete from t$I where a=2*$I; do sleep(10); commit" 2>/dev/null || : ; done; )&
              done
              ( while : ; do
                for i in {1..30}; do m$M*/sql.sh "delete from t$i where a=floor(rand()*10000);" &>/dev/null ; done
              done ) &
              ( while : ; do
                for i in {1..30}; do m$M*/sql.sh "delete from t$i where a=floor(rand()*10000);" &>/dev/null ; done
              done ) &
            done
             
            while : ; do
              m1*/sql.sh show processlist
              m2*/sql.sh show processlist
              m3*/sql.sh show processlist
              grep 'BF lock wait long' m1*/dt/error.log | wc -l
              grep 'BF lock wait long' m2*/dt/error.log | wc -l
              grep 'BF lock wait long' m3*/dt/error.log | wc -l
              sleep 10
            done
            

            The output of the write SQL statements is suppressed; only the processlist for each node is displayed, followed by the number of "BF lock wait long" occurrences in each node's error log.
            After a few minutes I can see problems in the Galera committer threads:

            0
            0
            0
            2	system user		NULL	Sleep	99	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	10	committed 25	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            43	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            44	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            45	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            46	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            47	root	localhost	test	Query	9	updating	delete from t4 where a=floor(rand()*10000)	0.000
            48	root	localhost	test	Query	9	updating	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            50	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	97	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	9	committed 30	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            27	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            29	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            30	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            31	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            32	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            33	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	89	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	20	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	20	init	commit	0.000
            14	root	localhost	test	Query	20	init	commit	0.000
            15	root	localhost	test	Query	30	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	15	init	commit	0.000
            21	root	localhost	test	Query	10	init	commit	0.000
            25	root	localhost	test	Query	9	updating	update t1 set b=b+1 where a=1	0.000
            26	root	localhost	test	Query	0	init	show processlist	0.000
            0
            0
            10
            2	system user		NULL	Sleep	109	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	9	Update_rows_log_event::find_row(35)	update t4 set b=b+1 where a=4	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            47	root	localhost	test	Query	19	query end	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	9	init	commit	0.000
            51	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            52	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            53	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            54	root	localhost	test	Query	9	updating	update t4 set b=b+1 where a=4	0.000
            55	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	107	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	9	committed 40	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            34	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            35	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            36	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            37	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            38	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            39	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	99	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	30	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	30	init	commit	0.000
            14	root	localhost	test	Query	30	init	commit	0.000
            15	root	localhost	test	Query	40	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	25	init	commit	0.000
            21	root	localhost	test	Query	20	init	commit	0.000
            27	root	localhost	test	Query	8	updating	update t1 set b=b+1 where a=1	0.000
            28	root	localhost	test	Query	0	init	show processlist	0.000
            0
            0
            20
            2	system user		NULL	Sleep	119	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	19	Update_rows_log_event::find_row(35)	update t4 set b=b+1 where a=4	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            47	root	localhost	test	Query	29	query end	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	19	init	commit	0.000
            51	root	localhost	test	Query	9	init	commit	0.000
            52	root	localhost	test	Query	9	init	commit	0.000
            53	root	localhost	test	Query	9	init	commit	0.000
            56	root	localhost	test	Query	8	updating	update t4 set b=b+1 where a=4	0.000
            57	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	117	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	19	committed 40	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            40	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            41	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            42	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            43	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            44	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            45	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	109	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	40	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	40	init	commit	0.000
            14	root	localhost	test	Query	40	init	commit	0.000
            15	root	localhost	test	Query	50	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	35	init	commit	0.000
            21	root	localhost	test	Query	30	init	commit	0.000
            29	root	localhost	test	Query	7	updating	update t1 set b=b+1 where a=1	0.000
            30	root	localhost	test	Query	0	init	show processlist	0.000
            9
            0
            30
            2	system user		NULL	Sleep	129	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	29	Update_rows_log_event::find_row(35)	update t4 set b=b+1 where a=4	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            47	root	localhost	test	Query	39	query end	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	29	init	commit	0.000
            51	root	localhost	test	Query	20	init	commit	0.000
            52	root	localhost	test	Query	19	init	commit	0.000
            53	root	localhost	test	Query	19	init	commit	0.000
            58	root	localhost	test	Query	7	updating	update t4 set b=b+1 where a=4	0.000
            59	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	127	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	29	committed 40	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            40	root	localhost	test	Query	9	init	commit	0.000
            41	root	localhost	test	Query	9	init	commit	0.000
            42	root	localhost	test	Query	9	init	commit	0.000
            43	root	localhost	test	Query	9	init	commit	0.000
            44	root	localhost	test	Query	9	init	commit	0.000
            46	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	119	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	50	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	50	init	commit	0.000
            14	root	localhost	test	Query	50	init	commit	0.000
            15	root	localhost	test	Query	60	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	45	init	commit	0.000
            21	root	localhost	test	Query	40	init	commit	0.000
            31	root	localhost	test	Query	6	updating	update t1 set b=b+1 where a=1	0.000
            32	root	localhost	test	Query	0	init	show processlist	0.000
            19
            0
            40
            

            sachin.setiya.007, let me know if this script is OK or if I should work on an mtr case with similar load.


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi anikitin!

            I tried to run this script on jessie (buildbot). One hour run, no "BF lock wait long". The thing is, the script is not deterministic (and I think the bug is as well). If we could have an mtr case which simulates it for sure, that would be great.
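            (As a possible starting point for the requested mtr case, here is only a sketch in the style of the existing galera test suite, assuming the standard include/galera_cluster.inc setup with connections node_1 and node_2; a real reproducer for the stall would most likely need far more concurrency than this.)

            --source include/galera_cluster.inc
            --source include/have_innodb.inc

            --connection node_1
            CREATE TABLE t1 (id BIGINT PRIMARY KEY, status INT) ENGINE=InnoDB;
            INSERT INTO t1 VALUES (716071007, 0);

            # Local transaction on node_1 holds the row lock ...
            --connection node_1
            START TRANSACTION;
            UPDATE t1 SET status = 100 WHERE id = 716071007;

            # ... while node_2 commits a conflicting update that replicates to node_1
            --connection node_2
            UPDATE t1 SET status = 101 WHERE id = 716071007;

            # The applier (BF) transaction should win; the local COMMIT may be aborted
            --connection node_1
            --error 0,ER_LOCK_DEADLOCK
            COMMIT;

            DROP TABLE t1;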
            Sukan Sukan added a comment -

            Hi Sachin,

            The scenario below should help you reproduce the bug.

            1. On one node, updates for a particular ID are fired continuously.
            2. On another node, I set a different value using the same WHERE clause.

            I have a table dummy, where I generated concurrent updates on node 1 and node 2 at the same time.

            *Node 1* update dummy set status=100 where id='716071007'; // This runs with 500 concurrency and 100 iterations
            *Node 2* update dummy set status=101 where id=716071007;  // This runs with 500 concurrency and 100 iterations
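
            (The report does not say which tool fires these updates; purely as an illustration, a comparable 500-way load could be generated with mysqlslap, with hostnames and credentials as placeholders.)

            # Node 1: 500 concurrent clients, 100 iterations
            mysqlslap --host=node1 --user=sbtest -p --create-schema=test \
                      --concurrency=500 --iterations=100 \
                      --query="update dummy set status=100 where id='716071007'"
            # Node 2: same load with a different value, started at the same time
            mysqlslap --host=node2 --user=sbtest -p --create-schema=test \
                      --concurrency=500 --iterations=100 \
                      --query="update dummy set status=101 where id=716071007"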
            

            GC should have thrown a deadlock error, but instead it keeps accepting the queries.

            On node 2, when the query from node 1 replicates while a direct UPDATE hits the same row, it throws the error below.

            WSREP: BF lock wait long
            WSREP: BF lock wait long
            

            This makes the cluster stall, and we need to forcefully kill it to bring it back.

            NOTE: Tested in 10.2.8 and 10.2.9, both producing "BF lock wait long", but not in 10.1.26.

            Steps to reproduce :

            CREATE TABLE `dummy` (
              `id` bigint(20) NOT NULL AUTO_INCREMENT,
              service varchar(20),
              status int(11),
              PRIMARY KEY (`id`)
            ) ENGINE=InnoDB AUTO_INCREMENT=716071002 DEFAULT CHARSET=utf8;
            Query OK, 0 rows affected (0.43 sec)

            MariaDB [test]> insert into dummy (service,status) values ('MySQL',1),('MSSQL',2),('MONGODB',3);
            Query OK, 3 rows affected (0.13 sec)
            Records: 3 Duplicates: 0 Warnings: 0

            *Generate concurrent updates*

            *Node 1* update dummy set status=100 where id='716071007';
            *Node 2* update dummy set status=101 where id=716071007;
            

            Process list :
            ------------------------------------------------------------------------------------------------------------------------------------------

            Id User Host db Command Time State Info Progress

            ------------------------------------------------------------------------------------------------------------------------------------------

            1 system user   NULL Sleep 4 Update_rows_log_event::find_row(718) update dummy set status=100 where id='716071007' 0.000
            2 system user   NULL Sleep 501 wsrep aborter idle NULL 0.000
            6 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
            3 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            4 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            5 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            7 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
            11 monitor localhost:44902 NULL Sleep 0   NULL 0.000
            17 root localhost test Query 0 init show processlist 0.000
            22 sbtest localhost:46666 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            23 sbtest localhost:46696 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            24 sbtest localhost:46850 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            25 sbtest localhost:46854 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            26 sbtest localhost:46856 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            36 sbtest localhost:47108 test Query 4 updating update dummy set status=100 where id='716071007' 0.000
            38 sbtest localhost:47144 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            39 sbtest localhost:47182 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            37 sbtest localhost:47140 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            40 sbtest localhost:47186 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            42 sbtest localhost:47222 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            41 sbtest localhost:47202 test Query 4 updating update dummy set status=100 where id='716071007' 0.000
            43 sbtest localhost:47232 test Query 4 updating update dummy set status=100 where id='716071007' 0.000
            27 sbtest localhost:46892 test Query 4 query end update dummy set status=100 where id='716071007' 0.000
            30 sbtest localhost:47042 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            31 sbtest localhost:47048 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            32 sbtest localhost:47062 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            33 sbtest localhost:47066 test Query 4 updating update dummy set status=101 where id=716071007 0.000

            Finally, the cluster stalls until we kill MySQL:

            +----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
            | Id | User        | Host            | db   | Command | Time | State                                | Info                                             | Progress |
            +----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
            |  1 | system user |                 | NULL | Sleep   |  144 | Update_rows_log_event::find_row(718) | update dummy set status=100 where id='716071007' |    0.000 |
            |  2 | system user |                 | NULL | Sleep   |  641 | wsrep aborter idle                   | NULL                                             |    0.000 |
            |  6 | system user |                 | NULL | Daemon  | NULL | InnoDB purge coordinator             | NULL                                             |    0.000 |
            |  3 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker                  | NULL                                             |    0.000 |
            |  4 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker                  | NULL                                             |    0.000 |
            |  5 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker                  | NULL                                             |    0.000 |
            |  7 | system user |                 | NULL | Daemon  | NULL | InnoDB shutdown handler              | NULL                                             |    0.000 |
            | 11 | monitor     | localhost:44902 | NULL | Sleep   |    0 |                                      | NULL                                             |    0.000 |
            | 17 | root        | localhost       | test | Query   |    0 | init                                 | show processlist                                 |    0.000 |
            | 27 | sbtest      | localhost:46892 | test | Query   |  144 | query end                            | update dummy set status=100 where id='716071007' |    0.000 |
            +----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
            
            

            Hope this will help you in addressing the issue.

            -Sukan
            DBA, Mafiree

            ari Ari Maniatis added a comment -

            @sachin.setiya.007 Are you able to reproduce the problem now? I can make it happen on 10.2.7 within a few minutes by running Jira with the loadbalancer JDBC connection across two nodes. This happens with very little load almost right away.

            Since we have no workaround and no way of avoiding the bug, there needs to be some recommendation to avoid Galera entirely with MariaDB 10.2.x. Clearly this isn't ready for production. There are also other important issues to watch out for, since 10.2.9 will not start with Galera because of this bug.

            Is there some way we can get a higher level of attention for Galera issues, or for more automated testing in this area? I think this could be MariaDB's key advantage over the competition if this stuff were just really solid.

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Sorry, I was travelling, but I will try to do it now. I managed to hit the bug with anikitin's script, so I am writing an mtr test and am on the way to a fix.

            tasdinq Tas Dinq added a comment -

            This bug also affects server version 10.2.6-MariaDB-10.2.6+maria~xenial-log (mariadb.org binary distribution) in my production environment. It's driving me crazy; I need to force a restart of the full cluster each time.

            I have a lot of client processes updating a table, sometimes with the same id on different nodes, like:
            UPDATE foo SET foo.bar = 1 or 0 WHERE id=xxx;
            very similar to Sukan's test case.

            Is there any progress on this bug?
            Or any temporary workaround for it?

            Sukan Sukan added a comment -

            Hi Tas,

            I would suggest running the cluster with a single write node: you can point all the write traffic to one node. This would be a temporary workaround without making any big changes.

            Alternatively, you can downgrade the cluster completely to a 10.1.x version, where we ran the test and didn't hit this bug.

            Thanks,
            Sukan
            DBA, Mafiree

            tasdinq Tas Dinq added a comment -

            THANK YOU, Sukan! Thank you for your advice.

            I'm trying to use MaxScale's ReadWriteSplit module to route all writes to a single node; hopefully my life will be easier with this.

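            As an aside for anyone testing this kind of single-writer routing, a small sanity check (a sketch only; these are standard Galera status/variable names, and the values depend on your cluster) is to connect through the proxy from the application side and confirm that every write connection lands on the same node:

            -- Which cluster node is this connection actually on?
            SHOW VARIABLES LIKE 'wsrep_node_name';
            -- Index of this node within the cluster; it should be identical for all write connections.
            SHOW STATUS LIKE 'wsrep_local_index';
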
            sibirsky87 Anton Baykov added a comment -

            Hi Sachin,
            any progress on this bug?

            laocius TAO ZHOU added a comment -

            I hit the error 'BF-BF X lock conflict' even though I am only writing to one node.

            2017-11-09  4:18:49 50152503296 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            2017-11-09 10:30:37 50153508864 [Note] InnoDB: BF-BF X lock conflict,mode: 1027 supremum: 0
            2017-11-09 10:30:37 50153508864 [Note] InnoDB: conflicts states: my 0 locked 0
            RECORD LOCKS space id 316718 page no 3049 n bits 240 index PRIMARY of table `
            

            pyfrom Mark added a comment -

            Is there any update on this ticket, working towards a fix? There has been a distinct lack of feedback about this issue. Not that I expect a fix right away, but let us know what's happening, and maybe give us a better understanding of what the issue is.

            We have been experiencing this problem even though we are sending all of the traffic to one node only (out of 3). Our application seems to lock up about once a month or so.

            Mark

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi,

            I have been assigned to this issue, but for around 10 days I was on vacation. Currently I am occupied with 10715 and 10177. I will look into this issue after those are done. This issue also requires somewhat more understanding, because I am not familiar with the InnoDB code.

            Regards
            sachin

            ari Ari Maniatis added a comment -

            Hi Sachin

            I know you have your own set of priorities based on your customers, but I'd like to encourage you to look at this bug with higher priority, particularly when compared to building a new feature for 10.3 (MDEV-10177). This bug:

            • completely crashes MariaDB
            • has no workaround
            • affects anyone who is using Galera, even if they disable writing to multiple nodes

            In short, this bug is completely catastrophic for all production use of MariaDB if Galera is enabled. I think MariaDB management should consider pulling the entire 10.2 release from production-ready status and think about how users might roll back to 10.1 if this bug can't be fixed in the next minor 10.2.11 release. This should be on the front page of your site or download page.

            ari Ari Maniatis added a comment - Hi Sachin I know you have your own set of priorities based on your customers, but I'd like to encourage you to look at this bug with higher priority. Particularly when compared to building a new feature for 10.3 ( MDEV-10177 ). This bug: completely crashes mariadb has no workaround affects anyone who is using galera, even if they disable writing to multiple nodes In short, this bug is completely catastrophic for all production use of mariaDB if galera is enabled. I think mariaDB management should consider pulling the entire 10.2 release from production ready status and think about how users might roll back to 10.1 if this bug can't be fixed in the next minor 10.2.11 release. This should be on the front page of your site or download page.
            anikitin Andrii Nikitin (Inactive) added a comment - edited

            The mtr test case is in MDEV-14401, according to instructions from https://jira.mariadb.org/browse/MDEV-12837?focusedCommentId=101543&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-101543

            serg Sergei Golubchik added a comment -

            reassigned to jplindst because seppo has found that

            My debugging session from today shows that the potential smoking gun is
            function lock_grant_and_move_on_page() in lock0lock.cc. This function does not
            honor locking priority policy, and leaves a high priority lock in wait queue
            favoring a normal priority lock.

            jplindst Jan Lindström (Inactive) added a comment -

            When using VATS we favor older transactions; this is bad for Galera, as we should favor BF transactions.

            jplindst Jan Lindström (Inactive) added a comment - https://github.com/MariaDB/server/commit/7d4d54e415c3d6081b71b024532b5ba8f1addc4a

            marko Marko Mäkelä added a comment -

            Please try to find a solution that does not involve adding fields to trx_t. I cannot approve the addition of trx_t::trx_mutex_taken.

            There are many Galera test failures on bb-10.2-MDEV-12837. Can you please rebase the fix on the latest 10.2, and then post a detailed analysis of all test failures?

            jplindst Jan Lindström (Inactive) added a comment - https://github.com/MariaDB/server/commit/9f6dd866c163822e2e3b52345685722c6dec2b2b

            marko Marko Mäkelä added a comment -

            The change looks much cleaner now, but I’d like to have one more review round.

            marko Marko Mäkelä added a comment -

            I would like to see a revised test push to buildbot that contains only this change, based on the latest 10.2. The current bb-10.2-MDEV-12837 contains some commits related to MDEV-14401. It seems that MDEV-14401 actually depends on this MDEV-12837. It looks like the changes to lock_rec_lock_slow() and lock_table() in the initial MDEV-14401 patch have been made redundant by the MDEV-12837 change to innobase_init().

            jplindst Jan Lindström (Inactive) added a comment - edited

            Split the commits into two parts: https://github.com/MariaDB/server/commit/1ffd8b30f9f86aedd42c91f72db9d54f2f8576d7
            The test case is in https://github.com/MariaDB/server/commit/7f555777b36983d4b4590df6634aa22170e90bc3, because it requires both fixes.

            marko Marko Mäkelä added a comment -

            I sent some review comments to the first commit. I still think that we must convert the algorithm at InnoDB startup, not while InnoDB is already open for SQL.
            Where is this running on Buildbot?

            jplindst Jan Lindström (Inactive) added a comment -

            bb-10.2-galera

            marko Marko Mäkelä added a comment -

            I now see that bb-10.2-galera consists of two commits on top of 10.2: MDEV-12837 and MDEV-14401.
            If the MDEV-12837 fix were working as requested (adjusting the configuration at InnoDB startup), then I think that all the configuration changes

            innodb-lock-schedule-algorithm=fcfs
            

            in the MDEV-14401 fix should be unnecessary and should be removed. The default VATS should be fine. We should not even need to suppress any warning; at startup, InnoDB could issue an informational note (not warning) that the incompatible parameter was adjusted.

            One more question: Why does the MDEV-14401 commit disable the following occasionally failing tests:

            diff --git a/mysql-test/suite/galera/disabled.def b/mysql-test/suite/galera/disabled.def
            index 423c83208f8..801b12a1f84 100644
            --- a/mysql-test/suite/galera/disabled.def
            +++ b/mysql-test/suite/galera/disabled.def
            @@ -56,3 +56,8 @@ MW-284 : MDEV-13549 Galera test failures 10.1
             galera_as_slave : MDEV-13549 Galera test failures 10.1
             galera_var_innodb_disallow_writes : MDEV-10949
             galera_kill_applier : race condition at the start of the test
            +MW-328C: MDEV-13549 Galera test failures 10.1
            +MW-328A: MDEV-13549 Galera test failures 10.1
            +MW-328B: MDEV-13549 Galera test failures 10.1
            +MW-328: MDEV-13549 Galera test failures 10.1
            +galera_suspend_slave: MDEV-13549 Galera test failures 10.1
            \ No newline at end of file
            

            The commit message does not explain this.
            I think that if it is considered necessary to disable these tests, they should be disabled in a separate commit, with a reference to separate tasks, such as MDEV-14149 or MDEV-13876 or even MDEV-13549.

            jplindst Jan Lindström (Inactive) added a comment -

            commit da3a3a68df34c7fef387ce890d3925166edeef2c
            Author: Jan Lindström <jan.lindstrom@mariadb.com>
            Date: Thu Dec 7 12:26:29 2017 +0200

            MDEV-12837: WSREP: BF lock wait long

            Problem was a merge error from MySQL wsrep i.e. Galera.

            wsrep_on_check
            New check function. Galera can't be enabled
            if innodb-lock-schedule-algorithm=VATS.

            innobase_kill_query
            In Galera async kill we could own lock mutex.

            innobase_init
            If Variance-Aware-Transaction-Sheduling Algorithm (VATS) is
            used on Galera we fall back to First-Come-First-Served (FCFS)
            with notice to user.

            Changed innodb-lock-schedule-algorithm as read-only parameter
            as it was designed to be.

            lock_reset_lock_and_trx_wait
            Use ib::hex() to print out transaction ID.

            lock_rec_other_has_expl_req,
            lock_rec_other_has_conflicting,
            RecLock::add_to_waitq
            lock_rec_lock_slow
            lock_table_other_has_incompatible
            lock_rec_insert_check_and_lock
            lock_prdt_other_has_conflicting

            Change pointer to conflicting lock to normal pointer as this
            pointer contents could be changed later.

            RecLock::create
            Conclicting lock pointer is moved to last parameter with
            default value NULL. This conflicting transaction could
            be selected as victim in Galera if requesting transaction
            is BF (brute force) transaction. In this case contents
            of conflicting lock pointer will be changed. Use ib::hex() to print
            transaction ids.


            jplindst Jan Lindström (Inactive) added a comment - edited

            greenman Can you document that innodb-lock-schedule-algorithm is a read-only variable (i.e. it can't be changed dynamically) and that the VATS method is not supported on Galera?

            Here is what the documentation should mention:

            • In 10.1, the default is FCFS, and InnoDB will refuse to start if VATS is used with Galera.
            • In >=10.2, the default is VATS, and with Galera it will be changed to FCFS.
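
            For reference, a minimal way to check this behaviour on a running server (a sketch only; these are the variable names used in MariaDB 10.1/10.2, and the output depends on your configuration):

            -- Is Galera enabled on this node?
            SHOW GLOBAL VARIABLES LIKE 'wsrep_on';
            -- With Galera enabled and the fix applied, this should report FCFS even if VATS was requested.
            SHOW GLOBAL VARIABLES LIKE 'innodb_lock_schedule_algorithm';
            -- The variable is read-only, so changing it means editing my.cnf and restarting the server.
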
            rgpublic Ranjan Ghosh added a comment -

            Wowwwww. Incredible. After many months of suffering, finally a MariaDB version that really deserves the label "stable"... perhaps before Christmas? Can it be true? Don't tell me I'm dreaming. Now we're all waiting anxiously for the binary to arrive in the repos... Thanks to everyone involved and, especially, to Jan for solving this.

            ari Ari Maniatis added a comment -

            @jplindst Was there supposed to be a commit to the 10.2 branch as well? I could only see one to the 10.1 branch (where I think the bug wasn't manifesting) and to 10.3.

            https://github.com/MariaDB/server/search?q=MDEV-12837&type=Commits&utf8=%E2%9C%93

            Are you able to confirm this will land in 10.2.12? That will help us plan our testing for being able to use galera cluster in production.

            serg Sergei Golubchik added a comment -

            No, there wasn't supposed to be a separate 10.2 commit: every 10.1 commit will inevitably land in 10.2 eventually.

            Normally this happens at least before the next 10.2 release, so this commit will be in the 10.2.12, just like "Fix Version/s" field at the top of the page says.

            jplindst Jan Lindström (Inactive) added a comment -

            Hi, there seems to be some kind of information loss here: commit da3a3a68df34c7fef387ce890d3925166edeef2c is for 10.2 (10.1 had no merge error and FCFS is the default there, so the change is slightly different).

            rgpublic Ranjan Ghosh added a comment -

            I'd suggest adding a note (like the one below 10.1.22) on:

            https://downloads.mariadb.org/mariadb/+releases/

            Perhaps something like: "Updating from 10.2.11 to 10.2.12 is highly recommended for Galera users because previous versions are unstable".
            In addition - IMHO - this should also be mentioned in the release notes. The term "stable" on the above-mentioned page is really misleading for older versions, because we all know they will probably crash sooner or later due to this bug - even with very ordinary use patterns.

            danblack Daniel Black added a comment -

            rgpublic great work. I suggest the description in https://mariadb.com/kb/en/library/mariadb-10130-release-notes/ could be meaningfully expanded (replaced). This is a page humble users like you and me can edit, and it probably serves as a good enough place for people who are looking to upgrade to read.

            rgpublic Ranjan Ghosh added a comment -

            Ah, really cool. Thanks, Daniel. Didn't know that everyone can change these pages. I've added a warning box and a reference to this bug here:

            https://mariadb.com/kb/en/library/mariadb-10212-release-notes/

            For 10.1.x I think the already existing line under "Notable changes" should be sufficient, as the bug doesn't seem to manifest there - so no need to cause panic.

            adrian.tarau Adrian Tarau added a comment -

            It looks like it still happens ....

            innodb_version 5.7.20
            protocol_version 10
            slave_type_conversions
            version 10.2.12-MariaDB
            version_comment MariaDB Server
            version_compile_machine x86_64
            version_compile_os Linux
            version_malloc_library system
            version_ssl_library OpenSSL 1.0.1e-fips 11 Feb 2013
            wsrep_patch_version wsrep_25.21

            What can I provide to help troubleshoot this? It looks like the issue is triggered by a truncate table...

            LOCK WAIT
            MySQL thread id 138641, OS thread handle 139936878089984, query id 139638795 ........ Opening tables
            TRUNCATE TABLE `ds$etl_dimension_discovery_local_standard_1`
            2018-01-24 11:51:38 139938787669760 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1a3598f query: TRUNCATE TABLE `ds$etl_dimension_discovery_local_standard_1`
            TRANSACTION 27482511, ACTIVE 1997 sec

            As soon as this kicks in, everything locks. I cannot reproduce it in a consistent way. Also, when it happens, kill 138641 says "now an owner"... even for root! How is that possible?

            jplindst Jan Lindström (Inactive) added a comment -

            Can I have the full, unedited error log? Setting wsrep_debug=on, waiting until it repeats, and then providing the error log again would also help.

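            For anyone who wants to capture the same diagnostics, a minimal sketch (assuming MariaDB 10.2, where wsrep_debug is a dynamic boolean; it makes the error log very verbose, so turn it off again once the stall has been captured):

            -- Enable verbose wsrep diagnostics at runtime; it can also be set as wsrep_debug=ON under [mysqld] in my.cnf.
            SET GLOBAL wsrep_debug = ON;
            -- Disable it again after the stall has been logged.
            SET GLOBAL wsrep_debug = OFF;
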
            adrian.tarau Adrian Tarau added a comment -

            I'll try ... hard to say when it happens ... hopefully the log will not be GB of data with debug on ....

            adrian.tarau Adrian Tarau added a comment -

            I have 500MB of logs (uncompressed) ... I'll send you an email with a link to download the logs ....

            Forgot to mention: we are using a 3-node cluster (development database) with HAProxy, with one node dedicated to writes. Nothing interesting is logged on the (mostly) read-only nodes. A stress test is started with the database in the same state; after a while it locks, and it is always triggered by a DDL statement (TRUNCATE in this case; in the past I've seen these deadlocks with ALTER TABLE ... ADD PARTITION ...). Once the DDL enters the deadlock, no other query can be executed (and there are plenty of threads/connections available); even queries not related to that table lock up. Same with ALTER TABLE ... ADD PARTITION.

            First reference of a deadlock starts at 2018-01-24 15:39:09 139784560858880 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1e9eb4f query

            neiltembo Neil Skrypuch added a comment -

            I am also still seeing this issue with 10.2.12, although the behaviour has changed somewhat compared to 10.2.9 (the previous version we were running).

            Now with 10.2.12, it seems to only happen on one master at a time (not multiple masters simultaneously). Also, the query that gets regularly printed to the logs is always a DDL statement now, which it was not before.

            The other noteworthy change with 10.2.12 is that the log output is much terser. Instead of a complete dump of SHOW ENGINE INNODB STATUS; every second, we now just see 5 lines per second, like so:

            Jan 18 02:53:12 tm2 mysqld[16437]: 2018-01-18 2:53:12 140246917351168 [Note] InnoDB: WSREP: BF lock wait long for trx:0x76a45d4 query: CREATE TABLE IF NOT EXISTS ...
            Jan 18 02:53:12 tm2 mysqld[16437]: TRANSACTION 124405204, ACTIVE 269206 sec
            Jan 18 02:53:12 tm2 mysqld[16437]: LOCK WAIT
            Jan 18 02:53:12 tm2 mysqld[16437]: MySQL thread id 16, OS thread handle 140246857389824, query id 27596013 creating table
            Jan 18 02:53:12 tm2 mysqld[16437]: CREATE TABLE IF NOT EXISTS ...

            adrian.tarau Adrian Tarau added a comment -

            I do not see a way to re-open the defect ....

            serg Sergei Golubchik added a comment -

            adrian.tarau, please create a new one. You can refer to this one in the description if you'd like or copy-paste from it.

            jplindst Jan Lindström (Inactive) added a comment -

            Updated the correct releases where the actual fix is:

            commit 57cde8ccd19675dc98e3cbacf0ef5c72cb188e49 (HEAD -> 10.1, origin/10.1)
            Author: Jan Lindström <jan.lindstrom@mariadb.com>
            Date: Wed Jul 25 08:21:25 2018 +0300

            MDEV-15822: WSREP: BF lock wait long for trx

            In Galera BF (brute force) transactions may not wait for lock requests
            and normally BF-transaction would select transaction holding conflicting
            locks as a victim for rollback. However, background statistic calculation
            transaction is InnoDB internal transaction and it has no thd i.e. it can't be
            selected as a victim. If background statistics calculation transaction holds
            conflicting locks to statistics tables it will cause BF lock wait long
            error message. Correct way to handle background statistics calculation is to
            acquire thd for transaction but that change is too big for GA-releases and
            there are other reported problems on background statistics calculation.

            This fix avoids adding a table to background statistics calculation if

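
            The commit above identifies the background statistics transaction as the lock holder. Purely as an illustrative sketch (this is not advice given in the ticket, and the table name is just the dummy table from the earlier test case), persistent statistics can be inspected and, if needed, switched off for a hot table while waiting for a release that contains the fix:

            -- Are persistent statistics and automatic recalculation enabled globally?
            SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent';
            SHOW GLOBAL VARIABLES LIKE 'innodb_stats_auto_recalc';
            -- Opt a single table out of background statistics recalculation (illustrative only).
            ALTER TABLE dummy STATS_PERSISTENT=0;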

            People

              jplindst Jan Lindström (Inactive)
              rgpublic Ranjan Ghosh
              Votes: 19
              Watchers: 38
