Details

    Description

      Hi there,

      Every few days our cluster stalls. You always know doomsday has arrived when this message appears in the log:

      May 17 06:46:25 uhu1 mysqld[27227]: WSREP: BF lock wait long
      

      This message then repeats many times, and eventually other odd output appears (see below). The cluster never recovers; we always need to restart all nodes. I find this especially surprising, since there are various timeouts defined. If there is a deadlock, I wonder why it doesn't resolve automatically after some time by aborting with an error message. Googling for this message to find out whether it might be an application problem didn't yield any results either.

      May 17 06:46:26 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:28 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:29 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:30 uhu1 mysqld[27227]: message repeated 6 times: [ WSREP: BF lock wait long]
      May 17 06:46:31 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:32 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:33 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:34 uhu1 mysqld[27227]: message repeated 8 times: [ WSREP: BF lock wait long]
      May 17 06:46:35 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:36 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
      May 17 06:46:37 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:38 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:39 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
      May 17 06:46:40 uhu1 mysqld[27227]: WSREP: BF lock wait long
      May 17 06:46:40 uhu1 mysqld[27227]: =====================================
      May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 06:46:40 0x7f52c3cdd700 INNODB MONITOR OUTPUT
      May 17 06:46:40 uhu1 mysqld[27227]: =====================================
      May 17 06:46:40 uhu1 mysqld[27227]: Per second averages calculated from the last 16 seconds
      May 17 06:46:40 uhu1 mysqld[27227]: -----------------
      May 17 06:46:40 uhu1 mysqld[27227]: BACKGROUND THREAD
      May 17 06:46:40 uhu1 mysqld[27227]: -----------------
      May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread loops: 570080 srv_active, 0 srv_shutdown, 120059 srv_idle
      May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread log flush and writes: 690137
      May 17 06:46:40 uhu1 mysqld[27227]: ----------
      May 17 06:46:40 uhu1 mysqld[27227]: SEMAPHORES
      May 17 06:46:40 uhu1 mysqld[27227]: ----------
      May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: reservation count 2567993
      May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: signal count 1741581
      May 17 06:46:40 uhu1 mysqld[27227]: RW-shared spins 0, rounds 1276685, OS waits 365942
      May 17 06:46:40 uhu1 mysqld[27227]: RW-excl spins 0, rounds 16000314, OS waits 198061
      May 17 06:46:40 uhu1 mysqld[27227]: RW-sx spins 210079, rounds 4201641, OS waits 50893
      May 17 06:46:40 uhu1 mysqld[27227]: Spin rounds per wait: 1276685.00 RW-shared, 16000314.00 RW-excl, 20.00 RW-sx
      May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
      May 17 06:46:40 uhu1 mysqld[27227]: LATEST DETECTED DEADLOCK
      May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
      May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 03:24:35 0x7f52add94700
      May 17 06:46:40 uhu1 mysqld[27227]: *** (1) TRANSACTION:
      May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857376, ACTIVE 0 sec inserting
      May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
      May 17 06:46:40 uhu1 mysqld[27227]: LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
      May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704966, OS thread handle 0x7f52ae768700, query id 400303172 localhost 127.0.0.1 eposservicescom update
      May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1193996601591ba65393eb33.09632175', '1494984305.6058')
      May 17 06:46:40 uhu1 mysqld[27227]: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857376 lock_mode X locks rec but not gap waiting
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) TRANSACTION:
      May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857375, ACTIVE 0 sec inserting
      May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
      May 17 06:46:40 uhu1 mysqld[27227]: 3 lock struct(s), heap size 1136, 2 row lock(s)
      May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704967, OS thread handle 0x7f52add94700, query id 400303173 localhost 127.0.0.1 eposservicescom update
      May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1474565599591ba65393ebe9.44223691', '1494984305.6058')
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) HOLDS THE LOCK(S):
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock mode S locks rec but not gap
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
      May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock_mode X locks rec but not gap waiting
      May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
      May 17 06:46:40 uhu1 mysqld[27227]:  0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  1: len 6; hex 000011a26cb5; asc     l ;;
      May 17 06:46:40 uhu1 mysqld[27227]:  2: len 7; hex 40000006a82965; asc @    )e;;
      May 17 06:46:40 uhu1 mysqld[27227]:  3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
      May 17 06:46:40 uhu1 mysqld[27227]:  4: len 8; hex 4850f49be946d641; asc HP   F A;;
      May 17 06:46:40 uhu1 mysqld[27227]: *** WE ROLL BACK TRANSACTION (2)
      

          Activity

            chises Michael Weber added a comment - - edited

            Hello all.
            Since the update from 10.1 to 10.2.6 yesterday, we have been facing the same issue: "WSREP: BF lock wait long".
            The whole cluster is stuck after that. We can only bring it back by restarting the node that logs "WSREP: BF lock wait long".
            Note: we are using Debian (kernel 3.16.39-1+deb8u2).

            May 23 17:39:00 he-rz22-sql01 mysqld[14930]: WSREP: BF lock wait long
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: WSREP: BF lock wait long
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: =====================================
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: 2017-05-23 17:39:01 0x7f7365ba3700 INNODB MONITOR OUTPUT
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: =====================================
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Per second averages calculated from the last 16 seconds
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: -----------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: BACKGROUND THREAD
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: -----------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: srv_master_thread loops: 632 srv_active, 0 srv_shutdown, 788 srv_idle
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: srv_master_thread log flush and writes: 1420
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ----------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: SEMAPHORES
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ----------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: OS WAIT ARRAY INFO: reservation count 11612
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: OS WAIT ARRAY INFO: signal count 7425
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RW-shared spins 0, rounds 1553, OS waits 655
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RW-excl spins 0, rounds 4065, OS waits 84
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RW-sx spins 392, rounds 3750, OS waits 34
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Spin rounds per wait: 1553.00 RW-shared, 4065.00 RW-excl, 9.57 RW-sx
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: TRANSACTIONS
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ------------
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Trx id counter 246763323
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Purge done for trx's n:o < 246762955 undo n:o < 0 state: running but idle
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: History list length 35
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: LIST OF TRANSACTIONS FOR EACH SESSION:
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ---TRANSACTION 421620296599400, not started
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: 0 lock struct(s), heap size 1136, 0 row lock(s)
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ---TRANSACTION 246763322, ACTIVE 3 sec starting index read
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: mysql tables in use 1, locked 1
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: MySQL thread id 2113, OS thread handle 0x7f735eac8700, query id 47052 172.16.0.12 USER updating
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: UPDATE `devices` SET `updated_at`=1495553938, `last_contact`=1495553938 WHERE `id`=145
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: ------- TRX HAS BEEN WAITING 3 SEC FOR THIS LOCK TO BE GRANTED:
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: RECORD LOCKS space id 851 page no 5 n bits 128 index PRIMARY of table `DATABASE`.`devices` trx id 246763322 lock_mode X locks rec but not gap waiting
            May 23 17:39:01 he-rz22-sql01 mysqld[14930]: Record lock, heap no 18 PHYSICAL RECORD: n_fields 54; compact format; info bits 0
            
            

            rgpublic Ranjan Ghosh added a comment -

            Does anybody at least have an idea what causes this? Our cluster sometimes stalls multiple times a day and needs to be restarted, which is very, very ugly. I can't say for sure, but I have the feeling this bug was introduced in recent versions (10.2.3, 10.2.4 ...?). If only I knew a workaround or some setting that made this error at least less likely to appear. Right now, it's not really usable. I'm even considering turning off Galera and letting MariaDB run as a single node for the time being. My current config, in case that's of any help:

            [mysqld]
             
            # Enough buffer for data, indexes etc
            innodb_buffer_pool_size=4G
             
            # Do not wait too long for locks - connections might build up (Especially important for Drupal semaphore table)
            innodb_lock_wait_timeout=10
             
            #Recommended by Drupal
            max_allowed_packet=512M
             
            #Try to prevent crashes
            skip-log-bin
             
            # Purge faster :-)
            innodb_purge_threads = 16
             
            # Replicate faster :-)
            wsrep_slave_threads = 32
             
            # Reduce dead-locks
            transaction-isolation = READ-COMMITTED
             
            #Query Cache for faster performance
            query_cache_type = 1
            query_cache_size = 128M # Should not be too large to avoid cache contention
             
            #mysql settings mandatory for galera
            binlog_format=ROW
            binlog_row_image=MINIMAL
            default_storage_engine=InnoDB
            innodb_autoinc_lock_mode=2
            bind-address=0.0.0.0
            innodb_flush_log_at_trx_commit=0
             
            # Accept up to 300 incoming connections but use the thread pool to prevent too many simultaneous connections actually working at the same time
            max_connections=300
            thread-handling=pool-of-threads
            thread_pool_size=32
            extra-port=3307
            extra-max-connections=1
             
            #galera settings
            wsrep_on=ON
            wsrep_provider=/usr/lib/galera/libgalera_smm.so
            wsrep_cluster_name="uhu_cluster"
            wsrep_cluster_address="gcomm://node1,node2"
            wsrep_node_address=192.168.0.1
            wsrep_provider_options='gcache.size=2G'   
             
            enforce_storage_engine=InnoDB
            sql_mode=NO_AUTO_CREATE_USER
            

            chises Michael Weber added a comment -

            Bad news: we still have this issue.
            For the moment I have configured the load balancer to send requests to only one node. That seems to work for a while; we have had no crash for three days.
            Waiting for an update/patch.

            rgpublic Ranjan Ghosh added a comment -

            Unfortunately, even if I send requests to only one node, our cluster still crashes. Currently, I have only a single node running. This defeats the whole purpose of why we are using MariaDB/Galera in the first place, though. It is, however, stable.

            When researching this bug, I find many different anecdotal reports over the years:

            https://groups.google.com/forum/#!topic/codership-team/6wsv_jUmc5w
            https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1671542
            https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1197771
            https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/43848-cluster-locks-insert-update-issues
            https://www.mail-archive.com/maria-discuss@lists.launchpad.net/msg02441.html

            None with a real solution or any explanation of why (i.e. under which circumstances) this happens. The error seems very random to me. Sometimes it happens at night when there is hardly any load on the server at all. All reports have in common that the database never seems to recover by itself; in all reported cases a restart appeared to be required. Therefore, at the very least, I guess the database should just shut down (signal 11) in case this error happens. This would be far better than the current situation, where the database happily accepts connections which then just hang/freeze forever. I think this bug is particularly evil because of exactly that property: we use Galera specifically to have an automatic failover if something goes wrong. This bug stalls the database, and because it isn't properly shut down, the second node cannot take over and hangs as well.

            danblack Daniel Black added a comment - - edited

            Please try to get a backtrace.

            Also, what Galera version are you running? Have you tried 25.3.25?

            edit: camera -> galera (groan - phone autospell)

            rgpublic Ranjan Ghosh added a comment -

            You meant Galera version, I assume? Well, I use the one that comes with the OS (Ubuntu) AFAICT:

            ~ # dpkg -l | grep galera
            ii  galera-3                             25.3.19-xenial                       amd64        Replication framework for transactional applications
            

            I've also added the MariaDB PPAs but the most recent version I can find there is 25.3.20. How can I upgrade to a newer version? Is there a separate PPA?

            Furthermore, I understand why you'd like to have a backtrace, although it's obviously very, very difficult to get in this case. I'd first need to enable the second node again. Then I'd need to compile a separate MariaDB version. Then wait until it hangs. It doesn't crash; it only hangs. Sometimes this happens only once every 1-2 days, sometimes in the very early morning hours. Then I would need to send it a "kill -6", if I understand correctly, and grab the backtrace. As this is a live server with important websites on it, that's not an easy feat. I'll need some time to make that possible. In the meantime: any insights into why this might happen would be much appreciated. Right now, I'm totally in the dark about what I should avoid or watch out for...

            rgpublic Ranjan Ghosh added a comment -

            OK, I wrote a small Bash script as an intermediate step (see below, systemd required). Perhaps this is also useful to others. I will add it to the crontab to run every minute. It watches the server and shuts down the node if the problem occurs. This hopefully makes it possible again to run with two nodes unsupervised, because the system should recover automatically. Now, next step: compile MariaDB with debug symbols. Let's see how this works out...

            #!/bin/bash
             
            # Don't do anything if MariaDB isn't running or is already in the process of shutting down
            if ! (systemctl status mariadb | grep --quiet 'active (running)'); then
                    echo "MariaDB not running";
                    exit;
            fi
             
            # Get the start date/time of MariaDB and the date/time of 10 minutes ago
            ts_start=$(systemctl show -p ExecMainStartTimestamp mariadb | cut -d "=" -f 2 | awk '{print $2 " " $3}')
            ts_ago=$(date --date="10 minutes ago" +"%Y-%m-%d %H:%M:%S")
             
            # Use whichever is greater as the log start date (i.e. get the logs only for the current run, but only the most recent 10 minutes)
            if [[ "$ts_start" > "$ts_ago" ]]; then
                    ts=$ts_start;
            else
                    ts=$ts_ago;
            fi
             
            # If the log contains "BF lock wait long", the database is frozen and needs to be shut down.
            if (journalctl -u mariadb --since "$ts" | grep --quiet "BF lock wait long"); then
                    echo "BF lock wait detected! Shutting down MariaDB.";
                    systemctl --no-block stop mariadb
            fi
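
            For completeness, here is a minimal sketch of how such a watchdog could be scheduled to run every minute; the script path and log file below are assumptions, not part of the original comment:

            # Hypothetical: append an entry to root's crontab so the watchdog runs once a minute.
            ( crontab -l 2>/dev/null; echo '* * * * * /usr/local/sbin/galera_watchdog.sh >> /var/log/galera_watchdog.log 2>&1' ) | crontab -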
            

            rgpublic Ranjan Ghosh added a comment -

            @danblack: I hope it will be sufficient to build MariaDB with just the additional debug symbols to get the backtrace, right? I won't need the 30% slower full-debug version? Obviously, I'd like to avoid that, as it is a production server. I'm referring to this page here:

            https://mariadb.com/kb/en/mariadb/compiling-mariadb-for-debugging/

            danblack Daniel Black added a comment - - edited

            There are debug symbol packages for the existing install in the MariaDB repos that should be installable, e.g.:
            http://mirror.aarnet.edu.au/pub/MariaDB/repo/10.2/ubuntu/pool/main/m/mariadb-10.2/mariadb-server-core-10.2-dbgsym_10.2.6+maria~xenial_amd64.deb

            Don't forget the Galera symbols too:
            http://mirror.aarnet.edu.au/pub/MariaDB/repo/10.2/ubuntu/pool/main/g/galera-3/galera-3-dbg_25.3.20-xenial_amd64.deb

            Otherwise use -DCMAKE_BUILD_TYPE=RelWithDebInfo and it won't be any slower.

            Nice script. An alternate way:

            journalctl -u mariadb -f | while read line
            do
              if [[ $line =~ "BF lock wait long" ]] ....
            done
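
            One way that loop could be completed, as a sketch (the shutdown action mirrors the cron-based script above; it is an assumption, not part of the original suggestion):

            #!/bin/bash
            # Follow the MariaDB journal and shut this node down as soon as the message appears.
            journalctl -u mariadb -f | while read -r line
            do
              if [[ $line =~ "BF lock wait long" ]]; then
                echo "BF lock wait detected! Shutting down MariaDB."
                systemctl --no-block stop mariadb
                break
              fi
            done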
            

            rgpublic Ranjan Ghosh added a comment -

            OK, I've now installed the dbgsym packages. I wondered if I needed to restart the DB to actually "enable" the symbols. Well, I did it anyway just to be sure.
            D'oh - your script is of course waaaay shorter than mine. Thanks for sharing. I've added the line to create the backtrace, which I got from here:

            https://mariadb.com/kb/en/mariadb/how-to-produce-a-full-stack-trace-for-mysqld/#getting-a-stack-trace-for-a-running-mysqld-server

            Although, I think the docs were a bit wrong/misleading (didn't work for me) and "program" should really be "-p". I now have this line to get the backtrace:

            gdb --batch --eval-command="thread apply all bt" -p $(pgrep -xn mysqld) > /root/mysql_backtrace.log
            

            When I ran it, it created a backtrace most of the time. Sometimes, though, it just hung at the end and I had to kill gdb. Perhaps I was impatient. Hope this won't happen during the BF lock wait.
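
            If gdb hanging during an unattended capture is a concern, one option is to wrap the call in coreutils timeout; a sketch, with an arbitrary two-minute limit:

            # Give gdb at most 120 seconds; send SIGKILL 10 seconds later if it still hasn't exited.
            timeout --kill-after=10 120 \
                gdb --batch --eval-command="thread apply all bt" -p "$(pgrep -xn mysqld)" \
                > /root/mysql_backtrace.log 2>&1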

            OK, now that everything is set up, I guess we'll just have to wait until the rare striped tiger runs in front of the camera :-/
            Thanks again, @danblack, for all the helpful info. It's much appreciated.

            rgpublic Ranjan Ghosh added a comment -

            Ha! Finally the crash happened. The script worked perfectly. The DB was shut down. The backtrace was automatically created. Here it is:

            mysql_backtrace.log

            RaffiOl Ralf Fischer added a comment -

            Exactly the same here: I just upgraded a three-node cluster from 10.1.25 to 10.2.7. After 4 minutes I got "WSREP: BF lock wait long", and the cluster hangs for all updates/inserts until the hanging node is restarted. After less than 5 minutes, I got the same message again. Any good hints for downgrading back to 10.1?

            Debian 7.1,
            Linux mariadb-node1 3.2.0-4-amd64 #1 SMP Debian 3.2.81-2 x86_64 GNU/Linux

            Thanks in advance


            tmihaldinec Tomislav Mihaldinec added a comment -

            Same issue here, but on CentOS 7.3:

            Linux taskdb02 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
            KVM

            Jul 14 14:50:22 vmtaskdb03 mysqld: mysql tables in use 1, locked 1
            Jul 14 14:50:22 vmtaskdb03 mysqld: 5 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
            Jul 14 14:50:22 vmtaskdb03 mysqld: MySQL thread id 2027, OS thread handle 0x7f5196e33700, query id 30214 10.232.10.31 taskitud_ianis query end
            Jul 14 14:50:22 vmtaskdb03 mysqld: INSERT INTO `mw_campaign_track_open` (`campaign_id`, `subscriber_id`, `ip_address`, `user_agent`, `date_added`) VALUES (6351, 2314769, '90.9.29.141', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0', NOW())
            Jul 14 14:50:22 vmtaskdb03 mysqld: TABLE LOCK table `taskitud_database`.`mw_campaign_track_open` trx id 16643769 lock mode IX
            Jul 14 14:50:22 vmtaskdb03 mysqld: TABLE LOCK table `taskitud_database`.`mw_campaign` trx id 16643769 lock mode IS
            Jul 14 14:50:22 vmtaskdb03 mysqld: RECORD LOCKS space id 8 page no 76 n bits 120 index PRIMARY of table `taskitud_database`.`mw_campaign` trx id 16643769 lock mode S locks rec but not gap
            Jul 14 14:50:45 vmtaskdb03 journal: Suppressed 5655 messages from /system.slice/mariadb.service
            Jul 14 14:50:45 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:46 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:47 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:48 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:49 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:50 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:51 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:52 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:53 vmtaskdb03 mysqld: WSREP: BF lock wait long
            Jul 14 14:50:54 vmtaskdb03 mysqld: WSREP: BF lock wait long

            thanks


            canderson Chris Anderson added a comment -

            We have also received this error running on RedHat 7.3, MariaDB-server-10.2.7-1.el7, and galera-25.3.20-1.rhel7. The error only occurred on a system with high load and did not occur in the testing environments, which have a much lower load. Does this bug affect all of 10.2? Does it affect any 10.1.x release?

            rgpublic Ranjan Ghosh added a comment -

            Well, I'm only a poor user - no MariaDB guru, but what I can say is this:

            1) There are some anecdotal reports of people upgrading from 10.1.x to 10.2.x who suddenly encountered this bug. This suggests that the error probably didn't happen on 10.1.

            2) I myself never used 10.1 with Galera, so I can't tell for sure. But I've been using Galera with many earlier 10.2.x versions and always had this bug as far as I can remember. This suggests that all of 10.2 is probably affected, or at least that it won't help to downgrade, say, from 10.2.7 to 10.2.5 or something like that as a workaround.

            3) As stated in one of my previous comments, when researching this message on the web it seems that it does indeed show up in many varieties/versions of MySQL over the years. This suggests that the message might only be a symptom that has/had multiple causes. I assume, though, that the current influx of error reports all have the same underlying reason.

            Up until now, I don't have the slightest idea why it happens. The word "lock" in the error message, however, seems to suggest that it has something to do with locking problems. That is, the more InnoDB locking is going on, the more conflicts/deadlocks occur, and the more likely it could be that the error shows up. This is perhaps why you're experiencing the bug under high load. Just wildly guessing here, of course. On our server, the bug happens at arbitrary times, sometimes in the middle of the night, and not necessarily under exceptionally high load.

            In a previous comment, I posted my MySQL configuration. I'd like to know if someone sees something striking when comparing it to their config. Some correlation - anything.

            I find it strange that there seems to be not much concern about this bug and that the current MariaDB version is even considered "stable". For us, it renders the whole Galera feature totally useless - it's a huge problem. We are currently running without proper failover. The reason for this calm is probably that only very few people are affected by it, even though I assume there is a sizable population out there using a current MariaDB version with Galera - especially now that it's out of beta. This leads to the question: what are they doing differently than us? I'd really like to find out, because otherwise there doesn't seem to be much progress on this.


            canderson Chris Anderson added a comment -

            Thanks, Ranjan, for your response. We also upgraded from an older version of 10.1.x. We were trying to fix another bug and then ran into this much bigger bug; sorry, I didn't include that info. We are trying to assess a downgrade and want to stay above 10.1.23 if possible, but we would like to know beforehand whether the bug exists there, instead of going through all the work of a downgrade and then being surprised by the same bug we are trying to get away from. Hopefully we can get a response from MariaDB to confirm this.

            RaffiOl Ralf Fischer added a comment - - edited

            We used a lot of 10.1 versions up to 10.1.25 in a three-node cluster and never had that error. After upgrading to 10.1.7 we got that error every 5-15 minutes.


            canderson Chris Anderson added a comment -

            Thanks, Ralf, that is very helpful.


            canderson Chris Anderson added a comment -

            Also, Ralf, just to clarify: you got this bug after you upgraded to 10.2.7, correct? You said 10.1.7, and it may just be a typo.

            RaffiOl Ralf Fischer added a comment -

            Yes, sorry. Upgraded from 10.1.25 to 10.2.7...

            chises Michael Weber added a comment -

            Note: we used several versions of MariaDB in the 10.1 major release; that issue never occurred.
            After upgrading to 10.2 we are facing this issue.
            So I think a downgrade to 10.1 can be a workaround.

            dhawton Daniel Hawton added a comment -

            I too am getting this issue: fresh installs of 10.2, no upgrade.

            mrvovanness Vladimir Titov added a comment - - edited

            Got the same issue with a fresh install of mariadb-server v. 10.2.7 from the MariaDB repo. The server OS is Ubuntu 16.04.
            Some related logs here:

            Aug 15 22:38:22 mariadb mysqld[13566]: WSREP: BF lock wait long
            Aug 15 22:38:27 mariadb mysqld[13566]: message repeated 5 times: [ WSREP: BF lock wait long]
            Aug 15 22:38:27 mariadb mysqld[13566]: =====================================
            Aug 15 22:38:27 mariadb mysqld[13566]: 2017-08-15 22:38:27 0x7fbfdb84d700 INNODB MONITOR OUTPUT
            Aug 15 22:38:27 mariadb mysqld[13566]: =====================================
            

            Basically the same log as @Michael Weber posted.

            natrinicle Nathan Bohman added a comment -

            Could this perhaps be related to MDEV-13331? I am experiencing the same issue and noticed that MDEV-13331 was introduced in 10.2.2 and fixed in 10.2.8. I find that I hit the bug if the application tied to our DB (JIRA) attempts to make a ton of updates to the same row at the same time, resulting in a row lock. It would make sense if UPDATE got a similar patch as DROP from upstream and was ignoring innodb_lock_wait_timeout.
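
            For anyone trying to reproduce that scenario, a rough sketch of the kind of contention described above, i.e. concurrent updates to one row from two nodes at once (the table, column and node names are loosely borrowed from the log excerpts in this report, the database name app_db is a placeholder, and client credentials are assumed to come from defaults):

            # Fire concurrent UPDATEs against the same primary key from two cluster nodes.
            for i in $(seq 1 1000); do
                mysql -h node1 -e "UPDATE app_db.devices SET last_contact = UNIX_TIMESTAMP() WHERE id = 145" &
                mysql -h node2 -e "UPDATE app_db.devices SET last_contact = UNIX_TIMESTAMP() WHERE id = 145" &
            done
            wait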

            rgpublic Ranjan Ghosh added a comment -

            Very interesting idea. Makes a whole lot of sense to me. There is a DELETE CASCADE. It locks. It never times out due to the bug. The lock is never released. More and more stuff that's supposed to be replicated is piling up due to the lock. Boom!

            Unfortunately I cannot upgrade our cluster to 10.2.8 at the moment - perhaps in the forthcoming weeks. It would be really helpful, though, if someone else could verify this earlier, especially those whose cluster shuts down after a few minutes. With our cluster, I'd probably have to wait for days to see whether the error is really resolved or not.

            rgpublic Ranjan Ghosh added a comment -

            Ah. Bummer! I used this call on my database:

            SELECT concat(table_name, '.', column_name) AS 'foreign key',
                   concat(referenced_table_name, '.', referenced_column_name) AS 'references'
            FROM information_schema.key_column_usage
            WHERE referenced_table_name IS NOT NULL;
            

            This is supposed to show all tables with foreign keys. Unfortunately, there are none. So MDEV-13331 cannot be the only reason, at least AFAICT, because my DB keeps crashing anyway.

            RaffiOl Ralf Fischer added a comment - - edited

            Just checked 10.2.8 with our three-node cluster: as soon as we write to two or more nodes instead of only one, we get the same error as with the other 10.2 versions.

            Aug 24 07:22:46 mysql-node2 mysqld: WSREP: BF lock wait long
            Aug 24 07:22:46 mysql-node2 mysqld:
            Aug 24 07:22:46 mysql-node2 mysqld: =====================================
            Aug 24 07:22:46 mysql-node2 mysqld: 2017-08-24 07:22:46 0x7f4e89a4b700 INNODB MONITOR OUTPUT
            Aug 24 07:22:46 mysql-node2 mysqld: =====================================
            Aug 24 07:22:46 mysql-node2 mysqld: Per second averages calculated from the last 15 seconds
            Aug 24 07:22:46 mysql-node2 mysqld: -----------------
            Aug 24 07:22:46 mysql-node2 mysqld: BACKGROUND THREAD
            Aug 24 07:22:46 mysql-node2 mysqld: -----------------
            Aug 24 07:22:46 mysql-node2 mysqld: srv_master_thread loops: 40245 srv_active, 0 srv_shutdown, 25086 srv_idle
            Aug 24 07:22:46 mysql-node2 mysqld: srv_master_thread log flush and writes: 65331
            Aug 24 07:22:46 mysql-node2 mysqld: ----------
            Aug 24 07:22:46 mysql-node2 mysqld: SEMAPHORES
            Aug 24 07:22:46 mysql-node2 mysqld: ----------
            Aug 24 07:22:46 mysql-node2 mysqld: OS WAIT ARRAY INFO: reservation count 520504
            Aug 24 07:22:46 mysql-node2 mysqld: OS WAIT ARRAY INFO: signal count 1030299
            Aug 24 07:22:46 mysql-node2 mysqld: RW-shared spins 0, rounds 1439898, OS waits 149766
            Aug 24 07:22:46 mysql-node2 mysqld: RW-excl spins 0, rounds 14688668, OS waits 112658
            Aug 24 07:22:46 mysql-node2 mysqld: RW-sx spins 410293, rounds 5591848, OS waits 49179
            Aug 24 07:22:46 mysql-node2 mysqld: Spin rounds per wait: 1439898.00 RW-shared, 14688668.00 RW-excl, 13.63 RW-sx
            Aug 24 07:22:46 mysql-node2 mysqld: ------------------------
            Aug 24 07:22:46 mysql-node2 mysqld: LATEST FOREIGN KEY ERROR
            Aug 24 07:22:46 mysql-node2 mysqld: ------------------------
            Aug 24 07:22:46 mysql-node2 mysqld: 2017-08-24 06:50:26 0x7f53fe4e9700 Transaction:
            Aug 24 07:22:46 mysql-node2 mysqld: TRANSACTION 3020224898, ACTIVE 0 sec inserting
            Aug 24 07:22:46 mysql-node2 mysqld: mysql tables in use 2, locked 2
            Aug 24 07:22:46 mysql-node2 mysqld: 5 lock struct(s), heap size 1136, 2 row lock(s)
            Aug 24 07:22:46 mysql-node2 mysqld: MySQL thread id 2, OS thread handle 0x7f53fe4e9700, query id 5813680 copy to tmp table
            Aug 24 07:22:46 mysql-node2 mysqld: ALTER TABLE `catalog_product_flat_1` ADD CONSTRAINT `FK_CAT_PRD_FLAT_1_ENTT_ID_CAT_PRD_ENTT_ENTT_ID` FOREIGN KEY (`entity_id`) REFERENCES `catalog_product_entity` (`entity_id`) ON DELETE CASCADE ON UPDATE CASCADE
            Aug 24 07:22:46 mysql-node2 mysqld: Foreign key constraint fails for table `magento_felt_b2b`.`#sql-4c6d_2`:
            Aug 24 07:22:46 mysql-node2 mysqld: ,
            Aug 24 07:22:46 mysql-node2 mysqld: CONSTRAINT `FK_CAT_PRD_FLAT_1_ENTT_ID_CAT_PRD_ENTT_ENTT_ID` FOREIGN KEY (`entity_id`) REFERENCES `catalog_product_entity` (`entity_id`) ON DELETE CASCADE ON UPDATE CASCADE in parent table, in index PRIMARY tuple:

            natrinicle Nathan Bohman added a comment -

            Sorry, I should have been more clear. 10.2.8 fixed a bug that was pulled in from upstream MySQL that caused MariaDB to ignore timeouts for DROP operations. I'm wondering if a similar piece of code was pulled in at the same time for other operations like UPDATE and SELECT?
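            (Purely as a side note, not from Nathan's comment: when comparing behaviour between versions or nodes, it can help to dump the relevant lock-wait settings on each node and diff the results. These are standard server variables; the snippet is only an illustration.)

            # run on each node; the timeouts should be identical across the cluster
            mysql -e "SHOW GLOBAL VARIABLES WHERE Variable_name IN
                      ('innodb_lock_wait_timeout', 'lock_wait_timeout',
                       'wsrep_retry_autocommit', 'wsrep_provider_options');"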

            rgpublic Ranjan Ghosh added a comment -

            Well, the issue you mentioned earlier, MDEV-13331, is only about DELETE CASCADE with FOREIGN KEYS (FK). At least this is what it says in the title. And this is, indeed, the bug Ralf sent the logs for. Although it seems to happen to him with 10.2.8 as well (@Ralf: perhaps you'd like to mention over there as well that it's still not fixed for you). For my part, I'm not aware of using any foreign keys at all (see my previous comment on how I figured this out). So this bug here is at least not exclusively about foreign keys. I now understand your idea is: try to find similar pulled-in code for other operations... It's difficult to find it, though, without knowing specifically what exactly we're searching for...


            cio@zakupka.com Dutchak Vitalij added a comment -

            Same bug after upgrading to version 10.2.7.

            cio@zakupka.com Dutchak Vitalij added a comment -

            Does 10.2.8 have the same issue?
            mneumann Michael Neumann added a comment - - edited

            I can confirm that the problem still exists in 10.2.8:

            Sep 19 08:34:50 dbm-ace-live-data-03 mysqld: 2017-09-19  8:34:50 140484532815616 [Warning] WSREP: Failed to report last committed 10768816, -4 (Interrupted system call)
            Sep 19 08:34:54 dbm-ace-live-data-03 mysqld: 2017-09-19  8:34:54 140484532815616 [Warning] WSREP: Failed to report last committed 10768818, -4 (Interrupted system call)
            Sep 19 08:35:01 dbm-ace-live-data-03 CRON[1269]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
            Sep 19 08:35:02 dbm-ace-live-data-03 mysqld: 2017-09-19  8:35:02 140484532815616 [Warning] WSREP: Failed to report last committed 10768925, -4 (Interrupted system call)
            Sep 19 08:35:51 dbm-ace-live-data-03 mysqld: WSREP: BF lock wait long
            Sep 19 08:36:00 dbm-ace-live-data-03 mysqld: message repeated 9 times: [ WSREP: BF lock wait long]
            Sep 19 08:36:01 dbm-ace-live-data-03 CRON[1361]: (root) CMD (/bin/chmod -R a+r /var/log >/dev/null 2>&1)
            Sep 19 08:36:01 dbm-ace-live-data-03 mysqld: WSREP: BF lock wait long
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: message repeated 5 times: [ WSREP: BF lock wait long]
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld:
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: =====================================
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: 2017-09-19 08:36:06 0x7fc3dc748700 INNODB MONITOR OUTPUT
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: =====================================
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: Per second averages calculated from the last 50 seconds
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: -----------------
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: BACKGROUND THREAD
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: -----------------
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: srv_master_thread loops: 333435 srv_active, 0 srv_shutdown, 144764 srv_idle
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: srv_master_thread log flush and writes: 478163
            Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: ----------
            

            So, is it an option to downgrade to 10.1?

            chises Michael Weber added a comment -

            So "good news" for me..
            since version 10.2.9 we have the cluster running without any issues for >5 days.

            mysql --version
            mysql  Ver 15.1 Distrib 10.2.9-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
             
            systemctl status mysql
            ● mariadb.service - MariaDB database server
               Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
              Drop-In: /etc/systemd/system/mariadb.service.d
                       └─migrated-from-my.cnf-settings.conf
               Active: active (running) since Thu 2017-09-28 16:27:02 CEST; 5 days ago
            

            I don't quite trust it yet, but I hope that this version fixed it for me/us.

            Any experience with this version from other users?

            mneumann Michael Neumann added a comment - - edited

            No good news for me, the problem still exists:

            dbm-ace-pre-02:~ # mysql --version
            mysql  Ver 15.1 Distrib 10.2.9-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
            ....
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: WSREP: BF lock wait long
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld:
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: =====================================
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: 2017-10-05 09:45:58 0x7fb33bcf5700 INNODB MONITOR OUTPUT
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: =====================================
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: Per second averages calculated from the last 30 seconds
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: -----------------
            Oct  5 09:45:59 dbm-ace-pre-02 mysqld: BACKGROUND THREAD
            

            My setup consists of three data nodes with two MaxScale proxies in front of them and one hardware load balancer as the entry point.

            I can very easily reproduce the problem when I switch the two MaxScale proxies to "Galera mode", i.e. writes to all nodes allowed.
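            (To make the "Galera mode" wording concrete, here is an illustrative sketch only, not Michael's actual configuration: with MaxScale, multi-writer mode is typically a readconnroute service restricted to synced Galera nodes, whereas a readwritesplit service keeps all writes on a single node, which matches the reports above that single-node writes do not trigger the stall.)

            # maxscale.cnf (sketch; server names and credentials are placeholders)
            [Galera-Monitor]
            type=monitor
            module=galeramon
            servers=node1,node2,node3
            user=maxmon
            password=maxmon_pw

            # multi-writer: connections (and therefore writes) can land on any synced node
            [Galera-RW-Service]
            type=service
            router=readconnroute
            router_options=synced
            servers=node1,node2,node3
            user=maxuser
            password=maxuser_pw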

            rgpublic Ranjan Ghosh added a comment -

            Hm. Right, still unsolved with the most recent version. Sigh. I have to admit I had faintly hoped I might return from my vacation and find this bug finally solved. What I don't understand is that nobody on the core dev team seems to even be interested in this. Don't get me wrong - as a developer, I know that there are difficult bugs that take a lot of time to figure out etc. But there is not even a reaction, nothing. I think this is a huge bug that in fact renders Galera completely useless. If the database locks up once per day (some even report it happening every few minutes) and requires manual intervention, it is - let's be honest - simply not usable on a production server. And I don't need Galera on a dev system. So you cannot use it at all. But the integrated Galera was one of the main selling points of using MariaDB over MySQL. At least it was for us. For a long time I thought: "OK, it's a beta version. Just wait. They'll certainly fix this in the future." Now MariaDB 10.2.x is even considered "stable", AFAIK. Nothing could be further from the truth.

            mneumann Michael Neumann added a comment - - edited

            You took the words right out of my mouth. The reason we gave MariaDB (Galera) a shot is the write scalability. We tested it with one of our key systems and ran into this bug. The combination with MaxScale seemed so promising. Now we are moving to our Oracle RAC system.


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi Everyone!

            I am assigned to this bug, but I have been very busy with other bugs. I will look into this bug this week and hopefully come up with a workaround / patch.

            • Sachin
            RaffiOl Ralf Fischer added a comment -

            Hi Sachin, that's great! Thanks in advance for your support!

            anikitin Andrii Nikitin (Inactive) added a comment - - edited

            I was able to reproduce the problem with the script below, which downloads and unpacks 10.2.9, configures 3 nodes on ports 3307, 3308 and 3309, and applies some concurrent write load:

            set -e
            ver=${1:-10.2.9}
            # just use current directory if called from framework
            if [ ! -f common.sh ] ; then
              [ -d mariadb-environs ] || git clone http://github.com/AndriiNikitin/mariadb-environs
              cd mariadb-environs
              ./get_plugin.sh galera
            fi
             
            _template/plant_cluster.sh cluster1
            echo m1 > cluster1/nodes.lst
            echo m2 >> cluster1/nodes.lst
            echo m3 >> cluster1/nodes.lst
            cluster1/replant.sh ${ver}
             
            ./build_or_download.sh m1
             
            cluster1/gen_cnf.sh innodb_buffer_pool_size=1G \
                    innodb_log_file_size=256M \
                    innodb_flush_log_at_trx_commit=2 \
                    innodb_lock_wait_timeout=10 \
                    transaction-isolation=READ-COMMITTED \
                    innodb_stats_persistent=0
            cluster1/install_db.sh
            # cluster1/galera_setup_acl.sh # this needed for mysqldump
            cluster1/galera_start_new.sh wsrep_sst_method=rsync
             
            sleep 10
            cluster1/galera_cluster_size.sh
             
            for i in {1..5}; do
              m1*/sql.sh create table t$i select seq as a, seq as b from seq_1_to_10000
              m1*/sql.sh "alter table t$i add unique index(a)"
            done
             
            trap 'kill $(jobs -p)' EXIT
             
            for m in {1..3}; do
              for i in {1..5}; do
                export I=$i
                export M=$m
                ( while : ; do m$M*/sql.sh "start transaction; update t$I set b=b+1 where a=$I; delete from t$I where a=2*$I; do sleep(10); commit" 2>/dev/null || : ; done; )&
              done
              ( while : ; do
                for i in {1..30}; do m$M*/sql.sh "delete from t$i where a=floor(rand()*10000);" &>/dev/null ; done
              done ) &
              ( while : ; do
                for i in {1..30}; do m$M*/sql.sh "delete from t$i where a=floor(rand()*10000);" &>/dev/null ; done
              done ) &
            done
             
            while : ; do
              m1*/sql.sh show processlist
              m2*/sql.sh show processlist
              m3*/sql.sh show processlist
              grep 'BF lock wait long' m1*/dt/error.log | wc -l
              grep 'BF lock wait long' m2*/dt/error.log | wc -l
              grep 'BF lock wait long' m3*/dt/error.log | wc -l
              sleep 10
            done
            

            The output of the write SQL statements is suppressed; only the processlist for each node is displayed, followed by the number of "BF lock wait long" occurrences in each node's error log.
            After a few minutes I can see problems in the Galera committer threads:

            0
            0
            0
            2	system user		NULL	Sleep	99	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	10	committed 25	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            43	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            44	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            45	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            46	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            47	root	localhost	test	Query	9	updating	delete from t4 where a=floor(rand()*10000)	0.000
            48	root	localhost	test	Query	9	updating	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            50	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	97	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	9	committed 30	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            27	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            29	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            30	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            31	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            32	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            33	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	89	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	20	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	20	init	commit	0.000
            14	root	localhost	test	Query	20	init	commit	0.000
            15	root	localhost	test	Query	30	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	15	init	commit	0.000
            21	root	localhost	test	Query	10	init	commit	0.000
            25	root	localhost	test	Query	9	updating	update t1 set b=b+1 where a=1	0.000
            26	root	localhost	test	Query	0	init	show processlist	0.000
            0
            0
            10
            2	system user		NULL	Sleep	109	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	9	Update_rows_log_event::find_row(35)	update t4 set b=b+1 where a=4	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            47	root	localhost	test	Query	19	query end	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	9	init	commit	0.000
            51	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            52	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            53	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            54	root	localhost	test	Query	9	updating	update t4 set b=b+1 where a=4	0.000
            55	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	107	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	9	committed 40	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            34	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            35	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            36	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            37	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            38	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            39	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	99	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	30	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	30	init	commit	0.000
            14	root	localhost	test	Query	30	init	commit	0.000
            15	root	localhost	test	Query	40	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	25	init	commit	0.000
            21	root	localhost	test	Query	20	init	commit	0.000
            27	root	localhost	test	Query	8	updating	update t1 set b=b+1 where a=1	0.000
            28	root	localhost	test	Query	0	init	show processlist	0.000
            0
            0
            20
            2	system user		NULL	Sleep	119	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	19	Update_rows_log_event::find_row(35)	update t4 set b=b+1 where a=4	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            47	root	localhost	test	Query	29	query end	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	19	init	commit	0.000
            51	root	localhost	test	Query	9	init	commit	0.000
            52	root	localhost	test	Query	9	init	commit	0.000
            53	root	localhost	test	Query	9	init	commit	0.000
            56	root	localhost	test	Query	8	updating	update t4 set b=b+1 where a=4	0.000
            57	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	117	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	19	committed 40	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            40	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            41	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            42	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            43	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            44	root	localhost	test	Query	9	User sleep	do sleep(10)	0.000
            45	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	109	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	40	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	40	init	commit	0.000
            14	root	localhost	test	Query	40	init	commit	0.000
            15	root	localhost	test	Query	50	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	35	init	commit	0.000
            21	root	localhost	test	Query	30	init	commit	0.000
            29	root	localhost	test	Query	7	updating	update t1 set b=b+1 where a=1	0.000
            30	root	localhost	test	Query	0	init	show processlist	0.000
            9
            0
            30
            2	system user		NULL	Sleep	129	wsrep aborter idle	NULL	0.000
            1	system user		NULL	Sleep	29	Update_rows_log_event::find_row(35)	update t4 set b=b+1 where a=4	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            47	root	localhost	test	Query	39	query end	delete from t4 where a=floor(rand()*10000)	0.000
            49	root	localhost	test	Query	29	init	commit	0.000
            51	root	localhost	test	Query	20	init	commit	0.000
            52	root	localhost	test	Query	19	init	commit	0.000
            53	root	localhost	test	Query	19	init	commit	0.000
            58	root	localhost	test	Query	7	updating	update t4 set b=b+1 where a=4	0.000
            59	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	127	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	29	committed 40	NULL	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            40	root	localhost	test	Query	9	init	commit	0.000
            41	root	localhost	test	Query	9	init	commit	0.000
            42	root	localhost	test	Query	9	init	commit	0.000
            43	root	localhost	test	Query	9	init	commit	0.000
            44	root	localhost	test	Query	9	init	commit	0.000
            46	root	localhost	test	Query	0	init	show processlist	0.000
            1	system user		NULL	Sleep	119	wsrep aborter idle	NULL	0.000
            2	system user		NULL	Sleep	50	Update_rows_log_event::find_row(13)	update t1 set b=b+1 where a=1	0.000
            3	system user		NULL	Daemon	NULL	InnoDB purge coordinator	NULL	0.000
            4	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            5	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            6	system user		NULL	Daemon	NULL	InnoDB purge worker	NULL	0.000
            7	system user		NULL	Daemon	NULL	InnoDB shutdown handler	NULL	0.000
            12	root	localhost	test	Query	50	init	commit	0.000
            14	root	localhost	test	Query	50	init	commit	0.000
            15	root	localhost	test	Query	60	query end	delete from t1 where a=floor(rand()*10000)	0.000
            20	root	localhost	test	Query	45	init	commit	0.000
            21	root	localhost	test	Query	40	init	commit	0.000
            31	root	localhost	test	Query	6	updating	update t1 set b=b+1 where a=1	0.000
            32	root	localhost	test	Query	0	init	show processlist	0.000
            19
            0
            40
            

            sachin.setiya.007, let me know if this script is OK or if I should work on an mtr case with similar load.


            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi anikitin!

            I tried to run this script on jessie (buildbot). One hour run, no "BF lock wait long". The thing is, the script is not deterministic (and I think the bug is as well). If we could have an mtr case which simulates it for sure, that would be great.
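            (As a possible starting point for the requested mtr case, here is only a sketch in the style of the existing galera test suite, assuming the standard include/galera_cluster.inc setup with connections node_1 and node_2; a real reproducer for the stall would most likely need far more concurrency than this.)

            --source include/galera_cluster.inc
            --source include/have_innodb.inc

            --connection node_1
            CREATE TABLE t1 (id BIGINT PRIMARY KEY, status INT) ENGINE=InnoDB;
            INSERT INTO t1 VALUES (716071007, 0);

            # Local transaction on node_1 holds the row lock ...
            --connection node_1
            START TRANSACTION;
            UPDATE t1 SET status = 100 WHERE id = 716071007;

            # ... while node_2 commits a conflicting update that replicates to node_1
            --connection node_2
            UPDATE t1 SET status = 101 WHERE id = 716071007;

            # The applier (BF) transaction should win; the local COMMIT may be aborted
            --connection node_1
            --error 0,ER_LOCK_DEADLOCK
            COMMIT;

            DROP TABLE t1;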
            Sukan Sukan added a comment -

            Hi Sachin,

            The scenario below should help you reproduce the bug.

            1. On one node, updates for a particular ID are fired continuously.
            2. On another node, I set a different value using the same WHERE clause.

            I have a table dummy, where I generated concurrent updates on node 1 and node 2 at the same time.

            *Node 1* update dummy set status=100 where id='716071007'; // This runs with 500 concurrency and 100 iterations
            *Node 2* update dummy set status=101 where id=716071007;  // This runs with 500 concurrency and 100 iterations
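
            (The report does not say which tool fires these updates; purely as an illustration, a comparable 500-way load could be generated with mysqlslap, with hostnames and credentials as placeholders.)

            # Node 1: 500 concurrent clients, 100 iterations
            mysqlslap --host=node1 --user=sbtest -p --create-schema=test \
                      --concurrency=500 --iterations=100 \
                      --query="update dummy set status=100 where id='716071007'"
            # Node 2: same load with a different value, started at the same time
            mysqlslap --host=node2 --user=sbtest -p --create-schema=test \
                      --concurrency=500 --iterations=100 \
                      --query="update dummy set status=101 where id=716071007"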
            

            GC should have thrown a deadlock error, but instead it keeps accepting the queries.

            On node 2, when the query from node 1 replicates while a direct UPDATE hits the same row, it throws the error below.

            WSREP: BF lock wait long
            WSREP: BF lock wait long
            

            This makes the cluster stall, and we need to forcefully kill it to bring it back.

            NOTE: Tested in 10.2.8 and 10.2.9, both producing "BF lock wait long", but not in 10.1.26.

            Steps to reproduce :

            CREATE TABLE `dummy` (
              `id` bigint(20) NOT NULL AUTO_INCREMENT,
              service varchar(20),
              status int(11),
              PRIMARY KEY (`id`)
            ) ENGINE=InnoDB AUTO_INCREMENT=716071002 DEFAULT CHARSET=utf8;
            Query OK, 0 rows affected (0.43 sec)

            MariaDB [test]> insert into dummy (service,status) values ('MySQL',1),('MSSQL',2),('MONGODB',3);
            Query OK, 3 rows affected (0.13 sec)
            Records: 3 Duplicates: 0 Warnings: 0

            *Generate concurrent updates*

            *Node 1* update dummy set status=100 where id='716071007';
            *Node 2* update dummy set status=101 where id=716071007;
            

            Process list :
            ------------------------------------------------------------------------------------------------------------------------------------------

            Id User Host db Command Time State Info Progress

            ------------------------------------------------------------------------------------------------------------------------------------------

            1 system user   NULL Sleep 4 Update_rows_log_event::find_row(718) update dummy set status=100 where id='716071007' 0.000
            2 system user   NULL Sleep 501 wsrep aborter idle NULL 0.000
            6 system user   NULL Daemon NULL InnoDB purge coordinator NULL 0.000
            3 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            4 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            5 system user   NULL Daemon NULL InnoDB purge worker NULL 0.000
            7 system user   NULL Daemon NULL InnoDB shutdown handler NULL 0.000
            11 monitor localhost:44902 NULL Sleep 0   NULL 0.000
            17 root localhost test Query 0 init show processlist 0.000
            22 sbtest localhost:46666 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            23 sbtest localhost:46696 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            24 sbtest localhost:46850 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            25 sbtest localhost:46854 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            26 sbtest localhost:46856 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            36 sbtest localhost:47108 test Query 4 updating update dummy set status=100 where id='716071007' 0.000
            38 sbtest localhost:47144 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            39 sbtest localhost:47182 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            37 sbtest localhost:47140 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            40 sbtest localhost:47186 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            42 sbtest localhost:47222 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            41 sbtest localhost:47202 test Query 4 updating update dummy set status=100 where id='716071007' 0.000
            43 sbtest localhost:47232 test Query 4 updating update dummy set status=100 where id='716071007' 0.000
            27 sbtest localhost:46892 test Query 4 query end update dummy set status=100 where id='716071007' 0.000
            30 sbtest localhost:47042 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            31 sbtest localhost:47048 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            32 sbtest localhost:47062 test Query 4 updating update dummy set status=101 where id=716071007 0.000
            33 sbtest localhost:47066 test Query 4 updating update dummy set status=101 where id=716071007 0.000

            Finally, the cluster stalls until we kill MySQL:

            +----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
            | Id | User        | Host            | db   | Command | Time | State                                | Info                                             | Progress |
            +----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
            |  1 | system user |                 | NULL | Sleep   |  144 | Update_rows_log_event::find_row(718) | update dummy set status=100 where id='716071007' |    0.000 |
            |  2 | system user |                 | NULL | Sleep   |  641 | wsrep aborter idle                   | NULL                                             |    0.000 |
            |  6 | system user |                 | NULL | Daemon  | NULL | InnoDB purge coordinator             | NULL                                             |    0.000 |
            |  3 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker                  | NULL                                             |    0.000 |
            |  4 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker                  | NULL                                             |    0.000 |
            |  5 | system user |                 | NULL | Daemon  | NULL | InnoDB purge worker                  | NULL                                             |    0.000 |
            |  7 | system user |                 | NULL | Daemon  | NULL | InnoDB shutdown handler              | NULL                                             |    0.000 |
            | 11 | monitor     | localhost:44902 | NULL | Sleep   |    0 |                                      | NULL                                             |    0.000 |
            | 17 | root        | localhost       | test | Query   |    0 | init                                 | show processlist                                 |    0.000 |
            | 27 | sbtest      | localhost:46892 | test | Query   |  144 | query end                            | update dummy set status=100 where id='716071007' |    0.000 |
            +----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
            
            

            Hope this will help you in addressing the issue.

            -Sukan
            DBA, Mafiree

            ari Ari Maniatis added a comment -

            @sachin.setiya.007 Are you able to reproduce the problem now? I can make it happen on 10.2.7 within a few minutes by running Jira with the loadbalancer JDBC connection across two nodes. This happens with very little load almost right away.

            Since we have no workaround and no way of avoiding the bug, there needs to be some recommendation to avoid Galera entirely with MariaDB 10.2.x. Clearly this isn't ready for production. There are also other important issues to watch out for, since 10.2.9 will not start with Galera because of this bug.

            Is there some way we can get a higher level of attention for Galera issues, or for more automated testing in this area? I think this could be MariaDB's key advantage over the competition if this stuff were just really solid.

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Sorry, I was travelling, but I will try to do it now. I managed to hit the bug with anikitin's script, so I am writing an mtr test and am on the way to a fix.

            tasdinq Tas Dinq added a comment -

            This bug also affects server version 10.2.6-MariaDB-10.2.6+maria~xenial-log (mariadb.org binary distribution) in my production environment. It's driving me crazy; I need to force a restart of the full cluster each time.

            I have a lot of client processes updating a table, sometimes with the same id on different nodes, like:
            UPDATE foo SET foo.bar = 1 or 0 WHERE id=xxx;
            very similar to Sukan's test case.

            Is there any progress on this bug?
            Or any temporary workaround for it?

            Sukan Sukan added a comment -

            Hi Tas,

            I would suggest running the cluster with a single write node: you can point all the write traffic to one node. This would be a temporary workaround without making any big changes.

            Alternatively, you can downgrade the cluster completely to a 10.1.x version, where we ran the test and didn't hit this bug.

            Thanks,
            Sukan
            DBA, Mafiree

            tasdinq Tas Dinq added a comment -

            THANK YOU, Sukan! Thank you for your advice.

            I'm trying to use MaxScale's ReadWriteSplit module to route all writes to a single node; hopefully my life will be easier with this.

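            As an aside for anyone testing this kind of single-writer routing, a small sanity check (a sketch only; these are standard Galera status/variable names, and the values depend on your cluster) is to connect through the proxy from the application side and confirm that every write connection lands on the same node:

            -- Which cluster node is this connection actually on?
            SHOW VARIABLES LIKE 'wsrep_node_name';
            -- Index of this node within the cluster; it should be identical for all write connections.
            SHOW STATUS LIKE 'wsrep_local_index';
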
            sibirsky87 Anton Baykov added a comment -

            Hi Sachin,
            any progress on this bug?

            laocius TAO ZHOU added a comment -

            I hit the error 'BF-BF X lock conflict' even though I am only writing to one node.

            2017-11-09  4:18:49 50152503296 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
            2017-11-09 10:30:37 50153508864 [Note] InnoDB: BF-BF X lock conflict,mode: 1027 supremum: 0
            2017-11-09 10:30:37 50153508864 [Note] InnoDB: conflicts states: my 0 locked 0
            RECORD LOCKS space id 316718 page no 3049 n bits 240 index PRIMARY of table `
            

            pyfrom Mark added a comment -

            Is there any update on this ticket, working towards a fix? There has been a distinct lack of feedback about this issue. Not that I expect a fix right away, but let us know what's happening, and maybe give us a better understanding of what the issue is.

            We have been experiencing this problem even though we are sending all of the traffic to one node only (out of 3). Our application seems to lock up about once a month or so.

            Mark

            sachin.setiya.007 Sachin Setiya (Inactive) added a comment -

            Hi,

            I have been assigned to this issue, but for around 10 days I was on vacation. Currently I am occupied with 10715 and 10177. I will look into this issue after those are done. This issue also requires somewhat more understanding, because I am not familiar with the InnoDB code.

            Regards
            sachin

            ari Ari Maniatis added a comment -

            Hi Sachin

            I know you have your own set of priorities based on your customers, but I'd like to encourage you to look at this bug with higher priority, particularly when compared to building a new feature for 10.3 (MDEV-10177). This bug:

            • completely crashes MariaDB
            • has no workaround
            • affects anyone who is using Galera, even if they disable writing to multiple nodes

            In short, this bug is completely catastrophic for all production use of MariaDB if Galera is enabled. I think MariaDB management should consider pulling the entire 10.2 release from production-ready status and think about how users might roll back to 10.1 if this bug can't be fixed in the next minor 10.2.11 release. This should be on the front page of your site or download page.

            ari Ari Maniatis added a comment - Hi Sachin I know you have your own set of priorities based on your customers, but I'd like to encourage you to look at this bug with higher priority. Particularly when compared to building a new feature for 10.3 ( MDEV-10177 ). This bug: completely crashes mariadb has no workaround affects anyone who is using galera, even if they disable writing to multiple nodes In short, this bug is completely catastrophic for all production use of mariaDB if galera is enabled. I think mariaDB management should consider pulling the entire 10.2 release from production ready status and think about how users might roll back to 10.1 if this bug can't be fixed in the next minor 10.2.11 release. This should be on the front page of your site or download page.
            anikitin Andrii Nikitin (Inactive) added a comment - edited

            The mtr test case is in MDEV-14401, according to instructions from https://jira.mariadb.org/browse/MDEV-12837?focusedCommentId=101543&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-101543

            serg Sergei Golubchik added a comment -

            reassigned to jplindst because seppo has found that

            My debugging session from today shows that the potential smoking gun is
            function lock_grant_and_move_on_page() in lock0lock.cc. This function does not
            honor locking priority policy, and leaves a high priority lock in wait queue
            favoring a normal priority lock.

            jplindst Jan Lindström (Inactive) added a comment -

            When using VATS we favor older transactions; this is bad for Galera, as we should favor BF transactions.

            jplindst Jan Lindström (Inactive) added a comment - https://github.com/MariaDB/server/commit/7d4d54e415c3d6081b71b024532b5ba8f1addc4a

            marko Marko Mäkelä added a comment -

            Please try to find a solution that does not involve adding fields to trx_t. I cannot approve the addition of trx_t::trx_mutex_taken.

            There are many Galera test failures on bb-10.2-MDEV-12837. Can you please rebase the fix on the latest 10.2, and then post a detailed analysis of all test failures?

            jplindst Jan Lindström (Inactive) added a comment - https://github.com/MariaDB/server/commit/9f6dd866c163822e2e3b52345685722c6dec2b2b

            marko Marko Mäkelä added a comment -

            The change looks much cleaner now, but I’d like to have one more review round.

            marko Marko Mäkelä added a comment -

            I would like to see a revised test push to buildbot that contains only this change, based on the latest 10.2. The current bb-10.2-MDEV-12837 contains some commits related to MDEV-14401. It seems that MDEV-14401 actually depends on this MDEV-12837. It looks like the changes to lock_rec_lock_slow() and lock_table() in the initial MDEV-14401 patch have been made redundant by the MDEV-12837 change to innobase_init().

            jplindst Jan Lindström (Inactive) added a comment - edited

            Split the commits into two parts: https://github.com/MariaDB/server/commit/1ffd8b30f9f86aedd42c91f72db9d54f2f8576d7
            The test case is in https://github.com/MariaDB/server/commit/7f555777b36983d4b4590df6634aa22170e90bc3, because it requires both fixes.

            marko Marko Mäkelä added a comment -

            I sent some review comments to the first commit. I still think that we must convert the algorithm at InnoDB startup, not while InnoDB is already open for SQL.
            Where is this running on Buildbot?

            jplindst Jan Lindström (Inactive) added a comment -

            bb-10.2-galera

            marko Marko Mäkelä added a comment -

            I now see that bb-10.2-galera consists of two commits on top of 10.2: MDEV-12837 and MDEV-14401.
            If the MDEV-12837 fix were working as requested (adjusting the configuration at InnoDB startup), then I think that all the configuration changes

            innodb-lock-schedule-algorithm=fcfs
            

            in the MDEV-14401 fix should be unnecessary and should be removed. The default VATS should be fine. We should not even need to suppress any warning; at startup, InnoDB could issue an informational note (not warning) that the incompatible parameter was adjusted.

            One more question: Why does the MDEV-14401 commit disable the following occasionally failing tests:

            diff --git a/mysql-test/suite/galera/disabled.def b/mysql-test/suite/galera/disabled.def
            index 423c83208f8..801b12a1f84 100644
            --- a/mysql-test/suite/galera/disabled.def
            +++ b/mysql-test/suite/galera/disabled.def
            @@ -56,3 +56,8 @@ MW-284 : MDEV-13549 Galera test failures 10.1
             galera_as_slave : MDEV-13549 Galera test failures 10.1
             galera_var_innodb_disallow_writes : MDEV-10949
             galera_kill_applier : race condition at the start of the test
            +MW-328C: MDEV-13549 Galera test failures 10.1
            +MW-328A: MDEV-13549 Galera test failures 10.1
            +MW-328B: MDEV-13549 Galera test failures 10.1
            +MW-328: MDEV-13549 Galera test failures 10.1
            +galera_suspend_slave: MDEV-13549 Galera test failures 10.1
            \ No newline at end of file
            

            The commit message does not explain this.
            I think that if it is considered necessary to disable these tests, they should be disabled in a separate commit, with a reference to separate tasks, such as MDEV-14149 or MDEV-13876 or even MDEV-13549.

            jplindst Jan Lindström (Inactive) added a comment -

            commit da3a3a68df34c7fef387ce890d3925166edeef2c
            Author: Jan Lindström <jan.lindstrom@mariadb.com>
            Date: Thu Dec 7 12:26:29 2017 +0200

            MDEV-12837: WSREP: BF lock wait long

            Problem was a merge error from MySQL wsrep i.e. Galera.

            wsrep_on_check
            New check function. Galera can't be enabled
            if innodb-lock-schedule-algorithm=VATS.

            innobase_kill_query
            In Galera async kill we could own lock mutex.

            innobase_init
            If Variance-Aware-Transaction-Sheduling Algorithm (VATS) is
            used on Galera we fall back to First-Come-First-Served (FCFS)
            with notice to user.

            Changed innodb-lock-schedule-algorithm as read-only parameter
            as it was designed to be.

            lock_reset_lock_and_trx_wait
            Use ib::hex() to print out transaction ID.

            lock_rec_other_has_expl_req,
            lock_rec_other_has_conflicting,
            RecLock::add_to_waitq
            lock_rec_lock_slow
            lock_table_other_has_incompatible
            lock_rec_insert_check_and_lock
            lock_prdt_other_has_conflicting

            Change pointer to conflicting lock to normal pointer as this
            pointer contents could be changed later.

            RecLock::create
            Conclicting lock pointer is moved to last parameter with
            default value NULL. This conflicting transaction could
            be selected as victim in Galera if requesting transaction
            is BF (brute force) transaction. In this case contents
            of conflicting lock pointer will be changed. Use ib::hex() to print
            transaction ids.


            jplindst Jan Lindström (Inactive) added a comment - edited

            greenman Can you document that innodb-lock-schedule-algorithm is a read-only variable (i.e. it can't be changed dynamically) and that the VATS method is not supported on Galera?

            Here is what the documentation should mention:

            • In 10.1, the default is FCFS, and InnoDB will refuse to start if VATS is used with Galera.
            • In >=10.2, the default is VATS, and with Galera it will be changed to FCFS.
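
            For reference, a minimal way to check this behaviour on a running server (a sketch only; these are the variable names used in MariaDB 10.1/10.2, and the output depends on your configuration):

            -- Is Galera enabled on this node?
            SHOW GLOBAL VARIABLES LIKE 'wsrep_on';
            -- With Galera enabled and the fix applied, this should report FCFS even if VATS was requested.
            SHOW GLOBAL VARIABLES LIKE 'innodb_lock_schedule_algorithm';
            -- The variable is read-only, so changing it means editing my.cnf and restarting the server.
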
            rgpublic Ranjan Ghosh added a comment -

            Wowwwww. Incredible. After many months of suffering, finally a MariaDB version that really deserves the label "stable"... perhaps before Christmas? Can it be true? Don't tell me I'm dreaming. Now we're all waiting anxiously for the binary to arrive in the repos... Thanks to everyone involved and, especially, to Jan for solving this.

            ari Ari Maniatis added a comment -

            @jplindst Was there supposed to be a commit to the 10.2 branch as well? I could only see one to the 10.1 branch (where I think the bug wasn't manifesting) and to 10.3.

            https://github.com/MariaDB/server/search?q=MDEV-12837&type=Commits&utf8=%E2%9C%93

            Are you able to confirm this will land in 10.2.12? That will help us plan our testing for being able to use galera cluster in production.

            serg Sergei Golubchik added a comment -

            No, there wasn't supposed to be a separate 10.2 commit: every 10.1 commit will inevitably land in 10.2 eventually.

            Normally this happens at least before the next 10.2 release, so this commit will be in the 10.2.12, just like "Fix Version/s" field at the top of the page says.

            jplindst Jan Lindström (Inactive) added a comment -

            Hi, there seems to be some kind of information loss here: commit da3a3a68df34c7fef387ce890d3925166edeef2c is for 10.2 (10.1 had no merge error and FCFS is the default there, so the change is slightly different).

            rgpublic Ranjan Ghosh added a comment -

            I'd suggest adding a note (like the one below 10.1.22) on:

            https://downloads.mariadb.org/mariadb/+releases/

            Perhaps something like: "Updating from 10.2.11 to 10.2.12 is highly recommended for Galera users because previous versions are unstable".
            In addition - IMHO - this should also be mentioned in the release notes. The term "stable" on the above-mentioned page is really misleading for older versions, because we all know they will probably crash sooner or later due to this bug - even with very ordinary use patterns.

            danblack Daniel Black added a comment -

            rgpublic great work. I suggest the description in https://mariadb.com/kb/en/library/mariadb-10130-release-notes/ could be meaningfully expanded (replaced). This is a page humble users like you and me can edit, and it probably serves as a good enough place for people who are looking to upgrade to read.

            rgpublic Ranjan Ghosh added a comment -

            Ah, really cool. Thanks, Daniel. Didn't know that everyone can change these pages. I've added a warning box and a reference to this bug here:

            https://mariadb.com/kb/en/library/mariadb-10212-release-notes/

            For 10.1.x I think the already existing line under "Notable changes" should be sufficient, as the bug doesn't seem to manifest there - so no need to cause panic.

            adrian.tarau Adrian Tarau added a comment -

            It looks like it still happens ....

            innodb_version 5.7.20
            protocol_version 10
            slave_type_conversions
            version 10.2.12-MariaDB
            version_comment MariaDB Server
            version_compile_machine x86_64
            version_compile_os Linux
            version_malloc_library system
            version_ssl_library OpenSSL 1.0.1e-fips 11 Feb 2013
            wsrep_patch_version wsrep_25.21

            What can I provide to help troubleshoot this? It looks like the issue is triggered by a truncate table...

            LOCK WAIT
            MySQL thread id 138641, OS thread handle 139936878089984, query id 139638795 ........ Opening tables
            TRUNCATE TABLE `ds$etl_dimension_discovery_local_standard_1`
            2018-01-24 11:51:38 139938787669760 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1a3598f query: TRUNCATE TABLE `ds$etl_dimension_discovery_local_standard_1`
            TRANSACTION 27482511, ACTIVE 1997 sec

            As soon as this kicks in, everything locks. I cannot reproduce it in a consistent way. Also, when it happens, kill 138641 says "now an owner"... even for root! How is that possible?

            jplindst Jan Lindström (Inactive) added a comment -

            Can I have the full, unedited error log? Setting wsrep_debug=on, waiting until it repeats, and then providing the error log again would also help.

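            For anyone who wants to capture the same diagnostics, a minimal sketch (assuming MariaDB 10.2, where wsrep_debug is a dynamic boolean; it makes the error log very verbose, so turn it off again once the stall has been captured):

            -- Enable verbose wsrep diagnostics at runtime; it can also be set as wsrep_debug=ON under [mysqld] in my.cnf.
            SET GLOBAL wsrep_debug = ON;
            -- Disable it again after the stall has been logged.
            SET GLOBAL wsrep_debug = OFF;
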
            adrian.tarau Adrian Tarau added a comment -

            I'll try ... hard to say when it happens ... hopefully the log will not be GB of data with debug on ....

            adrian.tarau Adrian Tarau added a comment -

            I have 500MB of logs (uncompressed) ... I'll send you an email with a link to download the logs ....

            Forgot to mention: we are using a 3-node cluster (development database) with HAProxy, with one node dedicated to writes. Nothing interesting is logged on the (mostly) read-only nodes. A stress test is started with the database in the same state; after a while it locks, and it is always triggered by a DDL statement (TRUNCATE in this case; in the past I've seen these deadlocks with ALTER TABLE ... ADD PARTITION ...). Once the DDL enters the deadlock, no other query can be executed (and there are plenty of threads/connections available); even queries not related to that table lock up. Same with ALTER TABLE ... ADD PARTITION.

            First reference of a deadlock starts at 2018-01-24 15:39:09 139784560858880 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1e9eb4f query

            neiltembo Neil Skrypuch added a comment -

            I am also still seeing this issue with 10.2.12, although the behaviour has changed somewhat compared to 10.2.9 (the previous version we were running).

            Now with 10.2.12, it seems to only happen on one master at a time (not multiple masters simultaneously). Also, the query that gets regularly printed to the logs is always a DDL statement now, which it was not before.

            The other noteworthy change with 10.2.12 is that the log output is much terser. Instead of a complete dump of SHOW ENGINE INNODB STATUS; every second, we now just see 5 lines per second, like so:

            Jan 18 02:53:12 tm2 mysqld[16437]: 2018-01-18 2:53:12 140246917351168 [Note] InnoDB: WSREP: BF lock wait long for trx:0x76a45d4 query: CREATE TABLE IF NOT EXISTS ...
            Jan 18 02:53:12 tm2 mysqld[16437]: TRANSACTION 124405204, ACTIVE 269206 sec
            Jan 18 02:53:12 tm2 mysqld[16437]: LOCK WAIT
            Jan 18 02:53:12 tm2 mysqld[16437]: MySQL thread id 16, OS thread handle 140246857389824, query id 27596013 creating table
            Jan 18 02:53:12 tm2 mysqld[16437]: CREATE TABLE IF NOT EXISTS ...

            adrian.tarau Adrian Tarau added a comment -

            I do not see a way to re-open the defect ....

            serg Sergei Golubchik added a comment -

            adrian.tarau, please create a new one. You can refer to this one in the description if you'd like or copy-paste from it.

            jplindst Jan Lindström (Inactive) added a comment -

            Updated the correct releases where the actual fix is:

            commit 57cde8ccd19675dc98e3cbacf0ef5c72cb188e49 (HEAD -> 10.1, origin/10.1)
            Author: Jan Lindström <jan.lindstrom@mariadb.com>
            Date: Wed Jul 25 08:21:25 2018 +0300

            MDEV-15822: WSREP: BF lock wait long for trx

            In Galera BF (brute force) transactions may not wait for lock requests
            and normally BF-transaction would select transaction holding conflicting
            locks as a victim for rollback. However, background statistic calculation
            transaction is InnoDB internal transaction and it has no thd i.e. it can't be
            selected as a victim. If background statistics calculation transaction holds
            conflicting locks to statistics tables it will cause BF lock wait long
            error message. Correct way to handle background statistics calculation is to
            acquire thd for transaction but that change is too big for GA-releases and
            there are other reported problems on background statistics calculation.

            This fix avoids adding a table to background statistics calculation if

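
            The commit above identifies the background statistics transaction as the lock holder. Purely as an illustrative sketch (this is not advice given in the ticket, and the table name is just the dummy table from the earlier test case), persistent statistics can be inspected and, if needed, switched off for a hot table while waiting for a release that contains the fix:

            -- Are persistent statistics and automatic recalculation enabled globally?
            SHOW GLOBAL VARIABLES LIKE 'innodb_stats_persistent';
            SHOW GLOBAL VARIABLES LIKE 'innodb_stats_auto_recalc';
            -- Opt a single table out of background statistics recalculation (illustrative only).
            ALTER TABLE dummy STATS_PERSISTENT=0;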

            People

              jplindst Jan Lindström (Inactive)
              rgpublic Ranjan Ghosh
              Votes: 19
              Watchers: 38
