I am using version 10.3.9 but I just encountered the BF error. It blocked all transactions and I needed to restart the node. Is there any configuration in the my.cnf I have to adjust to avoid the error?
The following are some of the error messages:
2018-11-22 8:53:28 0 [Note] InnoDB: WSREP: BF lock wait long for trx:0x2129df query: DELETE FROM pe3logex1.tmp_app_to_db_status_report WHERE site='SP'¢Yõ[^S^B
TRANSACTION 2173407, ACTIVE 42084 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 2, OS thread handle 140537284146944, query id 5316319 Delete_rows_log_event::find_row(278043)
DELETE FROM pe3logex1.tmp_app_to_db_status_report WHERE site='SP'
2018-11-22 8:53:29 0 [Note] InnoDB: WSREP: BF lock wait long for trx:0x2129df query: DELETE FROM pe3logex1.tmp_app_to_db_status_report WHERE site='SP'¢Yõ[^S^B
TRANSACTION 2173407, ACTIVE 42085 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 2, OS thread handle 140537284146944, query id 5316319 Delete_rows_log_event::find_row(278043)
DELETE FROM pe3logex1.tmp_app_to_db_status_report WHERE site='SP'
=====================================
2018-11-22 08:53:29 0x7fd0f9431700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 16 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 8323 srv_active, 0 srv_shutdown, 59026 srv_idle
srv_master_thread log flush and writes: 67349
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 212583
OS WAIT ARRAY INFO: signal count 659238
RW-shared spins 0, rounds 1190209, OS waits 78610
RW-excl spins 0, rounds 10315112, OS waits 63310
RW-sx spins 213949, rounds 3318375, OS waits 31377
Spin rounds per wait: 1190209.00 RW-shared, 10315112.00 RW-excl, 15.51 RW-sx
------------
TRANSACTIONS
------------
Trx id counter 2173554
Purge done for trx's n:o < 2173504 undo n:o < 0 state: running but idle
History list length 3
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 422011681601128, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 422011681596232, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 2173426, ACTIVE 41969 sec
4 lock struct(s), heap size 1136, 129 row lock(s), undo log entries 258
MySQL thread id 17, OS thread handle 140537170228992, query id 5333660 kpcp-portalapp1.egis.hksarg 10.88.223.9 e3logexp10 Init
commit
TABLE LOCK table `pe3logex1`.`db_instance_restart_time_to_portal` trx id 2173426 lock mode IX
RECORD LOCKS space id 22 page no 3 n bits 272 index PRIMARY of table `pe3logex1`.`db_instance_restart_time_to_portal` trx id 2173426 lock_mode X locks rec but not gap
RECORD LOCKS space id 22 page no 4 n bits 256 index PRIMARY of table `pe3logex1`.`db_instance_restart_time_to_portal` trx id 2173426 lock_mode X locks rec but not gap
Record lock, heap no 2 PHYSICAL RECORD: n_fields 12; compact format; info bits 32
0: len 8; hex 800000000001590c; asc Y ;;
1: len 6; hex 0000002129f2; asc !) ;;
2: len 7; hex 00000008230110; asc # ;;
3: len 5; hex 4353464953; asc CSFIS;;
4: len 2; hex 4b50; asc KP;;
5: len 10; hex 6373666973696e747031; asc csfisintp1;;
6: len 11; hex 703063736669736d646231; asc p0csfismdb1;;
7: len 2; hex 4b50; asc KP;;
8: len 5; hex 99a05b26c2; asc [& ;;
9: len 3; hex 52554e; asc RUN;;
10: len 5; hex 99a16b52c2; asc kR ;;
11: len 5; hex 99a16b5300; asc kS ;;
Record lock, heap no 3 PHYSICAL RECORD: n_fields 12; compact format; info bits 32
0: len 8; hex 800000000001590e; asc Y ;;
1: len 6; hex 0000002129f2; asc !) ;;
2: len 7; hex 0000000823013a; asc # :;;
3: len 4; hex 45504159; asc EPAY;;
4: len 2; hex 4b50; asc KP;;
5: len 9; hex 65706179696e747031; asc epayintp1;;
6: len 10; hex 7031657061796d646231; asc p1epaymdb1;;
7: len 2; hex 4b50; asc KP;;
8: len 5; hex 99a0480ade; asc H ;;
9: len 3; hex 52554e; asc RUN;;
10: len 5; hex 99a16b52c2; asc kR ;;
11: len 5; hex 99a16b5300; asc kS ;;
Record lock, heap no 4 PHYSICAL RECORD: n_fields 12; compact format; info bits 32
0: len 8; hex 8000000000015910; asc Y ;;
1: len 6; hex 0000002129f2; asc !) ;;
2: len 7; hex 00000008230164; asc # d;;
3: len 5; hex 4f47455053; asc OGEPS;;
4: len 2; hex 4b50; asc KP;;
5: len 10; hex 6f67657073696e747031; asc ogepsintp1;;
6: len 12; hex 70306f676570736d64627332; asc p0ogepsmdbs2;;
7: len 2; hex 5350; asc SP;;
8: len 5; hex 99a04e0aed; asc N ;;
9: len 3; hex 52554e; asc RUN;;
10: len 5; hex 99a16b52c2; asc kR ;;
11: len 5; hex 99a16b5300; asc kS ;;
Just for the records, a copy from IM:
I do not see a problem in your patch. If it fixes the problem, then it should be ok. Some minor things: make sure that the checks will not crash due to trx or trx->mysql_thd being NULL (I am not sure if that is possible). Also in the test - you do not need galera_bf_background_statistics.opt with --innodb_stats_persistent=ON inside, it should be the default and you already check @@innodb_stats_persistent at the beginning of the test. In "--eval SELECT $wsrep_local_bf_aborts_after - $wsrep_local_bf_aborts_before = 1 AS ..." I would remove the " = 1" part because it would mask any failure as 0 instead of providing the actual difference for analysis. E.g. if you just do select new - old as ... and the .result file contains 1 and anything else is returned, like 0 or 2 or 3 or 5 you will actually see that value in the failure message from mtr.