[MDEV-28180] BF lock wait long for trx Created: 2022-03-28 Updated: 2024-01-04 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Galera, Storage Engine - InnoDB |
| Affects Version/s: | 10.6.5, 10.3.34, 10.8.5 |
| Fix Version/s: | 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Khai Ping | Assignee: | Seppo Jaakola |
| Resolution: | Unresolved | Votes: | 4 |
| Labels: | None | ||
| Environment: |
3-node Galera multi-master cluster, MariaDB 10.6.5 and Galera 26.4.9 |
||
| Description |
|
We have been facing a random, intermittent issue with our 3-node Galera multi-master cluster since 10.2. When this BF lock issue happens, the affected node simply blocks/locks out the whole cluster and none of our clients can read or write. The only way to get out of it is to kill the affected node and let it rejoin via IST.
These errors keep repeating in a loop. Is this a known issue? |
| Comments |
| Comment by Karl Dane [ 2022-05-04 ] | |
|
We're encountering the same issue with almost the same setup: 3-node Galera multi-master, running MariaDB 10.3.34, Galera 25.3.35. Everything will be running fine for hours or days, and then one node will get into a bad state, dragging down the rest of the cluster. The logs fill up with hundreds of:
Cluster doesn't recover until the affected node is killed. | |
| Comment by Khai Ping [ 2022-05-09 ] | |
|
We resolved it by reverting wsrep_slave_threads from 2 to 1. | |
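For anyone wanting to try the same workaround, a minimal sketch of that change (`wsrep_slave_threads` is the real MariaDB variable name; the surrounding file layout is illustrative only):

```ini
[mysqld]
# Workaround from this thread: drop back to a single applier thread
wsrep_slave_threads = 1
```

The variable is also dynamic, so it can be changed without a restart via `SET GLOBAL wsrep_slave_threads = 1;`.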
| Comment by Luke Cousins [ 2022-10-20 ] | |
|
We're getting this issue almost weekly on our 3-node cluster running 10.8.5. There's nothing in the logs before a barrage of `InnoDB: WSREP: BF lock wait long for trx:0xdc38a55 query:...` messages, and to clear the issue the node needs to be force-killed, have its data dir cleared, and do a full SST; IST fails. Until we kill the node, the entire cluster is locked up. What can we do to help get this fixed? How can we get you information to help fix it? Thanks. | |
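Not from the thread itself, but a sketch of diagnostics commonly gathered for Galera hangs like this (all standard MariaDB/Galera commands; capture them on the stuck node before killing it):

```sql
-- Cluster membership and applier state on the affected node
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';

-- What the applier and client threads are blocked on
SHOW FULL PROCESSLIST;

-- InnoDB lock waits, including the transactions named in the BF lock wait messages
SHOW ENGINE INNODB STATUS\G
```

These won't clear the hang, but they give developers the lock-wait and flow-control picture usually needed to analyze such reports.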
| Comment by Kin [ 2023-04-19 ] | |
|
Thanks @Khai Ping, it appears that wsrep_slave_threads was set to a default value of 4 in our Bitnami MariaDB Galera Helm chart. | |
| Comment by Khai Ping [ 2023-04-19 ] | |
|
@kin, do you mean you also faced the same issue as us and resolved it by setting wsrep_slave_threads to 1 too? | |
| Comment by Kin [ 2023-04-19 ] | |
|
@Khai Ping, yes. My wsrep_slave_threads had a value of 4 while my pod had a CPU request/limit of 500 mCPU. This caused one or more nodes to have wsrep issues like "BF lock wait long" and "WSREP: BF applier failed to open_and_lock_tables:", which led to those nodes falling out of quorum. I first tested with wsrep_slave_threads=1, which ran stably despite the 500 mCPU limit. Then I set the CPU request/limit to 3 CPUs in my pod spec and set wsrep_slave_threads=3. I have been running and testing this configuration for two days and it is stable. All of this is based on the official documentation on "wsrep_slave_threads", which states it should match the number of CPU cores. | |
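A minimal sketch of the configuration Kin describes (the variable name is real; the file layout and comment are illustrative, not taken from the report):

```ini
[mysqld]
# Matched to the pod's CPU request/limit of 3 CPUs, following the
# guidance that wsrep_slave_threads should equal the number of cores
wsrep_slave_threads = 3
```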
| Comment by Khai Ping [ 2023-04-19 ] | |
|
@kin, I see, great to hear that it resolved the issue for you. | |
| Comment by Kin [ 2023-05-17 ] | |
|
@Khai Ping, unfortunately a write conflict occurred after running it for two weeks without issues. Going to set it back to 1 and see if it runs stably for a longer period. |