Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version: 10.2.5
- Fix Version: None
- Environment: Ubuntu 16.04.2 LTS (Xenial)
Description
Hi there,
Every few days our cluster stalls. You always know doomsday has arrived when this message appears in the log:
May 17 06:46:25 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
This message then repeats a lot and other weird stuff finally appears (see below). The cluster never recovers. We always need to restart all nodes. I find this especially surprising, since there are various timeouts defined. If there is a deadlock, I wonder why it doesn't resolve automatically after some time by aborting with some error message. Googling for this message to find out whether it might be an application problem didn't yield any results either.
May 17 06:46:26 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:27 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:28 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:29 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:30 uhu1 mysqld[27227]: message repeated 6 times: [ WSREP: BF lock wait long]
|
May 17 06:46:31 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:32 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:33 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:34 uhu1 mysqld[27227]: message repeated 8 times: [ WSREP: BF lock wait long]
|
May 17 06:46:35 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:36 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
|
May 17 06:46:37 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:38 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:39 uhu1 mysqld[27227]: message repeated 2 times: [ WSREP: BF lock wait long]
|
May 17 06:46:40 uhu1 mysqld[27227]: WSREP: BF lock wait long
|
May 17 06:46:40 uhu1 mysqld[27227]: =====================================
|
May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 06:46:40 0x7f52c3cdd700 INNODB MONITOR OUTPUT
|
May 17 06:46:40 uhu1 mysqld[27227]: =====================================
|
May 17 06:46:40 uhu1 mysqld[27227]: Per second averages calculated from the last 16 seconds
|
May 17 06:46:40 uhu1 mysqld[27227]: -----------------
|
May 17 06:46:40 uhu1 mysqld[27227]: BACKGROUND THREAD
|
May 17 06:46:40 uhu1 mysqld[27227]: -----------------
|
May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread loops: 570080 srv_active, 0 srv_shutdown, 120059 srv_idle
|
May 17 06:46:40 uhu1 mysqld[27227]: srv_master_thread log flush and writes: 690137
|
May 17 06:46:40 uhu1 mysqld[27227]: ----------
|
May 17 06:46:40 uhu1 mysqld[27227]: SEMAPHORES
|
May 17 06:46:40 uhu1 mysqld[27227]: ----------
|
May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: reservation count 2567993
|
May 17 06:46:40 uhu1 mysqld[27227]: OS WAIT ARRAY INFO: signal count 1741581
|
May 17 06:46:40 uhu1 mysqld[27227]: RW-shared spins 0, rounds 1276685, OS waits 365942
|
May 17 06:46:40 uhu1 mysqld[27227]: RW-excl spins 0, rounds 16000314, OS waits 198061
|
May 17 06:46:40 uhu1 mysqld[27227]: RW-sx spins 210079, rounds 4201641, OS waits 50893
|
May 17 06:46:40 uhu1 mysqld[27227]: Spin rounds per wait: 1276685.00 RW-shared, 16000314.00 RW-excl, 20.00 RW-sx
|
May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
|
May 17 06:46:40 uhu1 mysqld[27227]: LATEST DETECTED DEADLOCK
|
May 17 06:46:40 uhu1 mysqld[27227]: ------------------------
|
May 17 06:46:40 uhu1 mysqld[27227]: 2017-05-17 03:24:35 0x7f52add94700
|
May 17 06:46:40 uhu1 mysqld[27227]: *** (1) TRANSACTION:
|
May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857376, ACTIVE 0 sec inserting
|
May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
|
May 17 06:46:40 uhu1 mysqld[27227]: LOCK WAIT 3 lock struct(s), heap size 1136, 2 row lock(s)
|
May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704966, OS thread handle 0x7f52ae768700, query id 400303172 localhost 127.0.0.1 eposservicescom update
|
May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1193996601591ba65393eb33.09632175', '1494984305.6058')
|
May 17 06:46:40 uhu1 mysqld[27227]: *** (1) WAITING FOR THIS LOCK TO BE GRANTED:
|
May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857376 lock_mode X locks rec but not gap waiting
|
May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
|
May 17 06:46:40 uhu1 mysqld[27227]: 0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
|
May 17 06:46:40 uhu1 mysqld[27227]: 1: len 6; hex 000011a26cb5; asc l ;;
|
May 17 06:46:40 uhu1 mysqld[27227]: 2: len 7; hex 40000006a82965; asc @ )e;;
|
May 17 06:46:40 uhu1 mysqld[27227]: 3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
|
May 17 06:46:40 uhu1 mysqld[27227]: 4: len 8; hex 4850f49be946d641; asc HP F A;;
|
May 17 06:46:40 uhu1 mysqld[27227]: *** (2) TRANSACTION:
|
May 17 06:46:40 uhu1 mysqld[27227]: TRANSACTION 295857375, ACTIVE 0 sec inserting
|
May 17 06:46:40 uhu1 mysqld[27227]: mysql tables in use 1, locked 1
|
May 17 06:46:40 uhu1 mysqld[27227]: 3 lock struct(s), heap size 1136, 2 row lock(s)
|
May 17 06:46:40 uhu1 mysqld[27227]: MySQL thread id 1704967, OS thread handle 0x7f52add94700, query id 400303173 localhost 127.0.0.1 eposservicescom update
|
May 17 06:46:40 uhu1 mysqld[27227]: INSERT INTO semaphore (name, value, expire) VALUES ('state:Drupal\\Core\\Cache\\CacheCollector', '1474565599591ba65393ebe9.44223691', '1494984305.6058')
|
May 17 06:46:40 uhu1 mysqld[27227]: *** (2) HOLDS THE LOCK(S):
|
May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock mode S locks rec but not gap
|
May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
|
May 17 06:46:40 uhu1 mysqld[27227]: 0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
|
May 17 06:46:40 uhu1 mysqld[27227]: 1: len 6; hex 000011a26cb5; asc l ;;
|
May 17 06:46:40 uhu1 mysqld[27227]: 2: len 7; hex 40000006a82965; asc @ )e;;
|
May 17 06:46:40 uhu1 mysqld[27227]: 3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
|
May 17 06:46:40 uhu1 mysqld[27227]: 4: len 8; hex 4850f49be946d641; asc HP F A;;
|
May 17 06:46:40 uhu1 mysqld[27227]: *** (2) WAITING FOR THIS LOCK TO BE GRANTED:
|
May 17 06:46:40 uhu1 mysqld[27227]: RECORD LOCKS space id 4205 page no 3 n bits 72 index PRIMARY of table "eposservicescom"."semaphore" trx id 295857375 lock_mode X locks rec but not gap waiting
|
May 17 06:46:40 uhu1 mysqld[27227]: Record lock, heap no 2 PHYSICAL RECORD: n_fields 5; compact format; info bits 32
|
May 17 06:46:40 uhu1 mysqld[27227]: 0: len 30; hex 73746174653a44727570616c5c436f72655c43616368655c436163686543; asc state:Drupal\Core\Cache\CacheC; (total 38 bytes);
|
May 17 06:46:40 uhu1 mysqld[27227]: 1: len 6; hex 000011a26cb5; asc l ;;
|
May 17 06:46:40 uhu1 mysqld[27227]: 2: len 7; hex 40000006a82965; asc @ )e;;
|
May 17 06:46:40 uhu1 mysqld[27227]: 3: len 30; hex 36353632373439313735393162613635316337623362302e313335333631; asc 656274917591ba651c7b3b0.135361; (total 32 bytes);
|
May 17 06:46:40 uhu1 mysqld[27227]: 4: len 8; hex 4850f49be946d641; asc HP F A;;
|
May 17 06:46:40 uhu1 mysqld[27227]: *** WE ROLL BACK TRANSACTION (2)
|
Attachments
- mysql_backtrace.log (140 kB)
Issue Links
- includes
  - MDEV-14401 Stored procedure that declares a handler that catches ER_LOCK_DEADLOCK error causes thd->is_error() assertion (Closed)
- is caused by
  - MDEV-11039 Add new scheduling algorithm for reducing tail latencies (Closed)
- relates to
  - MDEV-14793 Document innodb-lock-schedule-algorithm details (Closed)
  - MDEV-15747 WSREP: BF lock wait long for trx (Closed)
  - MDEV-15154 WSREP: BF lock wait long after a TRUNCATE TABLE (Closed)
  - MDEV-16664 InnoDB: Failing assertion: !other_lock || wsrep_thd_is_BF ... if innodb_lock_schedule_algorithm=VATS (Closed)
Activity
Does anybody at least have an idea what causes this? Our cluster sometimes stalls multiple times a day and needs to be restarted, which is very, very ugly. I can't say for sure, but I have the feeling this bug has even been introduced in recent versions (10.2.3, 10.2.4 ... ?). If only I knew a workaround or some setting that made this error at least less likely to appear. Right now, it's not really usable. I'm even considering turning off Galera and letting MariaDB run as a single node for the time being. My current config, in case that's of any help:
[mysqld]

# Enough buffer for data, indexes etc
innodb_buffer_pool_size=4G

# Do not wait too long for locks - connections might build up (Especially important for Drupal semaphore table)
innodb_lock_wait_timeout=10

# Recommended by Drupal
max_allowed_packet=512M

# Try to prevent crashes
skip-log-bin

# Purge faster :-)
innodb_purge_threads = 16

# Replicate faster :-)
wsrep_slave_threads = 32

# Reduce dead-locks
transaction-isolation = READ-COMMITTED

# Query Cache for faster performance
query_cache_type = 1
query_cache_size = 128M # Should not be too large to avoid cache contention

# mysql settings mandatory for galera
binlog_format=ROW
binlog_row_image=MINIMAL
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
bind-address=0.0.0.0
innodb_flush_log_at_trx_commit=0

# Accept up to 300 incoming connections but use the thread pool to prevent too many simultaneous connections actually working at the same time
max_connections=300
thread-handling=pool-of-threads
thread_pool_size=32
extra-port=3307
extra-max-connections=1

# galera settings
wsrep_on=ON
wsrep_provider=/usr/lib/galera/libgalera_smm.so
wsrep_cluster_name="uhu_cluster"
wsrep_cluster_address="gcomm://node1,node2"
wsrep_node_address=192.168.0.1
wsrep_provider_options='gcache.size=2G'

enforce_storage_engine=InnoDB
sql_mode=NO_AUTO_CREATE_USER
Bad news: we still have this issue.
At the moment I have configured the load balancer to send requests to only one node. That seems to work for a while; we have had no crash for 3 days.
Waiting for an update/patch...
Unfortunately, even if I send requests only to one node, our cluster still crashes. Currently, I've got only a single node running. This defeats the whole purpose of using MariaDB/Galera in the first place, though. It is, however, stable.
When researching this bug, I find many different anecdotal reports over the years:
- https://groups.google.com/forum/#!topic/codership-team/6wsv_jUmc5w
- https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1671542
- https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1197771
- https://www.percona.com/forums/questions-discussions/percona-xtradb-cluster/43848-cluster-locks-insert-update-issues
- https://www.mail-archive.com/maria-discuss@lists.launchpad.net/msg02441.html
None with a real solution or any explanation of why (i.e. under which circumstances) this happens. The error seems very random to me. Sometimes it happens at night when there is hardly any load on the server at all. All reports have in common that the database never seems to recover by itself. In all reported cases a restart seemed to be required. Therefore, at the very least, I guess the database should just shut down (signal 11) in case this error happens. This would be far better than the current situation, where the database is happily accepting connections which then subsequently just hang/freeze forever. I think this bug is particularly evil just because of that property: we use Galera in particular to have an automatic failover if something goes wrong. This bug stalls the database, and because it isn't properly shut down, the second node cannot take over and hangs as well.
Please try to get a backtrace
Also what galera version are you running? Have you tried 25.3.25?
edit: camera -> galera (groan - phone autospell)
You meant Galera version, I assume? Well, I use the one that comes with the OS (Ubuntu) AFAICT:
~ # dpkg -l | grep galera
ii galera-3 25.3.19-xenial amd64 Replication framework for transactional applications
I've also added the MariaDB PPAs but the most recent version I can find there is 25.3.20. How can I upgrade to a newer version? Is there a separate PPA?
Furthermore, I understand why you'd like to have a backtrace, although it's obviously very, very difficult to get in this case. I'd first need to enable the second node again. Then I'd need to compile a separate MariaDB version. Then wait until it hangs. It doesn't crash, it only hangs. Sometimes this is only once every 1-2 days, sometimes in the very early morning hours. Then I would need to send it a "kill -6", if I understand correctly, and grab the backtrace. As this is a live server with important websites on it, that's not an easy feat. I'll need some time to make that possible. In the meantime: any insights into why this might happen would be much appreciated. Right now, I'm totally in the dark on what things I should avoid or watch out for...
OK, I wrote a small Bash script as an intermediate step (see below, systemd required). Perhaps this is also useful to others. I will add this to the crontab to run every minute. It watches the server and shuts the node down if the problem occurs. This hopefully makes it possible again to run with two nodes unsupervised, because the system should recover automatically. Now, next step: compile MariaDB with debug symbols. Let's see how this works out...
#!/bin/bash

# Don't do anything if MariaDB isn't running or is already in the process of shutting down
if ! (systemctl status mariadb | grep --quiet 'active (running)'); then
    echo "MariaDB not running";
    exit;
fi

# Get the start date/time of MariaDB and the date/time of 10 minutes ago
ts_start=$(systemctl show -p ExecMainStartTimestamp mariadb | cut -d "=" -f 2 | awk '{print $2 " " $3}')
ts_ago=$(date --date="10 minutes ago" +"%Y-%m-%d %H:%M:%S")

# Use whatever is greater as the log start date (i.e. get the logs only for the current run, and only the most recent 10 minutes)
if [[ "$ts_start" > "$ts_ago" ]]; then
    ts=$ts_start;
else
    ts=$ts_ago;
fi

# If the log contains "BF lock wait long", the database is frozen and needs to be shut down.
if (journalctl -u mariadb --since "$ts" | grep --quiet "BF lock wait long"); then
    echo "BF lock wait detected! Shutting down MariaDB.";
    systemctl --no-block stop mariadb
fi
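For reference, a possible crontab entry for this (the path is hypothetical - adjust it to wherever the script is saved):
# run the watchdog every minute and append its output to a log
* * * * * /usr/local/bin/galera-bf-watchdog.sh >> /var/log/galera-bf-watchdog.log 2>&1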
@danblack: I hope it will be sufficient to build MariaDB with just the additional debug symbols to get the backtrace, right? I won't need the 30% slower full-debug version? Obviously, I'd like to avoid that, as it is a production server. I'm referring to this page here:
https://mariadb.com/kb/en/mariadb/compiling-mariadb-for-debugging/
There are debug symbol packages for the existing install in the mariadb repos that should be installable. i.e.:
http://mirror.aarnet.edu.au/pub/MariaDB/repo/10.2/ubuntu/pool/main/m/mariadb-10.2/mariadb-server-core-10.2-dbgsym_10.2.6+maria~xenial_amd64.deb
Don't forget galera syms too:
http://mirror.aarnet.edu.au/pub/MariaDB/repo/10.2/ubuntu/pool/main/g/galera-3/galera-3-dbg_25.3.20-xenial_amd64.deb
Otherwise use -DCMAKE_BUILD_TYPE=RelWithDebInfo and it won't be any slower.
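A build along those lines could look like this (a sketch only; the source path and parallelism are placeholders):
# configure an optimized build that keeps debug symbols, then compile
cmake /path/to/mariadb-source -DCMAKE_BUILD_TYPE=RelWithDebInfo
make -j"$(nproc)"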
Nice script. An alternate way:
journalctl -u mariadb -f | while read line
do
    if [[ $line =~ "BF lock wait long" ]] ....
done
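A completed version of that loop might look like this (a sketch; the shutdown action is borrowed from the script above):
# follow the journal and stop MariaDB as soon as the message shows up
journalctl -u mariadb -f | while read -r line
do
    if [[ $line =~ "BF lock wait long" ]]; then
        echo "BF lock wait detected! Shutting down MariaDB."
        systemctl --no-block stop mariadb
        break
    fi
done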
OK, I've now installed the dbgsym packages. I wondered if I needed to restart the DB to actually "enable" the symbols. Well, I did it anyway just to be sure.
D'oh - your script is of course waaaay shorter than mine. Thanks for sharing. I've added the line to create the backtrace, which I got from here:
Although, I think the docs were a bit wrong/misleading (didn't work for me) and "program" should really be "-p". I now have this line to get the backtrace:
gdb --batch --eval-command="thread apply all bt" -p $(pgrep -xn mysqld) > /root/mysql_backtrace.log
When I ran it, it created a backtrace most of the time. Sometimes, though, it just hung at the end and I had to kill gdb. Perhaps I was impatient. Hope this won't happen during the BF lock wait.
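If gdb hanging is a concern, one option (a sketch, not something tested here) is to bound the run with timeout and write each backtrace to its own file:
# give gdb at most two minutes, then kill it; keep one output file per attempt
timeout 120 gdb --batch --eval-command="thread apply all bt" -p "$(pgrep -xn mysqld)" > "/root/mysql_backtrace_$(date +%Y%m%d_%H%M%S).log" 2>&1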
OK, now that everything's set up, I guess we'll just have to wait until the rare striped tiger runs in front of the camera :-/
Thanks again, @danblack, for all the helpful info. It's much appreciated.
Ha! Finally the crash happened. The script worked perfectly. The DB was shut down. The backtrace was automatically created. Here it is:
Exactly the same here: just upgraded a 3-node cluster from 10.1.25 to 10.2.7. After 4 minutes I got "WSREP: BF lock wait long" and the cluster hung for all updates/inserts until the hanging node was restarted. After less than 5 minutes, I got the same message again. Any good hints on downgrading back to 10.1?
Debian 7.1,
Linux mariadb-node1 3.2.0-4-amd64 #1 SMP Debian 3.2.81-2 x86_64 GNU/Linux
Thanks in advance
Same issue here, but on CentOS 7.3.
Linux taskdb02 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
KVM
Jul 14 14:50:22 vmtaskdb03 mysqld: mysql tables in use 1, locked 1
Jul 14 14:50:22 vmtaskdb03 mysqld: 5 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1
Jul 14 14:50:22 vmtaskdb03 mysqld: MySQL thread id 2027, OS thread handle 0x7f5196e33700, query id 30214 10.232.10.31 taskitud_ianis query end
Jul 14 14:50:22 vmtaskdb03 mysqld: INSERT INTO `mw_campaign_track_open` (`campaign_id`, `subscriber_id`, `ip_address`, `user_agent`, `date_added`) VALUES (6351, 2314769, '90.9.29.141', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:54.0) Gecko/20100101 Firefox/54.0', NOW())
Jul 14 14:50:22 vmtaskdb03 mysqld: TABLE LOCK table `taskitud_database`.`mw_campaign_track_open` trx id 16643769 lock mode IX
Jul 14 14:50:22 vmtaskdb03 mysqld: TABLE LOCK table `taskitud_database`.`mw_campaign` trx id 16643769 lock mode IS
Jul 14 14:50:22 vmtaskdb03 mysqld: RECORD LOCKS space id 8 page no 76 n bits 120 index PRIMARY of table `taskitud_database`.`mw_campaign` trx id 16643769 lock mode S locks rec but not gap
Jul 14 14:50:45 vmtaskdb03 journal: Suppressed 5655 messages from /system.slice/mariadb.service
Jul 14 14:50:45 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:46 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:47 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:48 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:49 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:50 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:51 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:52 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:53 vmtaskdb03 mysqld: WSREP: BF lock wait long
Jul 14 14:50:54 vmtaskdb03 mysqld: WSREP: BF lock wait long
thanks
We have also received this error running on RedHat 7.3, MariaDB-server-10.2.7-1.el7, and galera-25.3.20-1.rhel7. The error only occurred on a system with high load and did not occur in the testing environments that have a much lower load. Does this bug affect all of 10.2? Does this bug affect any 10.1.x release?
Well, I'm only a poor user - no MariaDB guru, but what I can say is this:
1) There are some anecdotal reports of people upgrading from 10.1.x to 10.2.x who suddenly encountered this bug. This suggests that the error probably didn't happen on 10.1.
2) I myself never used 10.1 with Galera, so I can't tell for sure. But I've been using Galera with many earlier 10.2.x versions and always had this bug as far as I can remember. This suggests that all of 10.2 is probably affected, or at least that it won't help to downgrade, say, from 10.2.7 to 10.2.5 or something like that as a workaround.
3) As stated in one of my previous comments, when researching this message on the web it seems that it does indeed show up in many varieties/versions of MySQL over the years. This suggests that the message might only be a symptom that has/had multiple causes. I assume, though, that the current influx of error reports all have the same underlying reason.
Up until now, I haven't the slightest idea why it happens. The word "lock" in the error message, however, seems to suggest that it has something to do with locking problems. That is, the more InnoDB locking is going on, the more conflicts/deadlocks occur, and the more likely it could be that the error shows up. This is perhaps why you're experiencing the bug under high load. Just wildly guessing here, of course. On our server, the bug happens at arbitrary times - sometimes in the middle of the night, not necessarily under exceptionally high load.
In a previous comment, I posted my MySQL configuration. I'd like to know if someone sees something striking when comparing it to their config. Some correlation - anything.
I find it strange that there seems to be not much concern about this bug and that the current MariaDB version is even considered "stable". For us, it renders the whole Galera feature totally useless - it's a huge problem. We are currently running without proper failover. The reason for this calm is probably that only very few people are affected by this, even though I assume there is a sizable population out there using a current MariaDB version with Galera - especially now that it's out of beta. This leads to the question: what are they doing differently than us? I'd really like to find out, because there doesn't seem to be much progress on this otherwise.
Thanks, Ranjan, for your response. We also upgraded from an older version of 10.1.x. We were trying to fix another bug and then ran into this much bigger bug. Sorry, I didn't include that info. We are trying to assess a downgrade and want to be above 10.1.23 if possible, but we would like to know beforehand if the bug exists, instead of going through all the work of a downgrade and then being surprised by the same bug that we are trying to get away from. Hopefully, we can get a response from MariaDB to confirm this.
We used a lot of 10.1 versions up to 10.1.25 in a three node cluster and never had that error. After upgrading to 10.1.7 we got that error every 5-15 minutes.
Also, Ralf, just to clarify: you got this bug after you upgraded to 10.2.7, correct? You said 10.1.7 and it may just be a typo.
Note: we used several versions of MariaDB within the 10.1 major release; that issue never occurred.
After upgrading to 10.2 we are facing this issue.
So I think a downgrade to 10.1 can be a workaround.
Got the same issue with a fresh install of mariadb-server 10.2.7 from the MariaDB repo. The server OS is Ubuntu 16.04.
Some related logs:
Aug 15 22:38:22 mariadb mysqld[13566]: WSREP: BF lock wait long
Aug 15 22:38:27 mariadb mysqld[13566]: message repeated 5 times: [ WSREP: BF lock wait long]
Aug 15 22:38:27 mariadb mysqld[13566]: =====================================
Aug 15 22:38:27 mariadb mysqld[13566]: 2017-08-15 22:38:27 0x7fbfdb84d700 INNODB MONITOR OUTPUT
Aug 15 22:38:27 mariadb mysqld[13566]: =====================================
Basically the same log as @Michael Veber posted.
Could this perhaps be related to MDEV-13331? I am experiencing the same issue and noticed that MDEV-13331 was introduced in 10.2.2 and fixed in 10.2.8. I find that I hit the bug if the application tied to our DB (JIRA) attempts to make a ton of updates to the same row at the same time resulting in a row lock. It would make sense if update got a similar patch as drop from upstream and was ignoring innodb_lock_wait_timeout.
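A quick way to sanity-check whether innodb_lock_wait_timeout is honoured for plain UPDATEs (a sketch only; the table, schema and credentials are placeholders):
# create a throw-away row, then hold a row lock in one session for 60 seconds
mysql test -e "CREATE TABLE IF NOT EXISTS t (a INT PRIMARY KEY, b INT); INSERT IGNORE INTO t VALUES (1,0);"
mysql test -e "START TRANSACTION; UPDATE t SET b=b+1 WHERE a=1; DO SLEEP(60); COMMIT;" &
sleep 2
# with innodb_lock_wait_timeout=10 this should abort with ER_LOCK_WAIT_TIMEOUT (error 1205) after roughly 10 seconds
time mysql test -e "UPDATE t SET b=b+2 WHERE a=1;"
wait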
Very interesting idea. Makes a whole lot of sense to me. There is a DELETE CASCADE. It locks. It never times out due to the bug. The lock is never released. More and more stuff that's supposed to be replicated is piling up due to the lock. Boom!
Unfortunately I cannot upgrade our cluster to 10.2.8 at the moment - perhaps in the forthcoming weeks. Would be really helpful, though, if someone else could verify this earlier. Especially those, whose cluster is shutting down after a few minutes. With our cluster, I'd probably have to wait for days to see if the error is really resolved or not.
Ah. Bummer! I used this call on my database:
select concat(table_name, '.', column_name) as 'foreign key',
       concat(referenced_table_name, '.', referenced_column_name) as 'references'
from information_schema.key_column_usage
where referenced_table_name is not null;
This is supposed to show all tables with foreign keys. Unfortunately, there are none. So MDEV-13331 can at least not be the only reason AFAICT, because my DB keeps crashing anyway.
Just checked 10.2.8 with our 3-node cluster, and as soon as we write to two or more nodes instead of only one, we get the same error as with the other 10.2 versions.
Aug 24 07:22:46 mysql-node2 mysqld: WSREP: BF lock wait long
Aug 24 07:22:46 mysql-node2 mysqld:
Aug 24 07:22:46 mysql-node2 mysqld: =====================================
Aug 24 07:22:46 mysql-node2 mysqld: 2017-08-24 07:22:46 0x7f4e89a4b700 INNODB MONITOR OUTPUT
Aug 24 07:22:46 mysql-node2 mysqld: =====================================
Aug 24 07:22:46 mysql-node2 mysqld: Per second averages calculated from the last 15 seconds
Aug 24 07:22:46 mysql-node2 mysqld: -----------------
Aug 24 07:22:46 mysql-node2 mysqld: BACKGROUND THREAD
Aug 24 07:22:46 mysql-node2 mysqld: -----------------
Aug 24 07:22:46 mysql-node2 mysqld: srv_master_thread loops: 40245 srv_active, 0 srv_shutdown, 25086 srv_idle
Aug 24 07:22:46 mysql-node2 mysqld: srv_master_thread log flush and writes: 65331
Aug 24 07:22:46 mysql-node2 mysqld: ----------
Aug 24 07:22:46 mysql-node2 mysqld: SEMAPHORES
Aug 24 07:22:46 mysql-node2 mysqld: ----------
Aug 24 07:22:46 mysql-node2 mysqld: OS WAIT ARRAY INFO: reservation count 520504
Aug 24 07:22:46 mysql-node2 mysqld: OS WAIT ARRAY INFO: signal count 1030299
Aug 24 07:22:46 mysql-node2 mysqld: RW-shared spins 0, rounds 1439898, OS waits 149766
Aug 24 07:22:46 mysql-node2 mysqld: RW-excl spins 0, rounds 14688668, OS waits 112658
Aug 24 07:22:46 mysql-node2 mysqld: RW-sx spins 410293, rounds 5591848, OS waits 49179
Aug 24 07:22:46 mysql-node2 mysqld: Spin rounds per wait: 1439898.00 RW-shared, 14688668.00 RW-excl, 13.63 RW-sx
Aug 24 07:22:46 mysql-node2 mysqld: ------------------------
Aug 24 07:22:46 mysql-node2 mysqld: LATEST FOREIGN KEY ERROR
Aug 24 07:22:46 mysql-node2 mysqld: ------------------------
Aug 24 07:22:46 mysql-node2 mysqld: 2017-08-24 06:50:26 0x7f53fe4e9700 Transaction:
Aug 24 07:22:46 mysql-node2 mysqld: TRANSACTION 3020224898, ACTIVE 0 sec inserting
Aug 24 07:22:46 mysql-node2 mysqld: mysql tables in use 2, locked 2
Aug 24 07:22:46 mysql-node2 mysqld: 5 lock struct(s), heap size 1136, 2 row lock(s)
Aug 24 07:22:46 mysql-node2 mysqld: MySQL thread id 2, OS thread handle 0x7f53fe4e9700, query id 5813680 copy to tmp table
Aug 24 07:22:46 mysql-node2 mysqld: ALTER TABLE `catalog_product_flat_1` ADD CONSTRAINT `FK_CAT_PRD_FLAT_1_ENTT_ID_CAT_PRD_ENTT_ENTT_ID` FOREIGN KEY (`entity_id`) REFERENCES `catalog_product_entity` (`entity_id`) ON DELETE CASCADE ON UPDATE CASCADE
Aug 24 07:22:46 mysql-node2 mysqld: Foreign key constraint fails for table `magento_felt_b2b`.`#sql-4c6d_2`:
Aug 24 07:22:46 mysql-node2 mysqld: ,
Aug 24 07:22:46 mysql-node2 mysqld: CONSTRAINT `FK_CAT_PRD_FLAT_1_ENTT_ID_CAT_PRD_ENTT_ENTT_ID` FOREIGN KEY (`entity_id`) REFERENCES `catalog_product_entity` (`entity_id`) ON DELETE CASCADE ON UPDATE CASCADE in parent table, in index PRIMARY tuple:
Sorry, I should have been more clear. 10.2.8 fixed a bug that was pulled in from upstream MySQL that caused MariaDB to ignore timeouts for DROP operations. I'm wondering if a similar piece of code was pulled in at the same time for other operations like UPDATE and SELECT?
Well, the issue you mentioned earlier, MDEV-13331, is only about DELETE CASCADE with FOREIGN KEYS (FK). At least this is what it says in the title. And this is, indeed, the bug Ralf sent the logs for. Although it seems to happen to him with 10.2.8 as well (@Ralf: perhaps you'd like to mention over there as well that it's still not fixed for you). For my part, I'm not aware of using any foreign keys at all (see my previous comment for how I figured this out). So, this bug here seems to be at least not exclusively about foreign keys. I now understand your idea: try to find some similar pulled-in code for other operations... It's difficult to find that, though, without knowing specifically what exactly we're searching for...
I can confirm, that the problem still exists in 10.2.8:
Sep 19 08:34:50 dbm-ace-live-data-03 mysqld: 2017-09-19 8:34:50 140484532815616 [Warning] WSREP: Failed to report last committed 10768816, -4 (Interrupted system call) |
Sep 19 08:34:54 dbm-ace-live-data-03 mysqld: 2017-09-19 8:34:54 140484532815616 [Warning] WSREP: Failed to report last committed 10768818, -4 (Interrupted system call) |
Sep 19 08:35:01 dbm-ace-live-data-03 CRON[1269]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) |
Sep 19 08:35:02 dbm-ace-live-data-03 mysqld: 2017-09-19 8:35:02 140484532815616 [Warning] WSREP: Failed to report last committed 10768925, -4 (Interrupted system call) |
Sep 19 08:35:51 dbm-ace-live-data-03 mysqld: WSREP: BF lock wait long |
Sep 19 08:36:00 dbm-ace-live-data-03 mysqld: message repeated 9 times: [ WSREP: BF lock wait long] |
Sep 19 08:36:01 dbm-ace-live-data-03 CRON[1361]: (root) CMD (/bin/chmod -R a+r /var/log >/dev/null 2>&1) |
Sep 19 08:36:01 dbm-ace-live-data-03 mysqld: WSREP: BF lock wait long |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: message repeated 5 times: [ WSREP: BF lock wait long] |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: ===================================== |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: 2017-09-19 08:36:06 0x7fc3dc748700 INNODB MONITOR OUTPUT |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: ===================================== |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: Per second averages calculated from the last 50 seconds |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: ----------------- |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: BACKGROUND THREAD |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: ----------------- |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: srv_master_thread loops: 333435 srv_active, 0 srv_shutdown, 144764 srv_idle |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: srv_master_thread log flush and writes: 478163 |
Sep 19 08:36:06 dbm-ace-live-data-03 mysqld: ---------- |
So, is it an option to downgrade to 10.1?
So "good news" for me..
since version 10.2.9 we have the cluster running without any issues for >5 days.
mysql --version
mysql Ver 15.1 Distrib 10.2.9-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2

systemctl status mysql
● mariadb.service - MariaDB database server
   Loaded: loaded (/lib/systemd/system/mariadb.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/mariadb.service.d
           └─migrated-from-my.cnf-settings.conf
   Active: active (running) since Thu 2017-09-28 16:27:02 CEST; 5 days ago
I don't trust myself, but I hope that this version fixed it for me/us.
Any experience with this version from other users?
No good news for me, the problem still exists:
dbm-ace-pre-02:~ # mysql --version
mysql Ver 15.1 Distrib 10.2.9-MariaDB, for debian-linux-gnu (x86_64) using readline 5.2
....
Oct 5 09:45:59 dbm-ace-pre-02 mysqld: WSREP: BF lock wait long
Oct 5 09:45:59 dbm-ace-pre-02 mysqld:
Oct 5 09:45:59 dbm-ace-pre-02 mysqld: =====================================
Oct 5 09:45:59 dbm-ace-pre-02 mysqld: 2017-10-05 09:45:58 0x7fb33bcf5700 INNODB MONITOR OUTPUT
Oct 5 09:45:59 dbm-ace-pre-02 mysqld: =====================================
Oct 5 09:45:59 dbm-ace-pre-02 mysqld: Per second averages calculated from the last 30 seconds
Oct 5 09:45:59 dbm-ace-pre-02 mysqld: -----------------
Oct 5 09:45:59 dbm-ace-pre-02 mysqld: BACKGROUND THREAD
My setup consists of three data nodes with two MaxScale proxies in front of them and one hardware load balancer as the entry point.
I can very easily reproduce the problem when I switch the two MaxScale instances to "Galera mode", i.e. writes to all nodes allowed.
Hm. Right, still unsolved with the most recent version. Sigh. I have to admit I had faintly hoped I might return from my vacation and this bug would finally be solved. What I don't understand is that nobody on the core dev team seems to even be interested in this. Don't get me wrong - as a developer, I know that there are difficult bugs to solve that take a lot of time to figure out etc. But there is not even a reaction, nothing. I think this is a huge bug that in fact renders Galera completely useless. If the database locks up once per day (some even report it happening every few minutes) and requires manual intervention, it is - let's be honest - simply not usable on a production server. And I don't need Galera on a dev system. So, you cannot use it at all. But the integrated Galera was one of the main selling points of using MariaDB over MySQL. At least it was for us. For a long time I thought: "OK, it's a beta version. Just wait. They'll certainly fix this in the future". Now, MariaDB 10.2.x is even considered "stable" AFAIK. Nothing could be further from the truth.
You took the words right out of my mouth. The reason we gave MariaDB (Galera) a shot is the write scalability. We tested it with one of our key systems and ran into this bug. The combination with MaxScale appeared so promising. Now we are moving to our Oracle RAC system.
Hi Everyone!
I am assigned to this bug, but I have been very busy with other bugs. I will look into this bug this week and hopefully will come up with a workaround / patch.
- Sachin
I was able to reproduce the problem with the script below, which downloads and unpacks 10.2.9, configures 3 nodes on ports 3307, 3308 and 3309, and puts some concurrent write load on them:
set -e
ver=${1:-10.2.9}

# just use current directory if called from framework
if [ ! -f common.sh ] ; then
    [ -d mariadb-environs ] || git clone http://github.com/AndriiNikitin/mariadb-environs
    cd mariadb-environs
    ./get_plugin.sh galera
fi

_template/plant_cluster.sh cluster1
echo m1 > cluster1/nodes.lst
echo m2 >> cluster1/nodes.lst
echo m3 >> cluster1/nodes.lst
cluster1/replant.sh ${ver}

./build_or_download.sh m1

cluster1/gen_cnf.sh innodb_buffer_pool_size=1G \
    innodb_log_file_size=256M \
    innodb_flush_log_at_trx_commit=2 \
    innodb_lock_wait_timeout=10 \
    transaction-isolation=READ-COMMITTED \
    innodb_stats_persistent=0

cluster1/install_db.sh
# cluster1/galera_setup_acl.sh # this needed for mysqldump
cluster1/galera_start_new.sh wsrep_sst_method=rsync

sleep 10
cluster1/galera_cluster_size.sh

for i in {1..5}; do
    m1*/sql.sh create table t$i select seq as a, seq as b from seq_1_to_10000
    m1*/sql.sh "alter table t$i add unique index(a)"
done

trap 'kill $(jobs -p)' EXIT

for m in {1..3}; do
    for i in {1..5}; do
        export I=$i
        export M=$m
        ( while : ; do m$M*/sql.sh "start transaction; update t$I set b=b+1 where a=$I; delete from t$I where a=2*$I; do sleep(10); commit" 2>/dev/null || : ; done; )&
    done
    ( while : ; do
        for i in {1..30}; do m$M*/sql.sh "delete from t$i where a=floor(rand()*10000);" &>/dev/null ; done
    done ) &
    ( while : ; do
        for i in {1..30}; do m$M*/sql.sh "delete from t$i where a=floor(rand()*10000);" &>/dev/null ; done
    done ) &
done

while : ; do
    m1*/sql.sh show processlist
    m2*/sql.sh show processlist
    m3*/sql.sh show processlist
    grep 'BF lock wait long' m1*/dt/error.log | wc -l
    grep 'BF lock wait long' m2*/dt/error.log | wc -l
    grep 'BF lock wait long' m3*/dt/error.log | wc -l
    sleep 10
done
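For what it's worth, the version under test is the script's first argument (defaulting to 10.2.9 as above), so a run could look like this, assuming the script is saved as repro_bf_lock.sh (a hypothetical name):
bash repro_bf_lock.sh           # uses the default 10.2.9
bash repro_bf_lock.sh 10.2.10   # or pass another version to download and test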
The output of the write SQL commands is suppressed; only the processlist for each node is displayed, followed by the number of "BF lock wait long" occurrences in the error log.
After a few minutes I can see problems in the Galera committer threads:
0
|
0
|
0
|
2 system user NULL Sleep 99 wsrep aborter idle NULL 0.000
|
1 system user NULL Sleep 10 committed 25 NULL 0.000
|
3 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
43 root localhost test Query 9 User sleep do sleep(10) 0.000
|
44 root localhost test Query 9 User sleep do sleep(10) 0.000
|
45 root localhost test Query 9 User sleep do sleep(10) 0.000
|
46 root localhost test Query 9 User sleep do sleep(10) 0.000
|
47 root localhost test Query 9 updating delete from t4 where a=floor(rand()*10000) 0.000
|
48 root localhost test Query 9 updating delete from t4 where a=floor(rand()*10000) 0.000
|
49 root localhost test Query 9 User sleep do sleep(10) 0.000
|
50 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 97 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 9 committed 30 NULL 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
27 root localhost test Query 9 User sleep do sleep(10) 0.000
|
29 root localhost test Query 9 User sleep do sleep(10) 0.000
|
30 root localhost test Query 9 User sleep do sleep(10) 0.000
|
31 root localhost test Query 9 User sleep do sleep(10) 0.000
|
32 root localhost test Query 9 User sleep do sleep(10) 0.000
|
33 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 89 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 20 Update_rows_log_event::find_row(13) update t1 set b=b+1 where a=1 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
12 root localhost test Query 20 init commit 0.000
|
14 root localhost test Query 20 init commit 0.000
|
15 root localhost test Query 30 query end delete from t1 where a=floor(rand()*10000) 0.000
|
20 root localhost test Query 15 init commit 0.000
|
21 root localhost test Query 10 init commit 0.000
|
25 root localhost test Query 9 updating update t1 set b=b+1 where a=1 0.000
|
26 root localhost test Query 0 init show processlist 0.000
|
0
|
0
|
10
|
2 system user NULL Sleep 109 wsrep aborter idle NULL 0.000
|
1 system user NULL Sleep 9 Update_rows_log_event::find_row(35) update t4 set b=b+1 where a=4 0.000
|
3 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
47 root localhost test Query 19 query end delete from t4 where a=floor(rand()*10000) 0.000
|
49 root localhost test Query 9 init commit 0.000
|
51 root localhost test Query 9 User sleep do sleep(10) 0.000
|
52 root localhost test Query 9 User sleep do sleep(10) 0.000
|
53 root localhost test Query 9 User sleep do sleep(10) 0.000
|
54 root localhost test Query 9 updating update t4 set b=b+1 where a=4 0.000
|
55 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 107 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 9 committed 40 NULL 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
34 root localhost test Query 9 User sleep do sleep(10) 0.000
|
35 root localhost test Query 9 User sleep do sleep(10) 0.000
|
36 root localhost test Query 9 User sleep do sleep(10) 0.000
|
37 root localhost test Query 9 User sleep do sleep(10) 0.000
|
38 root localhost test Query 9 User sleep do sleep(10) 0.000
|
39 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 99 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 30 Update_rows_log_event::find_row(13) update t1 set b=b+1 where a=1 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
12 root localhost test Query 30 init commit 0.000
|
14 root localhost test Query 30 init commit 0.000
|
15 root localhost test Query 40 query end delete from t1 where a=floor(rand()*10000) 0.000
|
20 root localhost test Query 25 init commit 0.000
|
21 root localhost test Query 20 init commit 0.000
|
27 root localhost test Query 8 updating update t1 set b=b+1 where a=1 0.000
|
28 root localhost test Query 0 init show processlist 0.000
|
0
|
0
|
20
|
2 system user NULL Sleep 119 wsrep aborter idle NULL 0.000
|
1 system user NULL Sleep 19 Update_rows_log_event::find_row(35) update t4 set b=b+1 where a=4 0.000
|
3 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
47 root localhost test Query 29 query end delete from t4 where a=floor(rand()*10000) 0.000
|
49 root localhost test Query 19 init commit 0.000
|
51 root localhost test Query 9 init commit 0.000
|
52 root localhost test Query 9 init commit 0.000
|
53 root localhost test Query 9 init commit 0.000
|
56 root localhost test Query 8 updating update t4 set b=b+1 where a=4 0.000
|
57 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 117 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 19 committed 40 NULL 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
40 root localhost test Query 9 User sleep do sleep(10) 0.000
|
41 root localhost test Query 9 User sleep do sleep(10) 0.000
|
42 root localhost test Query 9 User sleep do sleep(10) 0.000
|
43 root localhost test Query 9 User sleep do sleep(10) 0.000
|
44 root localhost test Query 9 User sleep do sleep(10) 0.000
|
45 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 109 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 40 Update_rows_log_event::find_row(13) update t1 set b=b+1 where a=1 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
12 root localhost test Query 40 init commit 0.000
|
14 root localhost test Query 40 init commit 0.000
|
15 root localhost test Query 50 query end delete from t1 where a=floor(rand()*10000) 0.000
|
20 root localhost test Query 35 init commit 0.000
|
21 root localhost test Query 30 init commit 0.000
|
29 root localhost test Query 7 updating update t1 set b=b+1 where a=1 0.000
|
30 root localhost test Query 0 init show processlist 0.000
|
9
|
0
|
30
|
2 system user NULL Sleep 129 wsrep aborter idle NULL 0.000
|
1 system user NULL Sleep 29 Update_rows_log_event::find_row(35) update t4 set b=b+1 where a=4 0.000
|
3 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
47 root localhost test Query 39 query end delete from t4 where a=floor(rand()*10000) 0.000
|
49 root localhost test Query 29 init commit 0.000
|
51 root localhost test Query 20 init commit 0.000
|
52 root localhost test Query 19 init commit 0.000
|
53 root localhost test Query 19 init commit 0.000
|
58 root localhost test Query 7 updating update t4 set b=b+1 where a=4 0.000
|
59 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 127 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 29 committed 40 NULL 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
40 root localhost test Query 9 init commit 0.000
|
41 root localhost test Query 9 init commit 0.000
|
42 root localhost test Query 9 init commit 0.000
|
43 root localhost test Query 9 init commit 0.000
|
44 root localhost test Query 9 init commit 0.000
|
46 root localhost test Query 0 init show processlist 0.000
|
1 system user NULL Sleep 119 wsrep aborter idle NULL 0.000
|
2 system user NULL Sleep 50 Update_rows_log_event::find_row(13) update t1 set b=b+1 where a=1 0.000
|
3 system user NULL Daemon NULL InnoDB purge coordinator NULL 0.000
|
4 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
5 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
6 system user NULL Daemon NULL InnoDB purge worker NULL 0.000
|
7 system user NULL Daemon NULL InnoDB shutdown handler NULL 0.000
|
12 root localhost test Query 50 init commit 0.000
|
14 root localhost test Query 50 init commit 0.000
|
15 root localhost test Query 60 query end delete from t1 where a=floor(rand()*10000) 0.000
|
20 root localhost test Query 45 init commit 0.000
|
21 root localhost test Query 40 init commit 0.000
|
31 root localhost test Query 6 updating update t1 set b=b+1 where a=1 0.000
|
32 root localhost test Query 0 init show processlist 0.000
|
19
|
0
|
40
|
sachin.setiya.007, let me know if this script is OK or if I should work on an mtr case with similar load.
Hi anikitin!
I tried to run this script on jessie (buildbot). One hour run, no "BF lock wait long". The thing is, the script is not deterministic (and I think the bug is too). If we could have some mtr case which simulates it for sure, that would be great.
Hi Sachin,
The below scenario will help you in reproducing the bug.
1. On a single node, updates for a particular ID are fired continuously.
2. On another node, I set a different value using the same WHERE clause.
I have a table dummy, where I generated concurrent updates on node 1 and node 2 at the same time.
*Node 1* update dummy set status=100 where id='716071007'; // This runs with 500 concurrency and 100 iterations
*Node 2* update dummy set status=101 where id=716071007; // This runs with 500 concurrency and 100 iterations
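For reference, one way to drive these updates concurrently from the shell (a rough sketch; host name and credentials are placeholders - run the mirror-image command with status=101 against node 2 at the same time):
# fire 500 parallel clients, each issuing 100 updates against node 1
for c in $(seq 1 500); do
    ( for i in $(seq 1 100); do
          mysql -h node1 -u sbtest -p'secret' test -e "UPDATE dummy SET status=100 WHERE id='716071007'"
      done ) &
done
wait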
GC should have thrown a deadlock error, but instead it kept accepting the queries.
On node 2, when the query from node 1 replicates and a direct UPDATE hits at the same time, it throws the error below.
WSREP: BF lock wait long
WSREP: BF lock wait long
This makes the cluster stall, and we need to forcefully kill it to bring it back.
NOTE: Tested in 10.2.8 and 10.2.9, which produce "BF lock wait long", but not in 10.1.26.
Steps to reproduce:
CREATE TABLE `dummy` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  service varchar(20),
  status int(11),
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=716071002 DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (0.43 sec)

MariaDB [test]> insert into dummy (service,status) values ('MySQL',1),('MSSQL',2),('MONGODB',3);
Query OK, 3 rows affected (0.13 sec)
Records: 3 Duplicates: 0 Warnings: 0
*Generate concurrent updates*
*Node 1* update dummy set status=100 where id='716071007';
*Node 2* update dummy set status=101 where id=716071007;
Process list:
------------------------------------------------------------------------------------------------------------------------------------------
Id | User | Host | db | Command | Time | State | Info | Progress |
------------------------------------------------------------------------------------------------------------------------------------------
1 | system user | NULL | Sleep | 4 | Update_rows_log_event::find_row(718) | update dummy set status=100 where id='716071007' | 0.000 | |
2 | system user | NULL | Sleep | 501 | wsrep aborter idle | NULL | 0.000 | |
6 | system user | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 | |
3 | system user | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | |
4 | system user | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | |
5 | system user | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | |
7 | system user | NULL | Daemon | NULL | InnoDB shutdown handler | NULL | 0.000 | |
11 | monitor | localhost:44902 | NULL | Sleep | 0 | NULL | 0.000 | |
17 | root | localhost | test | Query | 0 | init | show processlist | 0.000 |
22 | sbtest | localhost:46666 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
23 | sbtest | localhost:46696 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
24 | sbtest | localhost:46850 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
25 | sbtest | localhost:46854 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
26 | sbtest | localhost:46856 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
36 | sbtest | localhost:47108 | test | Query | 4 | updating | update dummy set status=100 where id='716071007' | 0.000 |
38 | sbtest | localhost:47144 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
39 | sbtest | localhost:47182 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
37 | sbtest | localhost:47140 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
40 | sbtest | localhost:47186 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
42 | sbtest | localhost:47222 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
41 | sbtest | localhost:47202 | test | Query | 4 | updating | update dummy set status=100 where id='716071007' | 0.000 |
43 | sbtest | localhost:47232 | test | Query | 4 | updating | update dummy set status=100 where id='716071007' | 0.000 |
27 | sbtest | localhost:46892 | test | Query | 4 | query end | update dummy set status=100 where id='716071007' | 0.000 |
30 | sbtest | localhost:47042 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
31 | sbtest | localhost:47048 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
32 | sbtest | localhost:47062 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
33 | sbtest | localhost:47066 | test | Query | 4 | updating | update dummy set status=101 where id=716071007 | 0.000 |
Finally, the cluster stalls until we kill the mysqld process:
+----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
|
| Id | User | Host | db | Command | Time | State | Info | Progress |
|
+----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
|
| 1 | system user | | NULL | Sleep | 144 | Update_rows_log_event::find_row(718) | update dummy set status=100 where id='716071007' | 0.000 | |
| 2 | system user | | NULL | Sleep | 641 | wsrep aborter idle | NULL | 0.000 | |
| 6 | system user | | NULL | Daemon | NULL | InnoDB purge coordinator | NULL | 0.000 | |
| 3 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | |
| 4 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | |
| 5 | system user | | NULL | Daemon | NULL | InnoDB purge worker | NULL | 0.000 | |
| 7 | system user | | NULL | Daemon | NULL | InnoDB shutdown handler | NULL | 0.000 | |
| 11 | monitor | localhost:44902 | NULL | Sleep | 0 | | NULL | 0.000 | |
| 17 | root | localhost | test | Query | 0 | init | show processlist | 0.000 | |
| 27 | sbtest | localhost:46892 | test | Query | 144 | query end | update dummy set status=100 where id='716071007' | 0.000 | |
+----+-------------+-----------------+------+---------+------+--------------------------------------+--------------------------------------------------+----------+
|
|
Hope this will help you in addressing the issue.
-Sukan
DBA, Mafiree
@sachin.setiya.007 Are you able to reproduce the problem now? I can make it happen on 10.2.7 within a few minutes by running Jira with the loadbalancer JDBC connection across two nodes. This happens with very little load almost right away.
Since we have no workaround and no way of avoiding the bug, there needs to be some recommendation to avoid Galera entirely with MariaDB 10.2.x. Clearly this isn't ready for production. There are also other important issues to watch out for since 10.2.9 will not start with Galera because of this bug.
Is there some way we can get a higher level of attention for Galera issues, or more automated testing in this area? I think this could be MariaDB's key advantage over the competition if this stuff was just really solid.
Sorry, I was travelling, but I will try to do it now. I managed to hit the bug with anikitin's script, so I am writing an mtr test and working on a fix.
This bug also affects server version 10.2.6-MariaDB-10.2.6+maria~xenial-log (mariadb.org binary distribution) in my production environment. It's driving me crazy; I'm forced to restart the full cluster each time.
I have lots of client processes updating a table, sometimes with the same id on different nodes, like:
UPDATE foo SET foo.bar = 1 or 0 WHERE id=xxx;
very similar to Sukan's test case.
Any progress on this bug?
Or any temporary workaround for it?
Hi Tan,
I would suggest you run the cluster as a single node, i.e. point all the write traffic to a single node. This would be a temporary workaround without making any big changes.
Alternatively, you can downgrade the cluster completely to a 10.1.x version, where we ran the test and didn't hit this bug.
Thanks,
Sukan
DBA, Mafiree
THANK YOU!!!!!! Sukan, thank you for your advice.
I'm trying to use MaxScale's ReadWriteSplit module to route all writes to a single node; I hope my life will be easier with this.
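For anyone attempting the same, a minimal single-writer MaxScale setup might look roughly like this (a sketch only; addresses, credentials and ports are placeholders, and key/protocol names vary between MaxScale versions - older releases use MySQLClient/MySQLBackend and passwd):
[node1]
type=server
address=192.168.0.1
port=3306
protocol=MariaDBBackend

[node2]
type=server
address=192.168.0.2
port=3306
protocol=MariaDBBackend

[Galera-Monitor]
# galeramon designates one synced node as master, so readwritesplit sends all writes to it
type=monitor
module=galeramon
servers=node1,node2
user=maxscale
password=maxscale_pw
monitor_interval=2000

[RW-Split-Service]
type=service
router=readwritesplit
servers=node1,node2
user=maxscale
password=maxscale_pw

[RW-Split-Listener]
type=listener
service=RW-Split-Service
protocol=MariaDBClient
port=4006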
I hit the error 'BF-BF X lock conflict' even though I am only writing to one node.
2017-11-09 4:18:49 50152503296 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2017-11-09 10:30:37 50153508864 [Note] InnoDB: BF-BF X lock conflict,mode: 1027 supremum: 0
2017-11-09 10:30:37 50153508864 [Note] InnoDB: conflicts states: my 0 locked 0
RECORD LOCKS space id 316718 page no 3049 n bits 240 index PRIMARY of table `
Is there any update on this ticket working towards a fix? There has been a distinct lack of feedback about this issue. Not that I expect a fix right away, but let us know what's happening, and maybe give a better understanding of what the issue is.
We have been experiencing this problem even though we are sending all of the traffic to one node only (out of 3). Our application seems to lock up about once a month or so.
Mark
Hi,
I have been assigned to this issue, but for around 10 days I was on vacation. Currently I am occupied with 10715 and 10177. I will look into this issue after those are done. This issue also requires somewhat more understanding, because I am not familiar with the InnoDB code.
Regards
sachin
Hi Sachin,
I know you have your own set of priorities based on your customers, but I'd like to encourage you to look at this bug with higher priority, particularly when compared to building a new feature for 10.3 (MDEV-10177). This bug:
- completely crashes MariaDB
- has no workaround
- affects anyone who is using Galera, even if they disable writing to multiple nodes
In short, this bug is completely catastrophic for all production use of MariaDB if Galera is enabled. I think MariaDB management should consider pulling the entire 10.2 release from production-ready status and think about how users might roll back to 10.1 if this bug can't be fixed in the next minor 10.2.11 release. This should be on the front page of your site or download page.
The mtr test case is in MDEV-14401, according to the instructions from https://jira.mariadb.org/browse/MDEV-12837?focusedCommentId=101543&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-101543
Reassigned to jplindst because seppo has found that:
My debugging session from today shows that the potential smoking gun is
the function lock_grant_and_move_on_page() in lock0lock.cc. This function does not
honor the locking priority policy, and leaves a high-priority lock in the wait queue
while favoring a normal-priority lock.
When using VATS we favor older transactions; this is bad for Galera, where we should favor BF transactions.
Please try to find a solution that does not involve adding fields to trx_t. I cannot approve the addition of trx_t::trx_mutex_taken.
There are many Galera test failures on bb-10.2-MDEV-12837. Can you please rebase the fix on the latest 10.2, and then post a detailed analysis of all test failures?
The change looks much cleaner now, but I’d like to have one more review round.
I would like to see a revised test push to buildbot that contains only this change, based on the latest 10.2. The current bb-10.2-MDEV-12837 contains some commits related to MDEV-14401. It seems that MDEV-14401 actually depends on this MDEV-12837. It looks like the changes to lock_rec_lock_slow() and lock_table() in the initial MDEV-14401 patch have been made redundant by the MDEV-12837 change to innobase_init().
Split the commits into two parts:
https://github.com/MariaDB/server/commit/1ffd8b30f9f86aedd42c91f72db9d54f2f8576d7
The test case is in:
https://github.com/MariaDB/server/commit/7f555777b36983d4b4590df6634aa22170e90bc3
because it requires both fixes.
I sent some review comments to the first commit. I still think that we must convert the algorithm at InnoDB startup, not while InnoDB is already open for SQL.
Where is this running on Buildbot?
I now see that bb-10.2-galera consists of two commits on top of 10.2: MDEV-12837 and MDEV-14401.
If the MDEV-12837 fix were working as requested (adjusting the configuration at InnoDB startup), then I think that all the configuration changes
innodb-lock-schedule-algorithm=fcfs
in the MDEV-14401 fix should be unnecessary and should be removed. The default VATS should be fine. We should not even need to suppress any warning; at startup, InnoDB could issue an informational note (not warning) that the incompatible parameter was adjusted.
One more question: Why does the MDEV-14401 commit disable the following occasionally failing tests:
diff --git a/mysql-test/suite/galera/disabled.def b/mysql-test/suite/galera/disabled.def
index 423c83208f8..801b12a1f84 100644
--- a/mysql-test/suite/galera/disabled.def
+++ b/mysql-test/suite/galera/disabled.def
@@ -56,3 +56,8 @@ MW-284 : MDEV-13549 Galera test failures 10.1
 galera_as_slave : MDEV-13549 Galera test failures 10.1
 galera_var_innodb_disallow_writes : MDEV-10949
 galera_kill_applier : race condition at the start of the test
+MW-328C: MDEV-13549 Galera test failures 10.1
+MW-328A: MDEV-13549 Galera test failures 10.1
+MW-328B: MDEV-13549 Galera test failures 10.1
+MW-328: MDEV-13549 Galera test failures 10.1
+galera_suspend_slave: MDEV-13549 Galera test failures 10.1
\ No newline at end of file
The commit message does not explain this.
I think that if it is considered necessary to disable these tests, they should be disabled in a separate commit, with a reference to separate tasks, such as MDEV-14149 or MDEV-13876 or even MDEV-13549.
commit da3a3a68df34c7fef387ce890d3925166edeef2c
Author: Jan Lindström <jan.lindstrom@mariadb.com>
Date: Thu Dec 7 12:26:29 2017 +0200
MDEV-12837: WSREP: BF lock wait long
Problem was a merge error from MySQL wsrep i.e. Galera.
wsrep_on_check
New check function. Galera can't be enabled
if innodb-lock-schedule-algorithm=VATS.
innobase_kill_query
In Galera async kill we could own lock mutex.
innobase_init
If the Variance-Aware-Transaction-Scheduling algorithm (VATS) is
used with Galera, we fall back to First-Come-First-Served (FCFS)
with a notice to the user.
Changed innodb-lock-schedule-algorithm to a read-only parameter,
as it was designed to be.
lock_reset_lock_and_trx_wait
Use ib::hex() to print out transaction ID.
lock_rec_other_has_expl_req,
lock_rec_other_has_conflicting,
RecLock::add_to_waitq
lock_rec_lock_slow
lock_table_other_has_incompatible
lock_rec_insert_check_and_lock
lock_prdt_other_has_conflicting
Changed the pointer to the conflicting lock to a normal pointer, as
this pointer's contents could be changed later.
RecLock::create
The conflicting lock pointer is moved to the last parameter, with a
default value of NULL. This conflicting transaction could be selected
as a victim in Galera if the requesting transaction is a BF (brute
force) transaction; in that case the contents of the conflicting lock
pointer will be changed. Use ib::hex() to print transaction ids.
greenman Can you document that innodb-lock-schedule-algorithm is a read-only variable (i.e. it can't be changed dynamically) and that the VATS method is not supported with Galera.
Here is what the documentation should mention:
- In 10.1 the default is FCFS, and InnoDB will refuse to start if VATS is used with Galera.
- In >=10.2 the default is VATS, and under Galera it will be changed to FCFS.
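As a hedged sketch of what users should then observe on a >=10.2 Galera node with this change (the exact error text may differ):
-- The parameter is forced back to FCFS at startup, regardless of the configuration file:
SHOW GLOBAL VARIABLES LIKE 'innodb_lock_schedule_algorithm';
-- And since it is now a read-only parameter, runtime changes are rejected, e.g.:
SET GLOBAL innodb_lock_schedule_algorithm = 'FCFS';
-- ERROR 1238 (HY000): Variable 'innodb_lock_schedule_algorithm' is a read only variable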
Wowwwww. Incredible. After many months of suffering, finally a MariaDB version that really deserves the label "stable"... perhaps before Christmas? Can it be true? Don't tell me I'm dreaming. Now we're all waiting anxiously for the binaries to arrive in the repos... Thanks to everyone involved and, especially, to Jan for solving this.
@jplindst Was there supposed to be a commit to the 10.2 branch as well? I could only see one to the 10.1 branch (where I think the bug wasn't manifesting) and to 10.3.
https://github.com/MariaDB/server/search?q=MDEV-12837&type=Commits&utf8=%E2%9C%93
Are you able to confirm this will land in 10.2.12? That will help us plan our testing for being able to use galera cluster in production.
No, there wasn't supposed to be a separate 10.2 commit; every 10.1 commit will inevitably land in 10.2 eventually.
Normally this happens at least before the next 10.2 release, so this commit will be in 10.2.12, just as the "Fix Version/s" field at the top of the page says.
Hi, there seems to be some information loss here: commit da3a3a68df34c7fef387ce890d3925166edeef2c is for 10.2 (10.1 had no merge error, and FCFS is the default there, so the change is slightly different).
I'd suggest adding a note (like the one below 10.1.22) on:
https://downloads.mariadb.org/mariadb/+releases/
Perhaps something like: "Updating from 10.2.11 to 10.2.12 is highly recommended for Galera users because previous versions are unstable".
In addition, IMHO, this should also be mentioned in the release notes. The term "stable" on the above-mentioned page is really misleading for older versions, because we all know they will probably crash sooner or later due to this bug, even with very ordinary use patterns.
rgpublic, great work. I suggest the description in https://mariadb.com/kb/en/library/mariadb-10130-release-notes/ could be meaningfully expanded (or replaced). This is a page that humble users like you and me can edit, and it probably serves as a good enough place to reach people who are looking to upgrade.
Ah, really cool. Thanks, Daniel. I didn't know that everyone can change these pages. I've added a warning box and a reference to this bug here:
https://mariadb.com/kb/en/library/mariadb-10212-release-notes/
For 10.1.x I think the already existing line under "Notable changes" should be sufficient, as the bug doesn't seem to manifest there, so there is no need to cause panic.
It looks like it still happens ....
innodb_version 5.7.20
protocol_version 10
slave_type_conversions
version 10.2.12-MariaDB
version_comment MariaDB Server
version_compile_machine x86_64
version_compile_os Linux
version_malloc_library system
version_ssl_library OpenSSL 1.0.1e-fips 11 Feb 2013
wsrep_patch_version wsrep_25.21
What can I provide to help troubleshoot this? It looks like the issue is triggered by a truncate table...
LOCK WAIT
MySQL thread id 138641, OS thread handle 139936878089984, query id 139638795 ........ Opening tables
TRUNCATE TABLE `ds$etl_dimension_discovery_local_standard_1`
2018-01-24 11:51:38 139938787669760 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1a3598f query: TRUNCATE TABLE `ds$etl_dimension_discovery_local_standard_1`
TRANSACTION 27482511, ACTIVE 1997 sec
As soon as this kicks in, everything locks. I cannot reproduce it in a consistent way... Also, when it happens, KILL 138641 says "not owner"... even for root!!! How is that possible?
Can I have the full unedited error log? Setting wsrep_debug=on, waiting until it repeats, and then capturing the error log again would also help.
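For reference, wsrep_debug can be toggled at runtime without a restart (in 10.2 it is a plain ON/OFF switch); a minimal sketch:
-- Enable verbose wsrep logging on the affected node, then wait for the hang to repeat.
SET GLOBAL wsrep_debug = ON;
-- Turn it back off once the error log has been captured; the output is very verbose.
SET GLOBAL wsrep_debug = OFF;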
I'll try... it's hard to say when it happens... hopefully the log will not be gigabytes of data with debug on.
I have 500 MB of logs (uncompressed)... I'll send you an email with a link to download the logs.
Forgot to mention: we are using a 3-node cluster (development database) with HAProxy, with one node dedicated to writes. Nothing interesting is logged on the (mostly) read-only nodes. A stress test is started with the database in the same state; after a while it locks up, and it is always triggered by a DDL statement (TRUNCATE in this case; in the past I've seen these deadlocks with ALTER TABLE ... ADD PARTITION ...). Once the DDL enters the deadlock, no other query can be executed (and there are plenty of threads/connections available); even queries not related to that table lock up. Same with ALTER TABLE ... ADD PARTITION.
The first reference to a deadlock starts at: 2018-01-24 15:39:09 139784560858880 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1e9eb4f query
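As a sketch, the kind of DDL that triggers it in this stress test looks like the following (the TRUNCATE is taken from the log above; the ALTER uses a hypothetical table assumed to be partitioned BY RANGE):
-- TRUNCATE as captured in the error log:
TRUNCATE TABLE `ds$etl_dimension_discovery_local_standard_1`;
-- ALTER of the kind mentioned (illustrative table and partition values only):
ALTER TABLE metrics ADD PARTITION (PARTITION p2018_02 VALUES LESS THAN (TO_DAYS('2018-03-01')));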
I am also still seeing this issue with 10.2.12, although the behaviour has changed somewhat compared to 10.2.9 (the previous version we were running).
Now with 10.2.12, it seems to only happen on one master at a time (not multiple masters simultaneously). Also, the query that gets regularly printed to the logs is always a DDL statement now, which it was not before.
The other noteworthy change with 10.2.12 is that the log output is much terser. Instead of a complete dump of SHOW ENGINE INNODB STATUS; every second, we now just see 5 lines per second, like so:
Jan 18 02:53:12 tm2 mysqld[16437]: 2018-01-18 2:53:12 140246917351168 [Note] InnoDB: WSREP: BF lock wait long for trx:0x76a45d4 query: CREATE TABLE IF NOT EXISTS ...
Jan 18 02:53:12 tm2 mysqld[16437]: TRANSACTION 124405204, ACTIVE 269206 sec
Jan 18 02:53:12 tm2 mysqld[16437]: LOCK WAIT
Jan 18 02:53:12 tm2 mysqld[16437]: MySQL thread id 16, OS thread handle 140246857389824, query id 27596013 creating table
Jan 18 02:53:12 tm2 mysqld[16437]: CREATE TABLE IF NOT EXISTS ...
adrian.tarau, please create a new ticket. You can refer to this one in the description if you'd like, or copy-paste from it.
Updated the correct releases where the actual fix is:
commit 57cde8ccd19675dc98e3cbacf0ef5c72cb188e49 (HEAD -> 10.1, origin/10.1)
Author: Jan Lindström <jan.lindstrom@mariadb.com>
Date: Wed Jul 25 08:21:25 2018 +0300
MDEV-15822: WSREP: BF lock wait long for trx
In Galera, BF (brute force) transactions may not wait for lock requests,
and normally a BF transaction would select the transaction holding the
conflicting locks as a victim for rollback. However, the background
statistics calculation transaction is an InnoDB-internal transaction and
has no thd, i.e. it cannot be selected as a victim. If the background
statistics calculation transaction holds conflicting locks on the
statistics tables, it will cause the BF lock wait long error message.
The correct way to handle background statistics calculation is to
acquire a thd for the transaction, but that change is too big for
GA releases, and there are other reported problems with background
statistics calculation. This fix avoids adding a table to background
statistics calculation if
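For context (not part of the fix itself), the background statistics calculation referred to above is the InnoDB persistent statistics machinery; a sketch of how to inspect it, or limit it per table as a possible mitigation (the table name is taken from the earlier test case and is only illustrative):
-- Persistent statistics live in these system tables, which is where the conflicting locks arise:
SELECT * FROM mysql.innodb_table_stats WHERE table_name = 'dummy';
SELECT * FROM mysql.innodb_index_stats WHERE table_name = 'dummy';
-- Automatic background recalculation can be disabled for a table suspected of triggering it:
ALTER TABLE dummy STATS_AUTO_RECALC = 0;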
Hello all.
Since the update from 10.1 to 10.2.6 yesterday, we are facing the same issue, "WSREP: BF lock wait long".
The whole cluster is stuck after that; we can only bring it back if we restart the node reporting "WSREP: BF lock wait long".
Note: we are using Debian 3.16.39-1+deb8u2.