[MDEV-10796] tokudb+mariadb server stalled Created: 2016-09-12 Updated: 2018-01-01 Resolved: 2018-01-01 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Storage Engine - TokuDB |
| Affects Version/s: | 10.0.22 |
| Fix Version/s: | 10.1.20, 10.2.4 |
| Type: | Bug | Priority: | Major |
| Reporter: | Manuel Arostegui | Assignee: | Lixun Peng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | mariadb, tokudb, upstream | ||
| Environment: |
Ubuntu, tokudb-7.5.7 |
||
| Description |
|
We have a server running several MariaDB (10.0.22) processes, with tables using the TokuDB engine (tokudb-7.5.7), replicating from a master in a normal master-slave setup. We noticed that one of the processes was lagging a lot and found a query that was stuck. The process was completely stuck (stop slave didn't even work; it just hung trying to kill the replication thread). The only solution was to kill -9 mysqld. After migrating the table to InnoDB, the problem appears to be gone. |
| Comments |
| Comment by Elena Stepanova [ 2016-09-19 ] | ||||||
|
There isn't much information to look at at the moment. If it happens again, please
| ||||||
| Comment by Manuel Arostegui [ 2016-09-20 ] | ||||||
|
Hello Elena, This happened again and I have attached all the information to the Percona bug report: https://bugs.launchpad.net/percona-server/+bug/1621852 Answering some of your specific questions:
Again, once it happened, the only solution was to kill -9 the process, alter the table to InnoDB and let replication flow again. | ||||||
| Comment by Elena Stepanova [ 2016-09-20 ] | ||||||
|
Thanks. | ||||||
| Comment by Manuel Arostegui [ 2016-09-30 ] | ||||||
|
Hello Elena, Can you review what Percona said? https://bugs.launchpad.net/percona-server/+bug/1621852 | ||||||
| Comment by Manuel Arostegui [ 2016-10-14 ] | ||||||
|
Any update on this? Did you have time to look at what Percona said? Thanks | ||||||
| Comment by Elena Stepanova [ 2016-10-14 ] | ||||||
|
plinux, | ||||||
| Comment by Manuel Arostegui [ 2016-10-14 ] | ||||||
|
Note that we do have parallel replication disabled:
MariaDB SANITARIUM localhost (none) > show global variables like 'slave_parallel_mode'; | ||||||
| Comment by Xie Yongmei [ 2017-01-11 ] | ||||||
|
Hi, I will try to explain my understanding of this issue. The root cause might be the way the range lock's waiting list is signalled. The current range lock design is:
3) The process of acquiring a range lock (in toku_db_get_range_lock):
II. If the lock is granted or a deadlock is detected, toku_db_start_range_lock just returns.
4) The process of releasing a range lock () when a transaction commits or aborts:
The following scenario could happen:
The above example shows that there is no range lock conflict, yet transaction txn1 waited for a long time. The implementation of the TokuDB range lock is unusual: So the wakeup process is tricky: the transaction releasing the range lock is responsible for acquiring the range lock on behalf of the blocked transaction and signalling it to proceed. A rough workaround is shown below: | ||||||
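For readers following the analysis above, the wake-up scheme it describes (the releasing transaction retries the pending requests and signals their owners) can be modelled in a few lines. This is only a minimal single-threaded sketch, not TokuDB's locktree code and not the workaround referred to in the comment. The names `RangeLockTable`, `Request`, `acquire` and `release_all` are hypothetical, and a real implementation would block waiters on a condition variable with a timeout rather than return a pending request.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A range lock request on the closed key interval [lo, hi]."""
    txn: str
    lo: int
    hi: int
    granted: bool = False


class RangeLockTable:
    """Toy model: granted locks plus a FIFO list of pending requests."""

    def __init__(self):
        self.granted = []
        self.pending = []

    def _conflicts(self, req):
        # Two closed intervals overlap when each starts before the other ends.
        return any(
            g.txn != req.txn and g.lo <= req.hi and req.lo <= g.hi
            for g in self.granted
        )

    def acquire(self, txn, lo, hi):
        req = Request(txn, lo, hi)
        if self._conflicts(req):
            # A real waiter would sleep here until signalled or timed out.
            self.pending.append(req)
        else:
            req.granted = True
            self.granted.append(req)
        return req

    def release_all(self, txn):
        """Drop txn's locks, then retry every pending request on behalf of
        its owner. Returns the transactions that would be signalled."""
        self.granted = [g for g in self.granted if g.txn != txn]
        woken, still_pending = [], []
        for req in self.pending:
            if self._conflicts(req):
                still_pending.append(req)
            else:
                req.granted = True
                self.granted.append(req)
                woken.append(req.txn)
        self.pending = still_pending
        return woken


table = RangeLockTable()
table.acquire("txn2", 1, 10)          # granted
waiter = table.acquire("txn1", 5, 7)  # overlaps txn2, so it is queued
table.acquire("txn3", 20, 30)         # granted, unrelated range
first = table.release_all("txn3")     # txn1 still conflicts with txn2
second = table.release_all("txn2")    # txn1 is granted and signalled
```

In this model a waiter makes progress only when some release path retries its request. If the retry is skipped or the signal is lost, the waiter keeps sleeping even though nothing conflicts with it any more, which matches the symptom described for txn1.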
| Comment by Manuel Arostegui [ 2017-05-26 ] | ||||||
|
This is an update from Percona at: https://bugs.launchpad.net/bugs/1621852
| ||||||
| Comment by Daniel Black [ 2018-01-01 ] | ||||||
|
TDB-3 is fixed. Merged into MariaDB as https://github.com/MariaDB/server/commit/d145d1b6 Closing might have been overly keen. Please check but I did follow the patches to the above commit. |