Hi, I'd like to explain my understanding of this issue.
This is Xie Yongmei, from the Alibaba RDS team.
The root cause of this issue is likely the way waiters on the rangelock's pending list are signaled.
The current rangelock design is as follows (a simplified model follows the two lists below):
1) Each write transaction must acquire a rangelock before modifying the index tree (actually the FT, in TokuDB) to prevent concurrent read/write operations on the same index rows.
2) A read-only query acquires a rangelock in the cursor-get callback, for snapshot reads.
3) The process of acquiring a rangelock (in toku_db_get_range_lock):
I. Call toku_db_start_range_lock to get the rangelock (in fact, it has trylock semantics).
- On conflict, it asks the locktree to track the request in its pending list.
II. If the lock is granted or a deadlock is detected, toku_db_start_range_lock just returns.
III. On conflict, toku_db_get_range_lock calls toku_db_wait_range_lock: the requester sleeps on the condition variable defined in its own context.
4) The process of releasing rangelocks when a transaction commits or aborts:
I. Release the rangelocks the transaction held.
II. Retry all rangelock requests waiting on the same locktree.
- For each retry that succeeds, signal the condition variable in that request's context.
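To make this concrete, below is a minimal C++ model of the two processes above, built on std::mutex/std::condition_variable. Everything here (locktree, lock_request, start_range_lock, release_and_retry) is a simplified stand-in for illustration, not the actual TokuDB/PerconaFT code; the wait deliberately sleeps without re-checking the grant state, to model the behavior described next.

    // Minimal model of the acquire/release protocol (hypothetical names,
    // not the real TokuDB/PerconaFT API).
    #include <chrono>
    #include <condition_variable>
    #include <mutex>
    #include <vector>

    enum class state { pending, granted };

    struct lock_request {
        std::condition_variable cv;  // per-request cv in the requester's context
        state st = state::pending;
    };

    struct locktree {
        std::mutex m;                         // centralized mutex (m_info->mutex)
        std::vector<lock_request *> pending;  // centralized waiting list

        // toku_db_start_range_lock analogue: trylock semantics; on conflict,
        // track the request in the pending list so a releaser can retry it.
        bool start_range_lock(lock_request *req, bool conflict) {
            std::lock_guard<std::mutex> g(m);
            if (!conflict) { req->st = state::granted; return true; }
            pending.push_back(req);
            return false;
        }

        // toku_db_wait_range_lock analogue: sleep on the request's own cv.
        // NOTE: it sleeps without re-checking the grant state first -- this
        // models the behavior that makes the lost wakeup possible.
        bool wait_range_lock(lock_request *req, std::chrono::milliseconds timeout) {
            std::unique_lock<std::mutex> g(m);
            req->cv.wait_for(g, timeout);  // blocks even if already granted
            return req->st == state::granted;
        }

        // Release path (commit/abort): retry all pending requests on this
        // locktree and signal each request whose retry succeeds.
        void release_and_retry() {
            std::lock_guard<std::mutex> g(m);
            for (lock_request *req : pending) {
                req->st = state::granted;  // assume the retry succeeds here
                req->cv.notify_one();      // lost if the owner is not waiting yet
            }
            pending.clear();
        }
    };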
The following scenario could happen:
t1: txn1 calls toku_db_start_range_lock, finds a conflict, and has the locktree's pending list track its request.
t2: txn2 commits; it releases the rangelock txn1 was waiting for, retries and acquires the rangelock on txn1's behalf (possible because txn1's request was already tracked in the locktree's pending list at t1), and signals txn1 to proceed.
t3: txn1 calls toku_db_wait_range_lock and goes to sleep on its own condition variable. Unfortunately, it has already missed the signal, so it won't wake up until the timeout expires.
The above example shows that even though no rangelock conflict remains, transaction txn1 keeps waiting for a long time.
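Using the model sketched earlier, the lost wakeup can be reproduced deterministically even without threads; the sequence below follows the t1..t3 timeline (hypothetical code, not the real engine):

    // Deterministic reproduction of the t1..t3 timeline with the model
    // above (single-threaded for clarity).
    #include <iostream>

    int main() {
        locktree lt;
        lock_request req;

        // t1: txn1 finds a conflict; its request is tracked in the pending list.
        bool granted = lt.start_range_lock(&req, /*conflict=*/true);

        // t2: txn2 commits, grants the lock to txn1 and signals its cv --
        // before txn1 has actually started waiting.
        lt.release_and_retry();

        // t3: txn1 now sleeps on its cv; the signal is already gone, so it
        // blocks for the full timeout even though the lock is granted.
        auto begin = std::chrono::steady_clock::now();
        if (!granted)
            granted = lt.wait_range_lock(&req, std::chrono::milliseconds(500));
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                      std::chrono::steady_clock::now() - begin).count();

        // Prints granted=1 waited_ms=~500: no conflict remains, yet txn1
        // waited until the timeout.
        std::cout << "granted=" << granted << " waited_ms=" << ms << "\n";
        return 0;
    }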
The TokuDB rangelock implementation is unusual:
It uses a centralized waiting list (the locktree's pending list) and a centralized mutex; but each rangelock request has its own condition variable, defined in its own context, and sleeps on that.
So the wakeup process is tricky: the transaction releasing the rangelock is responsible for acquiring the rangelock on behalf of the blocked transaction and signaling it to proceed.
A rough workaround is sketched below:
Before sleeping, the waiter should verify, with m_info->mutex held, whether the rangelock conflict still exists.
If the conflict has disappeared, remove the request from the locktree's pending list and return grant; otherwise, sleep on its cv.
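In terms of the earlier model, the workaround amounts to re-checking the grant state under the centralized mutex before sleeping; a sketch of the idea, not the actual patch:

    // Workaround sketch: re-check the grant state with the centralized
    // mutex held before sleeping, so an earlier signal cannot be missed.
    // The predicate overload of wait_for does exactly that, and also
    // guards against spurious wakeups.
    bool wait_range_lock_fixed(locktree &lt, lock_request *req,
                               std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> g(lt.m);
        // If the conflict disappeared between start and wait, the request
        // was already granted (and removed from the pending list in
        // release_and_retry): return immediately instead of sleeping past
        // a signal that has already fired.
        req->cv.wait_for(g, timeout, [&] { return req->st == state::granted; });
        return req->st == state::granted;
    }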
plinux,
Percona suspects MariaDB parallel replication to be the cause of the problem. Could you please review their assessment?