[MDEV-7202] [PATCH] additional statistics for parallel replication - Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending
Created: 2014-11-25  Updated: 2015-04-04  Resolved: 2015-04-04

Status: Closed
Project: MariaDB Server
Component/s: Replication
Fix Version/s: N/A

Type: Task
Priority: Minor
Reporter: Daniel Black
Assignee: Unassigned
Resolution: Duplicate
Votes: 0
Labels: parallelslave

Attachments: File Slave_parallel_eventqueue.patch    
Issue Links:
Duplicate
is duplicated by MDEV-7340 [PATCH] parallel replication status v... Open

 Description   

In MDEV-6680 I thought some additional status would be helpful.

Attached patch adds a total status for all threads for the Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending.

Rather than, or in addition to, totals, would pushing a per-thread status such as slave_parallel_eventqueue_0_size be acceptable?

Anything else useful to capture/graph here?



 Comments   
Comment by Kristian Nielsen [ 2015-02-04 ]

Ok, I (finally) got to look at this patch.

> Attached patch adds a total status for all threads for the
> Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending.

The patch exposes the loc_qev_size and qev_free_pending fields as status
variables. I don't really see how this is useful?

This is completely internal detail to the memory management of the parallel
replication. It is the size of an internal free list of buffers that each
thread keeps for handling event queueing efficiently. Its size does not seem
to mean much for how parallel replication is working?

> Rather/in addition to totals, would push a per thread status as
> slave_parallel_eventqueue_0_size be acceptable?

Do you mean here that there would be N status variables, one for each worker
thread? Maybe there are better places to expose per-thread statistics, like
performance schema or information_schema? But as I said, it seems to me that
these particular values will be more confusing than useful.

> I thought some additional status would be helpful.
> Anything else useful to capture/graph here?

I 100% agree that more monitoring of parallel replication is needed.

With respect to size of event queues, the issue here is that the code does not
update the queue size after every event execution, in order to reduce lock
contention. So the information is not easily available in the current code.

I'm trying to think of a way to get size of pending events without introducing
additional locking overhead.

The SQL driver thread takes LOCK_rpl_thread whenever an event is queued. And
the worker thread takes LOCK_parallel_entry whenever a new event group
(transaction) is started. So maybe we could do something while these locks are
held?

Under LOCK_parallel_entry, a worker thread could update a counter of size of
events processed but not yet freed (in class rpl_parallel_entry). And under
LOCK_rpl_thread, the SQL driver thread could increment per-thread size of
events queued. And somehow the status variable would combine these to obtain
the right value. But it sounds a bit too complicated... would be nice to come
up with a simpler idea to add good monitoring of parallel replication status
(not just queue size).
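
The counter idea above can be sketched roughly as follows. This is a simplified illustration, not the server code: the structs `ParallelEntry` and `WorkerThread`, their fields, and all four functions are hypothetical names, and `std::mutex` stands in for LOCK_parallel_entry and LOCK_rpl_thread. It also uses two monotonic byte counters (queued and processed) rather than tracking "processed but not yet freed" directly, which is an assumption made to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the shared per-domain state.
struct ParallelEntry {
    std::mutex lock_parallel_entry;     // stands in for LOCK_parallel_entry
    std::uint64_t bytes_processed = 0;  // published by workers at group start
};

// Hypothetical stand-in for one worker thread's state.
struct WorkerThread {
    std::mutex lock_rpl_thread;         // stands in for LOCK_rpl_thread
    std::uint64_t bytes_queued = 0;     // updated by the SQL driver thread
    std::uint64_t local_processed = 0;  // worker-private, needs no lock
};

// SQL driver thread: runs where the per-thread lock is taken anyway.
void enqueue_event(WorkerThread &w, std::size_t event_size) {
    std::lock_guard<std::mutex> guard(w.lock_rpl_thread);
    // ... the real event queueing would happen here ...
    w.bytes_queued += event_size;
}

// Worker thread: after each event, only private state is touched.
void event_done(WorkerThread &w, std::size_t event_size) {
    w.local_processed += event_size;
}

// Worker thread: at the start of a new event group the entry lock is
// held anyway, so the private count is published under it.
void start_event_group(ParallelEntry &e, WorkerThread &w) {
    std::lock_guard<std::mutex> guard(e.lock_parallel_entry);
    e.bytes_processed += w.local_processed;
    w.local_processed = 0;
}

// Status reader: combines the counters. The processed total is read
// first so that the later queued snapshots can never be smaller.
std::uint64_t pending_bytes(ParallelEntry &e,
                            const std::vector<WorkerThread *> &workers) {
    std::uint64_t processed;
    {
        std::lock_guard<std::mutex> guard(e.lock_parallel_entry);
        processed = e.bytes_processed;
    }
    std::uint64_t queued = 0;
    for (WorkerThread *w : workers) {
        std::lock_guard<std::mutex> guard(w->lock_rpl_thread);
        queued += w->bytes_queued;
    }
    return queued - processed;
}
```

In this sketch the reported value lags behind reality: events a worker has finished since its last group start are still counted as pending until the next call to `start_event_group`. That is the price of adding no new lock acquisitions on the worker's per-event path.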

In general, I'm unsure how to balance the need for more monitoring against the
overhead of locking/atomics needed to maintain such monitoring. There is
already significant locking overhead in parallel replication, and not much
benchmarking has been done to understand the significance of this overhead.

Comment by Kristian Nielsen [ 2015-02-04 ]

I tried to assign the issue back to user Daniel Black, but that did not seem possible.

Comment by Daniel Black [ 2015-04-03 ]

pivanof suggested much better options in MDEV-7340, so let's close this and continue there.
