[MDEV-7340] [PATCH] parallel replication status variables Created: 2014-12-18  Updated: 2017-03-09

Status: Open
Project: MariaDB Server
Component/s: Replication
Fix Version/s: None

Type: Task Priority: Major
Reporter: Daniel Black Assignee: Kristian Nielsen
Resolution: Unresolved Votes: 1
Labels: patch

Issue Links:
Duplicate
duplicates MDEV-7202 [PATCH] additional statistics for par... Closed
Relates
relates to MDEV-5296 No status information available for p... Open

 Description   

from pivanof on mailing list in reference to parallel replication:

> some status variables which could be plotted over time and show (or at
> least hint on) whether this is significant bottleneck for performance
> or not.

> This could be something like total time (in both wall time and
> accumulated CPU time) spent executing transactions in parallel, time
> spent rolling back transactions due to this lock conflict, time spent
> rolling back transactions because of other reasons (e.g. due to STOP
> SLAVE or reconnect after master crash), maybe also time spent waiting
> in one parallel thread while transaction is executing in another
> thread, etc.



 Comments   
Comment by Daniel Black [ 2015-01-08 ]

from MDEV-7396 - max consecutive parallel deadlocks is probably useful

Comment by Daniel Black [ 2015-03-19 ]

Any more suggestions here? I may get started on this soon. Note them here and they might be included.

Comment by Daniel Black [ 2015-04-03 ]

slave-domain-parallel-threads reached: number of times this limit is reached? (per domain?)

microsecond_interval_timer is wall clock, right? Any tips on CPU time functions?

Comment by Jean-François Gagné [ 2015-04-30 ]

I would be interested in slave_parallel_workers_reached: number of times all workers were active executing a transaction (if this happens too often, slave_parallel_workers might be increased).

I would also be interested in slave_parallel_workers_waiting: number of times a worker waited for a previous worker to complete (if this happens too often, slave_parallel_workers might be decreased).

I would also be interested in relaylog_group_commits: count of the group commits received from the master (in combination with binlog_group_commits, this allows monitoring the parallelism gained or lost with log-slave-updates: see http://blog.booking.com/better_parallel_replication_for_mysql.html for more details).

More ideas to come later.

Thanks for improving monitoring and tuning-ability.

Comment by Daniel Black [ 2015-04-30 ]

If a relay_commits counter is possible alongside relay_group_commits, we can get an approximate graph of active threads over time. Is slave_parallel_workers_reached still useful then?

s/_worker/_thread/g

Not sure slave_parallel_workers_waiting helps; all but one thread will be waiting at some point. Please correct me if I'm missing something. An indicator of more relay log events waiting in a group but restricted by slave_parallel_threads might be useful (parallel_in_order_group_thread_exhaustion?). On thread utilisation, the original busy vs wait time vs rollback breakdown probably covers that well enough.

Further down the track: a breakdown of which slave_parallel_mode decision each transaction reaches (and how many per relevant category are rolled back).

And then there is monitoring the out-of-order commits from slave_domain_parallel_threads and the limit it imposes; however, I'm thinking of presenting that as an information_schema table, per domain if possible.

Comment by Kristian Nielsen [ 2015-07-02 ]

I made a patch that adds some status variables, measuring time spent by
worker threads being idle, processing events, and waiting for other
transactions:

http://lists.askmonty.org/pipermail/commits/2015-July/008126.html

More details in the commit message in the link. This is not necessarily
meant to be the final form (or any form) of this MDEV-7340, but it might be
interesting, at least. Testing of the patch welcome.

Comment by Daniel Black [ 2015-07-04 ]

Nice, thank you. I was looking to see whether anything other than status_lock could be used but, like you, didn't see an easy approach. The status vars look good.

Comment by Kristian Nielsen [ 2015-07-04 ]

Do you mean LOCK_status?

  statistic_add(*current_status_var,
                new_time - slave_worker_phase_start_time, &LOCK_status);

If my understanding is correct, this statistic_add compiles into an atomic
add operation on platforms of interest. The lock is not actually used,
unless on some weird platform that does not have atomic operations.

EDIT: Actually, it appears that neither lock nor atomic operations are
used. There is a SAFE_STATISTICS define that causes them to be used, but it
is never enabled. So there is no locking apparently, and the statistics can
at least theoretically be off, trading 100% accuracy for improved
performance.

Comment by Daniel Black [ 2015-07-06 ]

thanks for looking this up.

I'm quite happy with this tradeoff.

If you're happy with its final form, any chance of a backport? It applies to 10.0 with minimal fuzz.

Comment by Daniel Black [ 2015-08-09 ]

Any chance of a port to 10.0? I don't particularly care if it changes slightly in the future. Something is better than nothing.

Comment by Daniel Black [ 2017-03-09 ]

Amazingly, this patch still applies (with fuzz level 3) to 10.1 head. I haven't tested its current correctness, however.

Generated at Thu Feb 08 07:18:50 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.