  MariaDB Server / MDEV-7340

[PATCH] parallel replication status variables

Details

    Description

      From pivanof on the mailing list, in reference to parallel replication:

      > some status variables which could be plotted over time and show (or at
      > least hint at) whether this is a significant bottleneck for performance
      > or not.

      > This could be something like total time (in both wall time and
      > accumulated CPU time) spent executing transactions in parallel, time
      > spent rolling back transactions due to this lock conflict, time spent
      > rolling back transactions because of other reasons (e.g. due to STOP
      > SLAVE or reconnect after master crash), maybe also time spent waiting
      > in one parallel thread while a transaction is executing in another
      > thread, etc.


          Activity

            danblack Daniel Black added a comment -

            From MDEV-7396: max consecutive parallel deadlocks is probably a useful statistic.

            danblack Daniel Black added a comment -

            Any more suggestions here? I may get started on this soon. Note them here and they might be included.

            danblack Daniel Black added a comment -

            slave-domain-parallel-threads reached - number of times this limit is reached? (per domain?)

            microsecond_interval_timer is wall clock, right? Any tips on CPU time functions?
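
            (For reference: a minimal sketch of a per-thread CPU-time reader using
            POSIX clock_gettime(); the helper name thread_cpu_microseconds is
            hypothetical, not something in the MariaDB source.)

              #include <stdint.h>
              #include <time.h>

              /* CPU time consumed by the calling thread, in microseconds, as
                 opposed to the wall-clock style interval returned by
                 microsecond_interval_timer(). */
              static uint64_t thread_cpu_microseconds(void)
              {
                struct timespec ts;
                if (clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts) != 0)
                  return 0;                     /* clock not supported here */
                return (uint64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
              }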


            jgagne Jean-François Gagné added a comment -

            I would be interested in slave_parallel_workers_reached: number of times all workers were active executing a transaction (if this happens too often, slave_parallel_workers might be increased).

            I would also be interested in slave_parallel_workers_waiting: number of times a worker waited for a previous worker to complete (if this happens too often, slave_parallel_workers might be decreased).

            I would also be interested in relaylog_group_commits: count of the group commits received from masters (in combination with binlog_group_commits, this allows monitoring the parallelism gained or lost with log-slave-updates: see http://blog.booking.com/better_parallel_replication_for_mysql.html for more details).
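
            (Illustrative arithmetic: together with a matching total-commit counter, such as the relay_commits suggested in a later comment, this allows computing the average group size. If the slave received 10000 transactions in 2000 relay log group commits, the average group is 10000 / 2000 = 5 transactions, i.e. roughly 5 transactions at a time are candidates for parallel apply.)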

            More ideas to come later.

            Thanks for improving monitoring and tuning-ability.

            danblack Daniel Black added a comment -

            If a relay_commits counter is possible at the same time as relay_group_commits, we can get an approximate graph of active threads over time. Is slave_parallel_workers_reached still useful then?

            s/_worker/_thread/g

            Not sure slave_parallel_workers_waiting helps; all but one thread will be waiting at some point. Please correct me if I'm missing something. An indicator of relay log events waiting in a group but held back by the slave_parallel_threads limit might be useful (parallel_in_order_group_thread_exhaustion?). On thread utilisation, the original busy vs wait time vs rollback breakdown probably covers that well enough.

            Further down the track: a breakdown of which slave_parallel_mode decision each transaction reaches (and how many per relevant category are rolled back).

            And then there is monitoring the out-of-order commits from slave_domain_parallel_threads and the limit it imposes; however, I'm thinking of presenting an information_schema table for that, per domain, if possible.


            knielsen Kristian Nielsen added a comment -

            I made a patch that adds some status variables, measuring time spent by
            worker threads being idle, processing events, and waiting for other
            transactions:

            http://lists.askmonty.org/pipermail/commits/2015-July/008126.html

            More details in the commit message in the link. This is not necessarily
            meant to be the final form (or any form) of this MDEV-7340, but it might be
            interesting, at least. Testing of the patch welcome.
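
            (A simplified sketch of the phase-accounting pattern the patch appears
            to use, based on the statistic_add() call quoted in a later comment;
            the counter and struct names here are illustrative, not the patch's
            actual names.)

              #include <stdint.h>

              extern uint64_t microsecond_interval_timer();

              static uint64_t worker_idle_time;     /* accumulated per phase */
              static uint64_t worker_busy_time;

              /* Each worker remembers which counter the current phase charges
                 to and when that phase started; on every phase change the
                 elapsed time is added to the finished phase's counter. */
              struct worker_phase_timer
              {
                uint64_t *current_counter;
                uint64_t phase_start;

                void switch_phase(uint64_t *new_counter)
                {
                  uint64_t now= microsecond_interval_timer();
                  *current_counter+= now - phase_start; /* charge old phase */
                  current_counter= new_counter;
                  phase_start= now;
                }
              };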

            danblack Daniel Black added a comment -

            Nice, thank you. I was looking to see if anything other than status_lock could be used but, like you, didn't see an easy approach. The status vars look good.

            knielsen Kristian Nielsen added a comment - edited

            Do you mean LOCK_status?

              statistic_add(*current_status_var,
                            new_time - slave_worker_phase_start_time, &LOCK_status);

            If my understanding is correct, this statistic_add compiles into an atomic
            add operation on platforms of interest. The lock is not actually used,
            except on some weird platform that lacks atomic operations.

            EDIT: Actually, it appears that neither lock nor atomic operations are
            used. There is a SAFE_STATISTICS define that causes them to be used, but it
            is never enabled. So there is no locking apparently, and the statistics can
            at least theoretically be off, trading 100% accuracy for improved
            performance.
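
            (For context, a paraphrase of the statistic_add() macro family from
            include/my_sys.h; the exact text may differ between versions.)

              #ifdef SAFE_STATISTICS
              #define statistic_add(V,C,L) thread_safe_add((V),(C),(L))
              #else
              /* Plain non-atomic add: fast, but concurrent writers can lose
                 updates, hence the "theoretically off" statistics above. */
              #define statistic_add(V,C,L) (V)+= (C)
              #endif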

            danblack Daniel Black added a comment -

            Thanks for looking this up.

            I'm quite happy with this tradeoff.

            If you're happy with its final form, any chance of a backport? It applies to 10.0 with minimal fuzz.

            danblack Daniel Black added a comment -

            Any chance of a port to 10.0? I don't particularly care if it changes slightly in the future. Something is better than nothing.

            danblack Daniel Black added a comment -

            Amazingly this patch still applies (with fuzz level 3) to 10.1 head. I didn't test its current correctness, however.


            People

              knielsen Kristian Nielsen
              danblack Daniel Black
              Votes: 1
              Watchers: 4
