  1. MariaDB Server
  2. MDEV-7202

[PATCH] additional statistics for parallel replication - Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending

Details

    Description

      In MDEV-6680 I thought some additional status would be helpful.

      Attached patch adds a total status for all threads for the Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending.

      Rather/in addition to totals, would push a per thread status as slave_parallel_eventqueue_0_size be acceptable?

      Anything else useful to capture/graph here?

          Activity

            danblack Daniel Black created issue -
            knielsen Kristian Nielsen made changes -
            Field Original Value New Value
            Assignee Kristian Nielsen [ knielsen ]
            danblack Daniel Black made changes -
            Summary additional statistics for parallel replication - Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending [PATCH] additional statistics for parallel replication - Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending
            serg Sergei Golubchik made changes -
            Fix Version/s 10.1 [ 16100 ]
            knielsen Kristian Nielsen made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            Ok, I (finally) got to look at this patch.

            > Attached patch adds a total status for all threads for the
            > Slave_parallel_eventqueue_size/Slave_parallel_eventqueue_freepending.

            The patch exposes the loc_qev_size and qev_free_pending fields as status
            variables. I don't really see how this is useful?

            This is completely internal detail to the memory management of the parallel
            replication. It is the size of an internal free list of buffers that each
            thread keeps for handling event queueing efficiently. Its size does not seem
            to mean much for how parallel replication is working?

            > Rather/in addition to totals, would push a per thread status as
            > slave_parallel_eventqueue_0_size be acceptable?

            Do you mean here that there would be N status variables, one for each worker
            thread? Maybe there are better places to expose per-thread statistics, like
            performance schema or information_schema? But as I said, it seems to me that
            these particular values will be more confusing than useful.

            > I thought some additional status would be helpful.
            > Anything else useful to capture/graph here?

            I 100% agree that more monitoring of parallel replication is needed.

            With respect to size of event queues, the issue here is that the code does not
            update the queue size after every event execution, in order to reduce lock
            contention. So the information is not easily available in the current code.

            I'm trying to think of a way to get size of pending events without introducing
            additional locking overhead.

            The SQL driver thread takes LOCK_rpl_thread whenever an event is queued. And
            the worker thread takes LOCK_parallel_entry whenever a new event group
            (transaction) is started. So maybe we could do something while these locks are
            held?

            Under LOCK_parallel_entry, a worker thread could update a counter of size of
            events processed but not yet freed (in class rpl_parallel_entry). And under
            LOCK_rpl_thread, the SQL driver thread could increment per-thread size of
            events queued. And somehow the status variable would combine these to obtain
            the right value. But it sounds a bit too complicated... would be nice to come
            up with a simpler idea to add good monitoring of parallel replication status
            (not just queue size).
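
A minimal sketch of the two-counter idea above, written in Python for brevity rather than the server's C++. Every name here (`ParallelEntry`, `WorkerThread`, `pending_event_size`, and the fields) is hypothetical and is not MariaDB code. It also assumes one particular reading of "combine": the entry keeps a cumulative size of processed events, and the status value is total queued minus total processed. Counters are only touched at points where the comment says a lock is already held, so no extra lock acquisitions are added on the per-event path.

```python
import threading


class ParallelEntry:
    """Hypothetical stand-in for rpl_parallel_entry."""

    def __init__(self):
        self.lock = threading.Lock()   # plays the role of LOCK_parallel_entry
        self.processed_size = 0        # cumulative size published by workers


class WorkerThread:
    """Hypothetical stand-in for one parallel replication worker."""

    def __init__(self, entry):
        self.entry = entry
        self.lock = threading.Lock()   # plays the role of LOCK_rpl_thread
        self.queued_size = 0           # cumulative size queued to this worker
        self._unpublished = 0          # executed, not yet added to the entry

    def enqueue(self, event_size):
        # SQL driver thread: this lock is already taken to queue the event.
        with self.lock:
            self.queued_size += event_size

    def execute_event(self, event_size):
        # Worker thread: purely local bookkeeping, no lock per event.
        self._unpublished += event_size

    def start_event_group(self):
        # Worker thread: this lock is already taken when a new event group
        # (transaction) starts, so publish the accumulated size here.
        with self.entry.lock:
            self.entry.processed_size += self._unpublished
        self._unpublished = 0


def pending_event_size(entry, workers):
    """Status value: total queued minus total published as processed.

    Reading the processed total first means a concurrent update can only
    make the result too large, never negative.
    """
    with entry.lock:
        processed = entry.processed_size
    queued = 0
    for worker in workers:
        with worker.lock:
            queued += worker.queued_size
    return queued - processed
```

With two workers and events of size 100, 50 and 30 queued, `pending_event_size` reports 180. It keeps reporting 180 after a worker executes the 100-byte event, and only drops to 80 once that worker calls `start_event_group()`. The reported value therefore lags by up to one event group per worker, which is the price of not locking on every event.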

            In general, I'm unsure how to balance the need for more monitoring against the
            overhead of locking/atomics needed to maintain such monitoring. There is
            already significant locking overhead in parallel replication, and not much
            benchmarking has been done to understand the significance of this overhead.

            knielsen Kristian Nielsen added the comment above.
            knielsen Kristian Nielsen made changes -
            Status In Progress [ 3 ] Stalled [ 10000 ]
            knielsen Kristian Nielsen made changes -
            Assignee Kristian Nielsen [ knielsen ]

            knielsen Kristian Nielsen added a comment -

            I tried to assign the issue back to user Daniel Black, but that did not seem possible.
            danblack Daniel Black added a comment -

            pivanof suggested much better options in MDEV-7340, so let's close this and continue there.

            danblack Daniel Black made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s N/A [ 14700 ]
            Fix Version/s 10.1 [ 16100 ]
            Resolution Duplicate [ 3 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Workflow MariaDB v2 [ 58722 ] MariaDB v3 [ 64862 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 64862 ] MariaDB v4 [ 132470 ]

            People

              Assignee: Unassigned
              Reporter: Daniel Black (danblack)