MariaDB Server
MDEV-6096

Ideas about parallel query execution

Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved

    Description

      Some ideas about using multiple threads to run a query.

      == Position at N% of table/index ==
      Consider queries

      select sum(a) from tbl group by non_key_col

      select sum(a) from tbl where key between C1 and C2 group by non_key_col

      If we want to run these with N threads, we need to give 1/Nth of the table to each thread. (An alternative is to run one "reader" thread and distribute the work to multiple compute threads. The problem with this is that reading from the table won't be parallel, which puts a cap on the achievable performance.)

      In order to do that, we will need storage engine calls that do:

      • "position at N% in the table"
      • "position at N% in the index range between [C1 and C2]".

      These calls would also let us build equi-height histograms based on sampling.
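      To get a feel for what such a call would provide, a split point can be emulated today with plain SQL, at the cost of scanning everything up to the offset. A minimal sketch, assuming a hypothetical table tbl with an indexed column key_col and roughly 1,000,000 rows in the range:

      -- Emulate "position at 25% in the index range [C1, C2]": pick the key
      -- value that sits a quarter of the way into the ordered range. Unlike a
      -- real storage engine call, this still reads all the skipped entries.
      SET @C1 = 100, @C2 = 900000;

      SELECT key_col
      FROM tbl
      WHERE key_col BETWEEN @C1 AND @C2
      ORDER BY key_col
      LIMIT 1 OFFSET 250000;   -- assumes ~1,000,000 rows fall inside the range

      -- Picking the values at 25%, 50% and 75% the same way would also give
      -- the bucket bounds of a 4-bucket equi-height histogram.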

      == General execution ==
      There is a lot of existing work on converting SQL into MapReduce jobs. Is it relevant to this task? The difference seems to be in the Map phase: we assume that the source data is equidistant from all worker threads.

      == Evaluation ==
      It would be nice to assess how much speedup we would get. To get an idea, we could break the query apart and run the parts manually. The merge step could also be done manually in some cases (by writing to, and reading from, temporary tables).
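      For example, a manual two-way split of the first query above could look like the sketch below. The split column pk and the split point 500000 are assumptions; run serially from one connection this only measures the per-part cost, and for real overlap the two worker parts would have to run from separate connections into ordinary scratch tables.

      -- Worker 1: aggregate the first half of the table.
      CREATE TEMPORARY TABLE part1 AS
        SELECT non_key_col, SUM(a) AS partial_sum
        FROM tbl
        WHERE pk < 500000
        GROUP BY non_key_col;

      -- Worker 2: aggregate the second half.
      CREATE TEMPORARY TABLE part2 AS
        SELECT non_key_col, SUM(a) AS partial_sum
        FROM tbl
        WHERE pk >= 500000
        GROUP BY non_key_col;

      -- Merge step: SUM() is decomposable, so the partial sums can simply be
      -- summed again.
      SELECT non_key_col, SUM(partial_sum) AS total
      FROM (SELECT * FROM part1
            UNION ALL
            SELECT * FROM part2) AS merged
      GROUP BY non_key_col;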

      Attachments

        Issue Links

          Activity

            stephane@skysql.com VAROQUI Stephane added a comment -

            I'm sharing some parameters that are of interest for this case:
            loose_spider_reset_sql_alloc=1 # because multi-part scans consume more memory

            These are among my favorites, but I have no idea any more whether they play a role in this:
            loose_spider_connect_mutex = 0
            loose_spider_conn_recycle_mode=0
            loose_spider_conn_recycle_strict = 0
            loose_spider_local_lock_table=0
            loose_spider_semi_table_lock = 0
            loose_spider_support_xa=0
            loose_spider_direct_dup_insert = 1
            loose_spider_remote_sql_log_off=1
            loose_spider_casual_read=0
            loose_spider_bka_mode = 0
            loose_spider_quick_mode=3
            loose_spider_quick_page_size=1000
            loose_spider_sync_trx_isolation=1
            loose_spider_sync_autocommit=1

            Interested in any feedback on whether they do.


            stephane@skysql.com VAROQUI Stephane added a comment -

            When Kentoku was onboarded, he was supposed to work on an SQL rewrite plugin that would have enabled easy SQL-syntax integration of such concurrent execution.


            stephane@skysql.com VAROQUI Stephane added a comment -

            The first integration of this feature was only possible for queries using a single table, but follow-up work made it possible to use parallel partition scan as well when the other tables in the query are federated to the same backend as the partition.

            TheWitness Larry Adams added a comment - edited

            Stephane,

            I've been going through the documentation, and I don't think there is a really good write-up on how to run Spider queries in a true parallel map-reduce fashion (or how to force it to). I'll do more investigation in my home lab. So far, I have distributed a year's worth of tables (365) across 4 servers that are all separate from the server that will serve as the Spider server. So I've got a good setup. When I did my first test in a customer environment, as mentioned, it was purely serial: one server and one partition at a time (disappointing). But that environment used the Spider server as a data server as well, so I'm not sure if that had any impact.

            What I'm really trying to do I will attach as a screen grab.

            My plan will be to:

            1) Setup 20 servers to 4 back-ends
            2) Create a script to UNION 40 tables into a partition table (programmatically); a sketch of such a table definition follows this comment
            3) Run some queries while watching the process list for parallelism.
            4) Tweak settings and repeat 3.

            Larry
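            As a rough sketch of what step 2 above could produce, here is one way a Spider partitioned table over remote daily tables is commonly defined, using CREATE SERVER and per-partition COMMENT options. All names (backend1/backend2, metrics_all, the remote tables, columns, hosts and credentials) are invented for illustration, and only 2 of the backends and 2 of the daily tables are shown:

            -- Remote data nodes, one CREATE SERVER per backend.
            CREATE SERVER backend1 FOREIGN DATA WRAPPER mysql
              OPTIONS (HOST '10.0.0.1', DATABASE 'metrics', USER 'spider', PASSWORD 'secret', PORT 3306);
            CREATE SERVER backend2 FOREIGN DATA WRAPPER mysql
              OPTIONS (HOST '10.0.0.2', DATABASE 'metrics', USER 'spider', PASSWORD 'secret', PORT 3306);

            -- One partition per remote table; each partition COMMENT tells Spider
            -- which backend server and remote table holds that day's data.
            CREATE TABLE metrics_all (
              ts DATETIME NOT NULL,
              a  BIGINT,
              KEY (ts)
            ) ENGINE=SPIDER
              COMMENT='wrapper "mysql"'
              PARTITION BY RANGE (TO_DAYS(ts)) (
                PARTITION p20140101 VALUES LESS THAN (TO_DAYS('2014-01-02'))
                  COMMENT = 'srv "backend1", table "metrics_20140101"',
                PARTITION p20140102 VALUES LESS THAN (TO_DAYS('2014-01-03'))
                  COMMENT = 'srv "backend2", table "metrics_20140102"'
              );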

            TheWitness Larry Adams added a comment -

            Stephane,

            With all those settings, zero performance difference. Each partition is handled serially. There is a document in which Shiba-san claimed he had a patch that added parallelization, but he provided no instructions in the write-up as to how to leverage it. I'm not sure what ever became of that, though.

            https://blog.mariadb.org/wp-content/uploads/2014/05/Spider_in_MariaDB_20140403.pdf

            Larry


            People

              Assignee: Unassigned
              Reporter: Sergei Petrunia (psergei)
              Votes: 12
              Watchers: 18

