Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.1.1
    • Component/s: None
    • Sprint: 2021-7, 2021-8, 2021-9

    Description

      Optimizer statistics

      Most engines feed their query optimizers with various types of statistics, e.g. histograms of most common values, fractions of NULL values, the Number of Distinct Values (NDV), and relation cardinality. These statistics allow the optimizer to make educated guesses about the cardinality of intermediate and final results of a query; for example, the optimizer uses them to find the optimal join order. MCS currently has only rudimentary statistics, consisting of relation cardinality alone.
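      As a rough illustration of the attributes involved, here is a minimal sketch of a per-column statistics record, assuming a hypothetical C++ representation (the type and field names are illustrative, not existing MCS types):

      {code:cpp}
      // Hypothetical per-column statistics record (illustrative only).
      #include <cstdint>
      #include <vector>

      struct MostCommonValue
      {
          int64_t value;      // encoded column value
          double  frequency;  // fraction of rows holding this value
      };

      struct ColumnStatistics
      {
          uint64_t relationCardinality = 0;           // total number of rows in the relation
          uint64_t ndv = 0;                           // Number of Distinct Values
          double   nullFraction = 0.0;                // fraction of NULL values
          std::vector<MostCommonValue> mcvHistogram;  // histogram of most common values
      };
      {code}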

      The goal

      We need to design and build facilities that allow MCS to make educated guesses about the cardinality of intermediate and final results of a query and to use this knowledge to produce an optimal plan.
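      For instance, with the attributes above available, the selectivity of an equality predicate can be approximated with the textbook 1/NDV heuristic; the sketch below only illustrates the idea and is not actual MCS costing code:

      {code:cpp}
      // Textbook cardinality estimate for "col = const" (illustration only).
      #include <algorithm>
      #include <cstdint>

      // Selectivity of an equality predicate is roughly 1 / NDV.
      inline double equalitySelectivity(uint64_t ndv)
      {
          return ndv != 0 ? 1.0 / static_cast<double>(ndv) : 1.0;
      }

      // Estimated output rows = relation cardinality * selectivity.
      inline uint64_t estimateEqualityOutput(uint64_t relationCardinality, uint64_t ndv)
      {
          const double rows = relationCardinality * equalitySelectivity(ndv);
          return std::max<uint64_t>(1, static_cast<uint64_t>(rows));
      }
      {code}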

      What is statistics

      The simplest version of the statistics must include the attributes listed earlier. An improved version should use an extensible format so that new attributes (e.g. correlated or dependent histograms on functionally dependent columns) can be added without extensive changes.
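      One possible way to keep the format extensible is to tag every attribute with a type and a version, so that new attribute kinds can be added without breaking existing readers; this is only a sketch under that assumption, not a committed format:

      {code:cpp}
      // Hypothetical extensible layout for serialized statistics attributes.
      #include <cstdint>
      #include <map>
      #include <vector>

      enum class StatAttrType : uint32_t
      {
          RelationCardinality = 1,
          Ndv                 = 2,
          NullFraction        = 3,
          McvHistogram        = 4,
          // New kinds (e.g. correlated histograms) get new tags; old readers skip unknown tags.
      };

      struct StatAttribute
      {
          StatAttrType         type;
          uint32_t             version;  // per-attribute format version
          std::vector<uint8_t> payload;  // attribute-specific serialized data
      };

      // All attributes collected for a single column, keyed by type.
      using ColumnStatsBundle = std::map<StatAttrType, StatAttribute>;
      {code}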

      What triggers stats collection

      There must be a user knob in the form of an ANALYZE SQL statement. A bulk insert operation that inserts more than N values must also trigger ANALYZE to refresh the stats.
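      The bulk-insert trigger could look roughly like the following hook invoked after a bulk load completes; the hook name, the threshold value, and scheduleAnalyze are all assumptions made for the sake of the sketch:

      {code:cpp}
      // Hypothetical post-bulk-insert hook that schedules ANALYZE once enough rows arrive.
      #include <cstdint>
      #include <string>

      constexpr uint64_t kAnalyzeRowThreshold = 1000000;  // the "N" from above; actual value TBD

      // Placeholder for whatever mechanism ends up re-running statistics collection.
      inline void scheduleAnalyze(const std::string& /*schema*/, const std::string& /*table*/)
      {
          // In a real implementation this would enqueue an ANALYZE job for the table.
      }

      void onBulkInsertFinished(const std::string& schema, const std::string& table,
                                uint64_t rowsInserted)
      {
          if (rowsInserted > kAnalyzeRowThreshold)
              scheduleAnalyze(schema, table);  // same effect as the user running ANALYZE
      }
      {code}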

      Consumers

      ExeMgr, or whatever component the optimizer ends up in, is the most important consumer. The fact that there may be more than one EM in the cluster must be taken into account.

      Where to put stats and how to preserve them between restarts

      EMs in the cluster must have direct access to the stats, so the in-memory representation of the stats must reside in the RAM of the primary EM. The other EMs must keep a synchronized copy of the stats.
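      The synchronization could be as simple as the primary EM broadcasting a serialized snapshot to the other EMs; the sketch below assumes a hypothetical transport and message layout, not the actual MCS messaging API:

      {code:cpp}
      // Hypothetical broadcast of a serialized stats snapshot from the primary EM to replicas.
      #include <cstdint>
      #include <string>
      #include <vector>

      struct EmPeer
      {
          std::string host;

          // Placeholder: the real transport would be the existing EM messaging layer.
          void send(const std::vector<uint8_t>& /*message*/) const {}
      };

      struct StatsManager
      {
          uint64_t epoch = 0;              // incremented on every stats refresh
          std::vector<uint8_t> statsBlob;  // stand-in for the real in-memory stats

          // Flatten the in-memory stats; a stand-in for real serialization.
          std::vector<uint8_t> serialize() const { return statsBlob; }

          void broadcast(const std::vector<EmPeer>& replicas)
          {
              ++epoch;
              const std::vector<uint8_t> snapshot = serialize();
              for (const EmPeer& peer : replicas)
                  peer.send(snapshot);  // replicas replace their copy if the epoch is newer
          }
      };
      {code}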

      Producers

      The EM must initiate statistics collection. The collection process must not differ much from a SELECT * FROM query. The sampling method has not yet been chosen.
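      Since the sampling method is still open, one candidate is simple Bernoulli row sampling performed during the full scan; the following is purely an illustration of that option:

      {code:cpp}
      // Illustrative Bernoulli sampling pass feeding the statistics builders.
      #include <cstdint>
      #include <random>
      #include <unordered_set>

      struct SampleStats
      {
          uint64_t rowsSeen = 0;
          uint64_t nullsSeen = 0;
          std::unordered_set<int64_t> distinctSample;  // basis for an NDV estimate
      };

      // Scan every row (like SELECT * FROM t) but keep only a random fraction for the sample.
      inline void sampleRow(SampleStats& stats, const int64_t* value, double sampleRate,
                            std::mt19937_64& rng)
      {
          ++stats.rowsSeen;
          if (value == nullptr)
          {
              ++stats.nullsSeen;
              return;
          }
          std::bernoulli_distribution keep(sampleRate);
          if (keep(rng))
              stats.distinctSample.insert(*value);
      }
      {code}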

      Please see the design doc for more details.

      Attachments

        Issue Links

          Activity

            gdorman Gregory Dorman (Inactive) created issue -
            gdorman Gregory Dorman (Inactive) made changes -
            Field Original Value New Value
            gdorman Gregory Dorman (Inactive) made changes -
            gdorman Gregory Dorman (Inactive) made changes -
            Affects Version/s All [ 25803 ]
            gdorman Gregory Dorman (Inactive) made changes -
            Description From the beginning of time, ColumnStore has not kept any meta information required for logical optimization of queries. This task intends to resolve that and enable a large set of very high-impact optimization projects. Primary examples:

            1. Even though CS data is always taken from OLTP systems, which have primary keys and unique indexes, this information is not recorded. As a result, it is not possible to estimate the impact of any pushed predicate on the row size of a filtering job step - instrumental in deciding on the best join order or performing a proper join elimination in case of circular joins, among others.
            2. Nor is there any column cardinality estimation mechanism (a viable, and likely more generally applicable, alternative to additional table metadata).
            3. There is no distribution information either.

            There are multiple possibilities.
            1. A more difficult yet more impactful option would be to collect column statistics - cardinalities and distributions - and use them at planning time.

            2. In the shorter term, it should be easy enough to enable the syntactical constructs PRIMARY KEY and UNIQUE INDEX. There is no need to build the additional data structures. But knowing that a given column is UNIQUE would provide tremendous help in estimating the size of a rowset resulting from applying an equi predicate to a scalar (or a "short" IN list of scalar values), hence in deciding on the proper join order.
            gdorman Gregory Dorman (Inactive) made changes -
            Description
            From the beginning of time, ColumnStore has not kept any meta information required for logical optimization of queries. This task intends to resolve that and enable a large set of very high-impact optimization projects. Primary examples:

            1. Even though CS data is always taken from OLTP systems, which have primary keys and unique indexes, this information is not recorded. As a result, it is not possible to estimate the impact of any pushed predicate on the row size of a filtering job step - instrumental in deciding on the best join order or performing a proper join elimination in case of circular joins, among others.
            2. Nor is there any column cardinality estimation mechanism (a viable, and likely more generally applicable, alternative to additional table metadata).
            3. There is no distribution information either.

            There are multiple possibilities.
            1. A more difficult yet more impactful option would be to collect column statistics - cardinalities and distributions - and use them at planning time.

            Knowing that a given column is predominantly UNIQUE would provide tremendous help in estimating the size of a rowset resulting from applying an equi predicate to a scalar (or a "short" IN list of scalar values), hence in deciding on the proper join order.

            Inversely, knowing that a column has low cardinality relative to the cardinality of the table would enable the detection of the cartesian explosion risks inherent in joins that use such columns on both sides (e.g. Query 5 of TPC-H) - so-called join elimination.

            2. In the shorter term, it should be easy enough to enable syntactical constructs like PRIMARY KEY and UNIQUE INDEX. There is no need to build additional data structures beyond traditional table metadata - these would merely be a PROXY PRIMARY KEY or a PROXY UNIQUE INDEX, helping the planner make proper guesses about the cardinalities of various job steps, leading to proper estimation of their costs and to proper decision making as regards join ordering or join elimination.
            gdorman Gregory Dorman (Inactive) made changes -
            denis0x0D Denis Khalikov (Inactive) made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            denis0x0D Denis Khalikov (Inactive) made changes -
            Sprint 2021-7 [ 514 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked higher
            drrtuy Roman made changes -
            Description
            drrtuy Roman made changes -
            Summary Create mechanism for keeping and interpreting column statistics Optimizer statistics
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-7 [ 514 ] 2021-7, 2021-8 [ 514, 521 ]
            drrtuy Roman made changes -
            Description
            denis0x0D Denis Khalikov (Inactive) made changes -
            toddstoffel Todd Stoffel (Inactive) made changes -
            Affects Version/s All [ 25803 ]
            gdorman Gregory Dorman (Inactive) made changes -
            Rank Ranked higher
            gdorman Gregory Dorman (Inactive) made changes -
            Sprint 2021-7, 2021-8 [ 514, 521 ] 2021-7, 2021-8, 2021-9 [ 514, 521, 541 ]
            gdorman Gregory Dorman (Inactive) made changes -
            Fix Version/s 6.1.1 [ 25600 ]
            Fix Version/s 6.5.1 [ 25801 ]
            denis0x0D Denis Khalikov (Inactive) made changes -
            Attachment screen.jpg [ 58134 ]
            toddstoffel Todd Stoffel (Inactive) made changes -
            Rank Ranked lower
            denis0x0D Denis Khalikov (Inactive) made changes -
            Assignee Denis Khalikov [ JIRAUSER48434 ] Daniel Lee [ dleeyh ]
            denis0x0D Denis Khalikov (Inactive) made changes -
            Status In Progress [ 3 ] In Testing [ 10301 ]
            dleeyh Daniel Lee (Inactive) made changes -
            Resolution Fixed [ 1 ]
            Status In Testing [ 10301 ] Closed [ 6 ]
            drrtuy Roman made changes -

            People

              Assignee: dleeyh Daniel Lee (Inactive)
              Reporter: gdorman Gregory Dorman (Inactive)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved:
