Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-19556

Support native storage engine sampling of rows

Details

    Description

      Histogram collection has been augmented in 10.4 with the ability to collect a percentage of rows. This was implemented via Bernoulli sampling. The drawback is that one has to perform a full table scan to perform sampling. This technique has reduced the bottleneck of Histograms collection substantially, however it can still be improved.

      Storage engine API should be extended to allow the server to make use (if available) storage engine sampling capabilities.

      This feature can be additionally used to support fast approximation functions such as a version of fast count-distinct with an estimator attached. (ex: Smoothed Jackknife Estimator) Additionally, with native sampling support, one could afford to perform various optimizations in the background (such as statistics collection), as the performance impact would be much smaller. Another use case for native sampling is SELECT FROM <table-sample>

      https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/

      The implementation will be done for 2 different storage engines (Aria & Innodb). The algorithm will make use of a weighted index-dive (to counteract if index pages are unbalanced).

      Attachments

        Issue Links

          Activity

            cvicentiu Vicențiu Ciorbaru created issue -
            cvicentiu Vicențiu Ciorbaru made changes -
            Field Original Value New Value
            Fix Version/s 10.5 [ 23123 ]
            cvicentiu Vicențiu Ciorbaru made changes -
            Component/s Storage Engine - Aria [ 10126 ]
            Component/s Storage Engine - InnoDB [ 10129 ]
            Component/s Storage Engine - MyISAM [ 10600 ]
            cvicentiu Vicențiu Ciorbaru made changes -
            Summary Support native storage engine Sampling Support native storage engine sampling of rows
            cvicentiu Vicențiu Ciorbaru made changes -
            Description Histogram collection has been augmented in 10.4 with the ability to collect a percentage of rows. This was implemented via Bernoulli sampling. The drawback is that one has to perform a full table scan to perform sampling. This technique has reduced the bottleneck of Histograms collection substantially, however it can still be improved.

            Storage engine API should be extended to allow the server to make use (if available) storage engine sampling capabilities.

            This feature can be additionally used to support fast approximation functions such as a version of fast count-distinct. Additionally, with native sampling support, one could afford to perform various optimizations in the background (such as statistics collection), as the performance impact would be much smaller.
            Histogram collection has been augmented in 10.4 with the ability to collect a percentage of rows. This was implemented via Bernoulli sampling. The drawback is that one has to perform a full table scan to perform sampling. This technique has reduced the bottleneck of Histograms collection substantially, however it can still be improved.

            Storage engine API should be extended to allow the server to make use (if available) storage engine sampling capabilities.

            This feature can be additionally used to support fast approximation functions such as a version of fast count-distinct with an estimator attached. (ex: Smoothed Jackknife Estimator) Additionally, with native sampling support, one could afford to perform various optimizations in the background (such as statistics collection), as the performance impact would be much smaller. Another use case for native sampling is SELECT FROM <table-sample>

            https://www.2ndquadrant.com/en/blog/tablesample-in-postgresql-9-5-2/

            The implementation will be done for 2 different storage engines (Aria & Innodb). The algorithm will make use of a weighted index-dive (to counteract if index pages are unbalanced).
            serg Sergei Golubchik made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.6 [ 24028 ]
            Fix Version/s 10.5 [ 23123 ]
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Priority Critical [ 2 ] Major [ 3 ]
            julien.fritsch Julien Fritsch made changes -
            julien.fritsch Julien Fritsch made changes -
            serg Sergei Golubchik made changes -
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.6 [ 24028 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.8 [ 26121 ]
            Fix Version/s 10.7 [ 24805 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 96891 ] MariaDB v4 [ 131090 ]
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            Fix Version/s 10.8 [ 26121 ]
            psergei Sergei Petrunia made changes -

            See MDEV-28637 for a patch that allows to test the sampling.

            psergei Sergei Petrunia added a comment - See MDEV-28637 for a patch that allows to test the sampling.
            serg Sergei Golubchik made changes -
            Fix Version/s 10.9 [ 26905 ]
            svoj Sergey Vojtovich made changes -
            Labels contribution
            vlad.radu Vlad Radu made changes -
            Labels contribution contribution foundation

            People

              cvicentiu Vicențiu Ciorbaru
              cvicentiu Vicențiu Ciorbaru
              Votes:
              1 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.