Details

    Description

      One of the reasons for bad query plans is inadequate cost estimation of individual operations. The cost of reading a row in one engine might be much higher than in another, but the optimizer cannot know that. It also uses hard-coded constants, assuming, for example, that evaluating a WHERE clause is 5 times cheaper than reading a row from a table (it used to be 10, now it's 5).

      Obviously, some kind of calibration procedure is needed to make these cost estimates reasonably correct. This is not easy, because the estimates depend on the actual hardware where MariaDB runs (the cost of a row read differs between HDD and SSD), and also, to some extent, on the application (the optimizer model isn't perfect, and cost factors vary for different types of queries, so it's better to calibrate on the queries that the user application will actually run).

      A simple and low-maintenance solution would be to use self-tuning cost coefficients that measure the timing and adjust automatically to the configuration where MariaDB is run.

      Assorted thoughts:

      • create a tuning coefficient for every low-level cost generator in the optimizer, that is, for every function or handler method that generates a cost value, and for every hard-coded constant or macro, but not for functions that derive cost values from other cost values. For a virtual method, create one coefficient per implementation, that is, one coefficient for read_time in every handler class, but not for its different parameters. For example, different tables in some engine might have different costs for reading a row, but we will still have one coefficient for read_time in that engine, not one per table; the engine is supposed to provide different costs internally if it needs to. The goal of these coefficients is to normalize the cost between different engines.
      • measure the time that the query took, split it proportionally between tables (according to the number of rows), and save the statistics per coefficient.
      • collect the statistics locally in the THD and add them to the global statistics on disconnect; this helps avoid contention on a shared resource.
      • the optimizer will use the global statistics, not thread-local ones. It shouldn't matter, as the coefficients will change very slowly.
      • when splitting the time, use the actual number of rows, not the estimate that the optimizer used.
      • per coefficient store a counter (bigint), the sum of times (double), and the sum of squared times (double); see the sketch after this list.
      • store the results persistently in the mysql database.
      • make them available via an I_S table. In this table show two more columns: the average and the standard deviation.
      • report these data via the feedback plugin; we can then adjust built-in constants or initial values of these coefficients. A very large deviation is a sign that an engine estimates the cost incorrectly (e.g. doesn't take the table into account, see above).
      • a user can update the table manually if she wishes. She can even freeze the coefficients by setting the count column to a very large value.
      • system load anomalies may introduce undesired changes to the coefficients. Is that a problem? Should we implement some countermeasures?
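
      A minimal sketch (hypothetical names, not existing server code) of what such a per-coefficient accumulator could look like, with per-THD collection merged into the global statistics on disconnect, and the average and standard deviation derived from the stored counter, sum and sum of squares:

        #include <cmath>
        #include <cstdint>
        #include <mutex>

        // One self-tuning cost coefficient (sketch only).
        struct Cost_coef_stats
        {
          uint64_t count= 0;    // number of measurements
          double   sum= 0;      // sum of measured times
          double   sum_sq= 0;   // sum of squared measured times

          void add(double t) { count++; sum+= t; sum_sq+= t*t; }

          // what the optimizer would use as the coefficient value
          double avg() const { return count ? sum / count : 1.0; }

          // extra column in the I_S table; a large value hints that an
          // engine estimates its costs incorrectly
          double stddev() const
          {
            if (count < 2) return 0.0;
            double m= avg();
            return std::sqrt(sum_sq / count - m * m);
          }
        };

        // Global statistics, updated only on disconnect to avoid contention.
        static Cost_coef_stats global_read_time_coef;
        static std::mutex global_coef_mutex;

        // Called from THD cleanup: merge thread-local statistics into globals.
        void merge_on_disconnect(const Cost_coef_stats &local)
        {
          std::lock_guard<std::mutex> lock(global_coef_mutex);
          global_read_time_coef.count+=  local.count;
          global_read_time_coef.sum+=    local.sum;
          global_read_time_coef.sum_sq+= local.sum_sq;
        }

      Freezing a coefficient, as mentioned above, would then simply amount to storing a huge count value so that new measurements barely move the average.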

          Activity

            rspadim roberto spadim added a comment -

            hi sergei, is there a file where these constants are declared? i could start changing the constants to variables without a problem

            serg Sergei Golubchik added a comment -

            Part of this task is, exactly, finding and identifying these constants. Some of them are implicit and not easy to find.

            Another big part is building the framework for collecting and storing time measurements and solving equations against these constants.

            rspadim roberto spadim added a comment -

            ok, what about the files where i should start looking? sql_select.cc, sql_select.h, opt_range.cc (i found a constant inside the NOT IN to IN() rewrite, is that the kind of constant i should look for or not?), should update and delete be checked too?
            are they CAPS_LOCK_ON or could they be CAPS_LOCK_ON_and_off?

            serg Sergei Golubchik added a comment -

            I'm afraid it's something that not even I will be able to do; one of
            our optimizer gurus should.

            Sometimes there is no constant at all, it is implicitly assumed to be 1.

            For example, the optimizer compares table->file->read_time() values for
            different tables. Here it assumes that the results of ::read_time() for
            different storage engines are directly comparable. They're not; they
            should be multiplied by a storage engine specific factor.

            Those places are very difficult to find.

            We did have a GSoC-2013 project for this task, but it was assumed that
            we would provide a list of constants to tune, and the student would only
            work on time measurements and aggregation.

            rspadim roberto spadim added a comment -

            ok, should i start a list of constants that i find in the optimizer? or will i be wasting my time?
            igniting Anshu Avinash added a comment -

            I'll be pushing the code on my github repo: https://github.com/igniting/server/tree/selfTuningOptimizer . @serg: I guess you can assign this issue to me.

            serg Sergei Golubchik added a comment - edited

            So, to summarize: there are two approaches to obtaining timing data:

            • counting operations, timing the query as a whole
            • actually timing individual operations, P_S-style

            In the first approach we only count the number of index lookups (N_i), rows read in a table scan (N_s), etc., and time the query as a whole (T_q). This gives an equation:

            T_i * N_i + T_s * N_s + ... = T_q

            Collecting the data from many queries, we get a system of linear equations (possibly even over-determined). It can be solved to give the durations of individual operations.
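
            As a rough illustration (hypothetical code, no real server structures), each query would contribute one such equation with its operation counts and total time, and the unknown per-operation durations could come out of a least-squares fit over the accumulated normal equations:

              #include <cmath>
              #include <utility>
              #include <vector>

              // Least-squares fit of per-operation durations T from
              // observations sum_j(T_j * N_j) = T_q (sketch only).
              struct Equation_system
              {
                size_t n;                 // number of coefficients
                std::vector<double> ata;  // n x n matrix A^T A
                std::vector<double> atb;  // n-vector   A^T b

                explicit Equation_system(size_t n_) : n(n_), ata(n_*n_), atb(n_) {}

                // counts[i] = times operation i was done, t_q = total query time
                void add_query(const std::vector<double> &counts, double t_q)
                {
                  for (size_t i= 0; i < n; i++)
                  {
                    for (size_t j= 0; j < n; j++)
                      ata[i*n + j]+= counts[i] * counts[j];
                    atb[i]+= counts[i] * t_q;
                  }
                }

                // Solve (A^T A) T = A^T b by Gaussian elimination with partial
                // pivoting (a real implementation must also handle singular
                // and ill-conditioned systems).
                std::vector<double> solve() const
                {
                  std::vector<double> a(ata), b(atb), t(n);
                  for (size_t k= 0; k < n; k++)
                  {
                    size_t p= k;
                    for (size_t i= k+1; i < n; i++)
                      if (std::fabs(a[i*n + k]) > std::fabs(a[p*n + k])) p= i;
                    for (size_t j= 0; j < n; j++) std::swap(a[k*n + j], a[p*n + j]);
                    std::swap(b[k], b[p]);
                    for (size_t i= k+1; i < n; i++)
                    {
                      double f= a[i*n + k] / a[k*n + k];
                      for (size_t j= k; j < n; j++) a[i*n + j]-= f * a[k*n + j];
                      b[i]-= f * b[k];
                    }
                  }
                  for (size_t k= n; k-- > 0; )
                  {
                    double s= b[k];
                    for (size_t j= k+1; j < n; j++) s-= a[k*n + j] * t[j];
                    t[k]= s / a[k*n + k];
                  }
                  return t;
                }
              };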

            In the second approach we directly time individual operations, much like PERFORMANCE_SCHEMA does. Preferably, reusing P_S measurements, not duplicating them.
            These approaches both have their benefits:

            • The first one is potentially much cheaper: it does not add timing overhead to every operation (and with constants like TIME_FOR_COMPARE, the individual operation is a comparison; timing it might be prohibitively expensive).
            • Also, it uses the complete query execution time, which is the only time users care about. If we see that the values of the coefficients don't converge over time, it means there's an unaccounted-for part in the query time, and we add another coefficient.
            • Making the second approach use P_S when P_S is disabled is not exactly trivial.

            But

            • The second needs less memory: for N coefficients it needs, per THD, at most 4N doubles (value, sum, sum squared, counter). The first one needs to store (in addition to the above) N equations, that's N² values. Although for large N this might be a sparse matrix, I don't know.
            • The second one measures only what we use in the optimizer, directly. No hidden parameters, and most coefficients should converge nicely. Still, it's meaningless from the user's point of view.
            serg Sergei Golubchik added a comment - edited

            more comments:
            first approach

            • time spent on locks, in the optimizer, etc.
              • measure only query execution time, after the optimizer; this will also cut off table locks
              • but row locks happen during execution and will distort the statistics
              • they will show up as significantly different and can be detected statistically and ignored

            second

            • P_S is disabled in MariaDB by default
            • the hypothesis: most of the P_S overhead comes from housekeeping, storing the data (particularly in the shared data structures) and calculating aggregations
            • that is, simply timing the waits is cheap
            • so even with P_S disabled we can enable the timing instrumentation; see the sketch after this list
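
            A minimal sketch of what "simply timing the waits" could look like (hypothetical, not P_S code): a scope guard reads a monotonic clock around a wait and feeds the duration into per-THD statistics, independently of whether the full P_S machinery is enabled:

              #include <chrono>
              #include <ratio>

              // per-THD accumulator for one wait class (sketch only)
              struct Wait_stats { unsigned long long count= 0; double sum= 0, sum_sq= 0; };

              // Times one wait for the lifetime of the object.
              struct Wait_timer
              {
                Wait_stats &stats;
                std::chrono::steady_clock::time_point start;

                explicit Wait_timer(Wait_stats &s)
                  : stats(s), start(std::chrono::steady_clock::now()) {}

                ~Wait_timer()
                {
                  std::chrono::duration<double, std::micro> d=
                      std::chrono::steady_clock::now() - start;
                  stats.count++;
                  stats.sum+= d.count();               // microseconds in this wait
                  stats.sum_sq+= d.count() * d.count();
                }
              };

              // usage: { Wait_timer t(per_thd_index_read_waits); /* do the wait */ }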

            third - combined method

            • measure all waits directly, using P_S instrumentation
            • TIME_FOR_COMPARE and other internal stuff - indirectly
            • this gives exact data for longer operations
            • but doesn't introduce timing overhead for short operations
            • and reduces the number of coefficients that we put into the system of equations, thus reducing the memory footprint

            Also, one index read might have a vastly different cost even for the same engine, the same table, and the same index, for example depending on whether the data is cached or not. It is the job of the engine to recognize this situation and adjust read_time() accordingly. The problem is that over time we'll see different timings for the same number of index reads, so the corresponding factor won't converge. For it to converge we need to take the engine's read_time() value into account. That is, we'll have not

            total_index_read_time = factor * Nrows

            but

            total_index_read_time = factor * Nrows * read_time() / Nrows_estimated

            (where Nrows is how many rows were actually read from the index, while Nrows_estimated is how many rows the optimizer expected to read, i.e. the number of rows that read_time() got as an argument)
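
            A small sketch of how that normalization could look (hypothetical names, no real server code): one observation of the per-engine factor is the measured time divided by the engine's own read_time() estimate scaled to the actual row count, so variation that read_time() already accounts for does not prevent convergence:

              // one observation of the index-read factor for an engine
              double observe_index_read_factor(double measured_time,     // time actually spent
                                               double engine_read_time,  // engine's read_time() estimate
                                               double n_rows_actual,     // rows really read
                                               double n_rows_estimated)  // rows read_time() was asked about
              {
                // read_time() per estimated row, scaled to the rows actually read
                double scaled_estimate= engine_read_time * n_rows_actual / n_rows_estimated;
                return measured_time / scaled_estimate;  // should converge towards a constant
              }

            Each such observation would then be fed into the per-coefficient statistics (counter, sum, sum of squares) described in the issue description.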

            ebergen Eric Bergen added a comment -

            I can see this going horribly wrong both when system load changes and due to hardware anomalies. For example, when a hard drive hits a bad sector it may do an I/O which takes up to 8 seconds by default, which may throw off all of your measurements for that query. If that query ends up updating the global statistics, then all future queries may be dramatically punished because of a single isolated problem.

            CPU overload can also affect timings. For example, if you have a burst of threads running, then certain parts of a query may take longer as they fight for locks or CPU time.

            It would be nice to run a sample workload to gather these statistics and then be able to freeze them on an instance. This would take care of the cases where the optimizer makes assumptions about disk seek time that were true on hard drives but aren't true on flash hardware, without having to worry about hosts all of a sudden wildly changing their query plans due to hardware blips.


            serg Sergei Golubchik added a comment -

            A specific value (say, "read_time") will be measured many times and the on-disk table will store the sum of all measurements and the number of them. The optimizer will use the average (sum/count). This means:

            • hopefully, short spikes won't disrupt the average too much
            • one can effectively "freeze" the average by updating the table and putting in a huge "count" value.
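
            To illustrate the "freeze" trick with made-up numbers: with the count set to 10^12 and an average of 2 ms, even a single 8-second outlier measurement moves the average by roughly (8000 ms - 2 ms) / 10^12 ≈ 0.000000008 ms, i.e. not at all in practice.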

            rspadim roberto spadim added a comment -

            Eric, yes that's a problem, but... that's a nice solution to many problems.
            The statistics are the main part here: if the stored mean (or any other usable variable) becomes too distant from the "current" measurements, we could alert the DBA about disk problems or too many queries/second. Think of it as a useful feature to alert DBAs and developers/engineers about a possible problem =)
            As for increasing the robustness of the self-tuning cost coefficients algorithm, we can develop that with more time and use cases. For the first version, a measure-only feature (no changes to cost coefficients/optimizer) is ok; with time we'll get the experience to write a good (robust/intelligent) control =]

            rspadim roberto spadim added a comment -

            Hi guys, any news here?

            stephane@skysql.com VAROQUI Stephane added a comment -

            I would also give a +1 for a global statistics approach rather than gathering fixed constants as in a benchmark.

            But I would add that a per-table set of constants would be better, because in any workload some specific tables may more often cause non-deterministic cache overflows:

            • Data Cache
            • Index Cache
            • FS Cache
            • Controller Cache
            • CPU Caches

            Non-determinism is also driven by the workload itself, like:

            • writing/reading from the latest time range,
            • number of secondary indexes in a write,
            • storage engine,
            • randomness of the data access pattern,
            • lock time,
            • size of the columns and rows fetched

            It looks like such metrics are mostly per-table dependent, and can help the query plan to better estimate the join cost at each depth.
            jwang james wang added a comment -

            Hi All,

            I went to the MariaDB Roadshow yesterday in London, where I suggested:

            1). look at top slow queries in performance_schema table
            2). record the query execution plan
            3). try different index (especially compound index)
            4). if new index works better, use it for the same query in future.

            Hope this makes sense


            rspadim roberto spadim added a comment -

            1). look at top slow queries in performance_schema table
            not only slow queries, but also stats about these queries; they should be grouped by a hash, like Percona does: constants are changed to "?" (e.g. "select * from table where column=? and column1>=? and column2>=?") and the operations (and/or/>/</=/etc...) are ordered

            3). try different index (especially compound index)
            grid-searching indexes is ok to do; it consumes resources but finds many solutions. it's important to record not only the execution time, but also resource and metadata information, to create a good metric function. for example: is an index on (int,int) "better" than an index on (double,double,string,double) with the same execution time, or not? a performance metric should be created to evaluate the best search path

            4). if new index works better, use it for the same query in future.
            at some point the problem is how to recognize the "same query" (the same as in 1) in the parser/optimizer without penalizing global performance
