Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7383

[PATCH] engine-independent-stats column_stats has limited values for max/min values

Details

    • 10.0.20

    Description

      Engine-independent stats can't parse some values into min/max_value in column stats. The result is a NON NULL empty string. I suspect query plans based on this information won't work too well.

      MariaDB [logsleads]> analyze table GeoIPOrg persistent for all;
      +--------------------+---------+----------+---------------------------------------------------------------------------+
      | Table              | Op      | Msg_type | Msg_text                                                                  |
      +--------------------+---------+----------+---------------------------------------------------------------------------+
      | logsleads.GeoIPOrg | analyze | status   | Engine-independent statistics collected                                   |
      | logsleads.GeoIPOrg | analyze | Warning  | Incorrect string value: '\xD8ybj\xF8r...' for column 'max_value' at row 1 |
      | logsleads.GeoIPOrg | analyze | status   | Table is already up to date                                               |
      +--------------------+---------+----------+---------------------------------------------------------------------------+
      3 rows in set (9.52 sec)
       
       
      | GeoIPOrg | CREATE TABLE `GeoIPOrg` (
        `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
        `ipv4n_start` int(10) unsigned NOT NULL,
        `ipv4n_end` int(10) unsigned NOT NULL,
        `org_name` varchar(100) NOT NULL,
        `bucket` smallint(5) unsigned NOT NULL,
        PRIMARY KEY (`id`),
        UNIQUE KEY `ipv4n_start` (`ipv4n_start`,`ipv4n_end`),
        KEY `bucket` (`bucket`)
      ) ENGINE=MyISAM AUTO_INCREMENT=1292213 DEFAULT CHARSET=latin1 |
       
      MariaDB [mysql]> select *,IF(max_value IS NULL,'NULL','NOT NULL') as max_value_null,hex(max_value) as max_value_hex,length(max_value) as max_value_len from column_stats where db_name='logsleads' and table_name='GeoIPOrg';
      +-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
      | db_name   | table_name | column_name | min_value                                        | max_value  | nulls_ratio | avg_length | avg_frequency | hist_size | hist_type | histogram | max_value_null | max_value_hex        | max_value_len |
      +-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
      | logsleads | GeoIPOrg   | bucket      | 256                                              | 57343      |      0.0000 |     2.0000 |       49.4683 |         0 | NULL      | NULL      | NOT NULL       | 3537333433           |             5 |
      | logsleads | GeoIPOrg   | id          | 1                                                | 1292212    |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 31323932323132       |             7 |
      | logsleads | GeoIPOrg   | ipv4n_end   | 16777471                                         | 3758096127 |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 33373538303936313237 |            10 |
      | logsleads | GeoIPOrg   | ipv4n_start | 16777216                                         | 3758095872 |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 33373538303935383732 |            10 |
      | logsleads | GeoIPOrg   | org_name    | ! atlngahpla1_g8/7 to atlngahpla2_g8/7 ospf area |            |      0.0000 |    21.6174 |        2.6580 |         0 | NULL      | NULL      | NOT NULL       |                      |             0 |
      +-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
       
      select org_name,hex(org_name) from GeoIPOrg order by org_name desc limit 1;
      +------------------+--------------------------+
      | org_name         | hex(org_name)            |
      +------------------+--------------------------+
      | Øybjørg_sæbø     | D879626AF872675F73E662F8 |
      +------------------+--------------------------+

      Attachments

        Issue Links

          Activity

            danblack Daniel Black added a comment -

            I'm part way filling out these functions and limiting comparison/storage to 255 bytes. Please tell me if its a bad idea.

            danblack Daniel Black added a comment - I'm part way filling out these functions and limiting comparison/storage to 255 bytes. Please tell me if its a bad idea.

            danblack, two reactions:

            1. I am wondering whether this is at all important to have good statistics for blob fields. My impression is that the only kind of condition on blob fields that occurs in real life is blob_field LIKE '%foo' (Does this match your observations?) , for which statistics is not helpful. It is bad that the server writes an invalid column stats record, and the simplest solution is to not collect or store any statistics for blob columns. This would make ANALYZE TABLE faster, too.

            2. Capping comparison/storage looks like a correct thing to do, although #1 is a simpler solution.

            psergei Sergei Petrunia added a comment - danblack , two reactions: 1. I am wondering whether this is at all important to have good statistics for blob fields. My impression is that the only kind of condition on blob fields that occurs in real life is blob_field LIKE '%foo' (Does this match your observations?) , for which statistics is not helpful. It is bad that the server writes an invalid column stats record, and the simplest solution is to not collect or store any statistics for blob columns. This would make ANALYZE TABLE faster, too. 2. Capping comparison/storage looks like a correct thing to do, although #1 is a simpler solution.
            danblack Daniel Black added a comment - - edited

            psergey, agree with 1 (and arjen does too), anything searching/joining/ordering on a blob column or part there of is probably doing something wrong and implementing stats for it isn't going to add much value to the case where people have proper queries.

            Should just document somewhere that there is no column_stats on blob columns.

            I'll create a new task for ANALYZE TABLE to skip blob column retrieval for the purpose of engine independent stats.

            So happy to close this.

            For option 2, I did get part way as per attached patch (fails for varying length of blob columns) but if there is someone who can put a decent case for it in.

            danblack Daniel Black added a comment - - edited psergey , agree with 1 (and arjen does too), anything searching/joining/ordering on a blob column or part there of is probably doing something wrong and implementing stats for it isn't going to add much value to the case where people have proper queries. Should just document somewhere that there is no column_stats on blob columns. I'll create a new task for ANALYZE TABLE to skip blob column retrieval for the purpose of engine independent stats. So happy to close this. For option 2, I did get part way as per attached patch (fails for varying length of blob columns) but if there is someone who can put a decent case for it in.
            elenst Elena Stepanova added a comment - - edited

            Please don't forget to watch buildbot. The push broke 10.0 tests.
            Better still, please run tests before pushing, at least default suites. There is nothing fancy that only buildbot would catch, both main.system_mysql_db and funcs_1.is_columns_mysql fail on regular builds.

            http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-quantal-amd64/builds/1891/steps/test/logs/stdio

            elenst Elena Stepanova added a comment - - edited Please don't forget to watch buildbot. The push broke 10.0 tests. Better still, please run tests before pushing, at least default suites. There is nothing fancy that only buildbot would catch, both main.system_mysql_db and funcs_1.is_columns_mysql fail on regular builds. http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-quantal-amd64/builds/1891/steps/test/logs/stdio
            danblack Daniel Black added a comment - psergey fixed in 2776159e424dc04ef162b9cc0a2bd075419e0eb4

            People

              psergei Sergei Petrunia
              danblack Daniel Black
              Votes:
              1 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.