[MDEV-7383] [PATCH] engine-independent-stats column_stats has limited values for max/min values Created: 2014-12-27  Updated: 2015-11-18  Resolved: 2015-11-18

Status: Closed
Project: MariaDB Server
Component/s: Optimizer
Affects Version/s: 10.0.15
Fix Version/s: 10.0.23

Type: Bug Priority: Major
Reporter: Daniel Black Assignee: Sergei Petrunia
Resolution: Fixed Votes: 1
Labels: eits, optimizer
Environment:

Debian Wheezy 10.0.15


Attachments: Text File incomplete-attempt-at-stats-for-blobs.txt     File mdev-7383-column-stats-min-max-value.patch     File mdev-7383-statistics-to-handle-as-my_charset_bin.patch    
Issue Links:
Duplicate
is duplicated by MDEV-9097 Engine independent statistics only al... Closed
Relates
relates to MDEV-9118 ANALYZE TABLE for Engine independent ... Closed
relates to MDEV-9119 Document that engine independent stat... Closed
Sprint: 10.0.20

 Description   

Engine-independent stats can't parse some values into min/max_value in column stats. The result is a NON NULL empty string. I suspect query plans based on this information won't work too well.

MariaDB [logsleads]> analyze table GeoIPOrg persistent for all;
+--------------------+---------+----------+---------------------------------------------------------------------------+
| Table              | Op      | Msg_type | Msg_text                                                                  |
+--------------------+---------+----------+---------------------------------------------------------------------------+
| logsleads.GeoIPOrg | analyze | status   | Engine-independent statistics collected                                   |
| logsleads.GeoIPOrg | analyze | Warning  | Incorrect string value: '\xD8ybj\xF8r...' for column 'max_value' at row 1 |
| logsleads.GeoIPOrg | analyze | status   | Table is already up to date                                               |
+--------------------+---------+----------+---------------------------------------------------------------------------+
3 rows in set (9.52 sec)
 
 
| GeoIPOrg | CREATE TABLE `GeoIPOrg` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `ipv4n_start` int(10) unsigned NOT NULL,
  `ipv4n_end` int(10) unsigned NOT NULL,
  `org_name` varchar(100) NOT NULL,
  `bucket` smallint(5) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `ipv4n_start` (`ipv4n_start`,`ipv4n_end`),
  KEY `bucket` (`bucket`)
) ENGINE=MyISAM AUTO_INCREMENT=1292213 DEFAULT CHARSET=latin1 |
 
MariaDB [mysql]> select *,IF(max_value IS NULL,'NULL','NOT NULL') as max_value_null,hex(max_value) as max_value_hex,length(max_value) as max_value_len from column_stats where db_name='logsleads' and table_name='GeoIPOrg';
+-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
| db_name   | table_name | column_name | min_value                                        | max_value  | nulls_ratio | avg_length | avg_frequency | hist_size | hist_type | histogram | max_value_null | max_value_hex        | max_value_len |
+-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
| logsleads | GeoIPOrg   | bucket      | 256                                              | 57343      |      0.0000 |     2.0000 |       49.4683 |         0 | NULL      | NULL      | NOT NULL       | 3537333433           |             5 |
| logsleads | GeoIPOrg   | id          | 1                                                | 1292212    |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 31323932323132       |             7 |
| logsleads | GeoIPOrg   | ipv4n_end   | 16777471                                         | 3758096127 |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 33373538303936313237 |            10 |
| logsleads | GeoIPOrg   | ipv4n_start | 16777216                                         | 3758095872 |      0.0000 |     4.0000 |        1.0000 |         0 | NULL      | NULL      | NOT NULL       | 33373538303935383732 |            10 |
| logsleads | GeoIPOrg   | org_name    | ! atlngahpla1_g8/7 to atlngahpla2_g8/7 ospf area |            |      0.0000 |    21.6174 |        2.6580 |         0 | NULL      | NULL      | NOT NULL       |                      |             0 |
+-----------+------------+-------------+--------------------------------------------------+------------+-------------+------------+---------------+-----------+-----------+-----------+----------------+----------------------+---------------+
 
select org_name,hex(org_name) from GeoIPOrg order by org_name desc limit 1;
+------------------+--------------------------+
| org_name         | hex(org_name)            |
+------------------+--------------------------+
| Øybjørg_sæbø     | D879626AF872675F73E662F8 |
+------------------+--------------------------+



 Comments   
Comment by Daniel Black [ 2014-12-27 ]

MariaDB [mysql]> alter table column_stats modify min_value VARBINARY(255) DEFAULT NULL, modify max_value VARBINARY(255) DEFAULT NULL;
Query OK, 6907 rows affected (0.03 sec)                
Records: 6907  Duplicates: 0  Warnings: 0
 
MariaDB [mysql]> analyze table logsleads.GeoIPOrg persistent for all;
+--------------------+---------+----------+-----------------------------------------+
| Table              | Op      | Msg_type | Msg_text                                |
+--------------------+---------+----------+-----------------------------------------+
| logsleads.GeoIPOrg | analyze | status   | Engine-independent statistics collected |
| logsleads.GeoIPOrg | analyze | status   | Table is already up to date             |

Is varbinary(255) a sufficient solution?

Comment by Elena Stepanova [ 2014-12-27 ]

create table t1 (a varchar(100)) engine=MyISAM;
insert into t1 values(unhex('D879626AF872675F73E662F8'));
analyze table t1 persistent for all;

+---------+---------+----------+---------------------------------------------------------------------------+
| Table   | Op      | Msg_type | Msg_text                                                                  |
+---------+---------+----------+---------------------------------------------------------------------------+
| test.t1 | analyze | status   | Engine-independent statistics collected                                   |
| test.t1 | analyze | Warning  | Incorrect string value: '\xD8ybj\xF8r...' for column 'min_value' at row 1 |
| test.t1 | analyze | Warning  | Incorrect string value: '\xD8ybj\xF8r...' for column 'max_value' at row 1 |
| test.t1 | analyze | status   | OK                                                                        |
+---------+---------+----------+---------------------------------------------------------------------------+
4 rows in set (0.59 sec)

Comment by Daniel Black [ 2015-01-15 ]

assuming the move to varbinary(255) is an acceptable solution here is the patch and test case.

Comment by Daniel Black [ 2015-01-15 ]

patch 2 of the statistics.cc to handle as my_charset_bin rather than my_charset_utf8_bin.

Comment by Sergei Petrunia [ 2015-02-20 ]

Trying to figure out what column_stats.min/max_value should store

  • For numeric types, it stores a string representation of the value (in a latin1-compatible charset).
  • For string-based types, it stores the min/max value in the charset of the field.
Comment by Daniel Black [ 2015-08-31 ]

For string-based types, it stores the min/max value in the charset of the field.

Makes sense to me. Which would mean changing to a binary type for storage (for compatibility with direct manipulation of the stats tables) (patch as attached) and changing the store/retrieve methods in sql_statistics.cc

Comment by Daniel Black [ 2015-09-01 ]

From one of our clients that does a lot of multilingual work:

MariaDB [mysql]> select count(*) as isnulltotal  from information_schema.columns i join information_schema.tables tt on i.TABLE_SCHEMA=tt.TABLE_SCHEMA and i.TABLE_NAME=tt.TABLE_NAME and TABLE_ROWS>0   join column_stats cs         on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name)  and nulls_ratio<1.00   where max_value is null or min_value is null ;
+-------------+
| isnulltotal |
+-------------+
|        4083 |
+-------------+
 
without where clause - total not empty tables 69526
 
MariaDB [mysql]> select 4083/69526;
+------------+
| 4083/69526 |
+------------+
|     0.0587 |
+------------+
 

5.87% of columns couldn't set the value of min or max in the column stats.

note: updated to exclude empty tables and columsn where all was null which warped results significantly

Comment by Daniel Black [ 2015-09-01 ]

breakdown:

MariaDB [mysql]> select i.character_set_name as charset,data_type as dt,count(*) as isnulltotal, t as total
   from information_schema.columns i
   join information_schema.tables tt
        on i.TABLE_SCHEMA=tt.TABLE_SCHEMA and i.TABLE_NAME=tt.TABLE_NAME and TABLE_ROWS>0
   join column_stats cs
        on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name) and nulls_ratio<1.00
   join (select count(*) as t,character_set_name as c, data_type as dt
                    from information_schema.columns i
                     join column_stats cs
                           on (i.TABLE_SCHEMA=cs.db_name and i.TABLE_NAME=cs.table_name and i.column_name=cs.column_name) and nulls_ratio<1.00
                    group by c,dt) itots
        on (i.character_set_name=itots.c or (i.character_set_name is null and itots.c is null)) and data_type=itots.dt
   where max_value is null or min_value is null
   group by i.character_set_name,dt
   with rollup;
 
| charset | dt         | isnulltotal | total |
+---------+------------+-------------+-------+
| NULL    | blob       |         207 |   207 |
| NULL    | longblob   |         500 |   509 |
| latin1  | longtext   |           7 |     7 |
| latin1  | text       |          18 |    18 |
| latin1  | tinytext   |           1 |     1 |
| utf8    | longtext   |        1331 |  1331 |
| utf8    | mediumtext |         712 |   712 |
| utf8    | text       |        1293 |  1293 |
| utf8    | tinytext   |          14 |    14 |

So it looks to need text truncation to the right length at the same time.

updated to exclude empty tables and those colums with all nulls:

Comment by Sergei Petrunia [ 2015-11-09 ]

danblack, thanks for the patch. Pushed into 10.0 tree.

Comment by Daniel Black [ 2015-11-09 ]

Thanks psergey

Sorry, there's one more case that I didn't patch here.

I should of included a full text case from my last comment. blob and text column end up with NULL in max/min too (without warnings).

MariaDB [test]> create table t2 (a blob) engine=MyISAM;
Query OK, 0 rows affected (0.00 sec)
 
MariaDB [test]> insert into t2 values(repeat(unhex('D879626AF872675F73E662F8'), 1024));
Query OK, 1 row affected (0.00 sec)
 
MariaDB [test]> analyze table t2 persistent for all;
+---------+---------+----------+-----------------------------------------+
| Table   | Op      | Msg_type | Msg_text                                |
+---------+---------+----------+-----------------------------------------+
| test.t2 | analyze | status   | Engine-independent statistics collected |
| test.t2 | analyze | status   | OK                                      |
+---------+---------+----------+-----------------------------------------+
2 rows in set (0.00 sec)
 
MariaDB [test]> select * from mysql.column_stats;
+---------+------------+-------------+-----------+-----------+-------------+------------+---------------+-----------+-----------+-----------+
| db_name | table_name | column_name | min_value | max_value | nulls_ratio | avg_length | avg_frequency | hist_size | hist_type | histogram |
+---------+------------+-------------+-----------+-----------+-------------+------------+---------------+-----------+-----------+-----------+
| test    | t2         | a           | NULL      | NULL      |      0.0000 | 12288.0000 |          NULL |      NULL | NULL      | NULL      |
+---------+------------+-------------+-----------+-----------+-------------+------------+---------------+-----------+-----------+-----------+

Comment by Daniel Black [ 2015-11-10 ]

Field_blob::update_min and Field_blob::update_max appear to be intentionally not implemented (comment /* Never update the value of min_val for a blob field */ in field.h) and hence the result is NULL ( and not empty as per the original encoding problem).

Comment by Daniel Black [ 2015-11-10 ]

I'm part way filling out these functions and limiting comparison/storage to 255 bytes. Please tell me if its a bad idea.

Comment by Sergei Petrunia [ 2015-11-11 ]

danblack, two reactions:

1. I am wondering whether this is at all important to have good statistics for blob fields. My impression is that the only kind of condition on blob fields that occurs in real life is blob_field LIKE '%foo' (Does this match your observations?) , for which statistics is not helpful. It is bad that the server writes an invalid column stats record, and the simplest solution is to not collect or store any statistics for blob columns. This would make ANALYZE TABLE faster, too.

2. Capping comparison/storage looks like a correct thing to do, although #1 is a simpler solution.

Comment by Daniel Black [ 2015-11-12 ]

psergey, agree with 1 (and arjen does too), anything searching/joining/ordering on a blob column or part there of is probably doing something wrong and implementing stats for it isn't going to add much value to the case where people have proper queries.

Should just document somewhere that there is no column_stats on blob columns.

I'll create a new task for ANALYZE TABLE to skip blob column retrieval for the purpose of engine independent stats.

So happy to close this.

For option 2, I did get part way as per attached patch (fails for varying length of blob columns) but if there is someone who can put a decent case for it in.

Comment by Elena Stepanova [ 2015-11-12 ]

Please don't forget to watch buildbot. The push broke 10.0 tests.
Better still, please run tests before pushing, at least default suites. There is nothing fancy that only buildbot would catch, both main.system_mysql_db and funcs_1.is_columns_mysql fail on regular builds.

http://buildbot.askmonty.org/buildbot/builders/kvm-bintar-quantal-amd64/builds/1891/steps/test/logs/stdio

Comment by Daniel Black [ 2015-11-13 ]

psergey fixed in 2776159e424dc04ef162b9cc0a2bd075419e0eb4

Generated at Thu Feb 08 07:19:10 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.