[MDEV-26787] Document that NATURAL_SORT_KEY virtual column should be longer than the base column (and other notes) Created: 2021-10-07  Updated: 2021-10-19  Resolved: 2021-10-19

Status: Closed
Project: MariaDB Server
Component/s: Documentation
Fix Version/s: N/A

Type: Task Priority: Critical
Reporter: Elena Stepanova Assignee: Ian Gilfillan
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
Problem/Incident
is caused by MDEV-4742 A function for native natural sorting... Closed

 Description   

Contrary to the example in the blog post about natural sort, the virtual column should be longer than the base column. This is due to the specifics of the design, which makes values in the virtual columns be longer than the originally inserted values.

We can't know at the table creation how much longer it should be, so I suppose there is no point throwing an error or a warning upon table creation, but at least it should be documented. Otherwise, even a simple example like the on in the blog post only works by pure chance.

Here is the same example, only instead of ('a1') it inserts ('a10'):

MariaDB [test]> create table t(c varchar(3), k varchar(3) as (natural_sort_key(c)) invisible);
Query OK, 0 rows affected (0.038 sec)
 
MariaDB [test]> insert into t(c) values ('b1'),('a2'),('a11'),('a10');
Query OK, 4 rows affected (0.006 sec)
Records: 4  Duplicates: 0  Warnings: 0
 
MariaDB [test]> select * from t order by k;
+------+
| c    |
+------+
| a2   |
| a11  |
| a10  |
| b1   |
+------+
4 rows in set (0.001 sec)

This is because both a11 and a10 cause an overflow in column k.
The overflow is currently undetected due to the bug MDEV-24582, but even if it is detected and causes an error or truncation with a warning, it is hardly what the users want. Instead, they should make sure that the column is longer from the start.



 Comments   
Comment by Vladislav Vaintroub [ 2021-10-07 ]

I doubt there could be any design where you get this function to be smaller or the same size as the original source string.
Strings without embedded numbers behave like normal strings, so that the size must be at least original string. Embedded numbers add something to the behavior, and this "something" has to cost some extrabits.

Comment by Elena Stepanova [ 2021-10-07 ]

Sure, but users can't know it, hence "documentation". There is no documentation for the virtual column part yet (or I haven't found it), thus not a bug, but a task to make a mention of it when the time comes.

Comment by Elena Stepanova [ 2021-10-11 ]

Some optional usage examples for documentation (just suggestions).

While it is not the main idea of natural sort, it is handy for sorting ENUM values. Usually they are sorted by their numeric value, which is not necessarily what the user wants. Natural sort sorts them by their text value instead.

create or replace table t (pk int auto_increment primary key, fname varchar(64), ftype ENUM('AUDIO','VIDEO','TEXT','ARCHIVE'));
insert into t (fname,ftype) values ('doc.txt','TEXT'),('movie.mp4','VIDEO'),('song.mp3','AUDIO'),('arch.zip','ARCHIVE');
MariaDB [test]> select * from t order by ftype;
+----+-----------+---------+
| pk | fname     | ftype   |
+----+-----------+---------+
|  3 | song.mp3  | AUDIO   |
|  2 | movie.mp4 | VIDEO   |
|  1 | doc.txt   | TEXT    |
|  4 | arch.zip  | ARCHIVE |
+----+-----------+---------+
MariaDB [test]> select * from t order by natural_sort_key(ftype);
+----+-----------+---------+
| pk | fname     | ftype   |
+----+-----------+---------+
|  4 | arch.zip  | ARCHIVE |
|  3 | song.mp3  | AUDIO   |
|  1 | doc.txt   | TEXT    |
|  2 | movie.mp4 | VIDEO   |
+----+-----------+---------+

On the other hand, for types like INET6 it is unlikely to be useful:

create or replace table t (pk int auto_increment primary key, i inet6);
insert into t (i) values ('2001::10'),('::'),('ffff:ffff::ffff');
MariaDB [test]> select * from t order by i;
+----+-----------------+
| pk | i               |
+----+-----------------+
|  2 | ::              |
|  1 | 2001::10        |
|  3 | ffff:ffff::ffff |
+----+-----------------+
MariaDB [test]> select * from t order by natural_sort_key(i);
+----+-----------------+
| pk | i               |
+----+-----------------+
|  1 | 2001::10        |
|  2 | ::              |
|  3 | ffff:ffff::ffff |
+----+-----------------+

While it is technically correct, it's not very meaningful in relation to IPv6 addresses.

Comment by Elena Stepanova [ 2021-10-11 ]

By current design of MDEV-4742, leading zeros are ignored, so, depending on the data, the order can look strange. One way to mitigate it, at least in certain cases, is ordering by both natural_sort_key and the column itself. For example:

Unordered data

MariaDB [db]> select a from t;
+-------+
| a     |
+-------+
| a.01  |
| a.1   |
| a.001 |
| a.01b |
| a.10  |
| a.01  |
| a.1   |
| a.001 |
| a.01b |
| a.10  |
+-------+
10 rows in set (0.001 sec)

ORDER BY column value

MariaDB [db]> select a from t order by a;
+-------+
| a     |
+-------+
| a.001 |
| a.001 |
| a.01  |
| a.01  |
| a.01b |
| a.01b |
| a.1   |
| a.1   |
| a.10  |
| a.10  |
+-------+
10 rows in set (0.000 sec)

Pure "natural sort"

MariaDB [db]> select a from t order by natural_sort_key(a);
+-------+
| a     |
+-------+
| a.01  |
| a.001 |
| a.1   |
| a.01  |
| a.001 |
| a.1   |
| a.01b |
| a.01b |
| a.10  |
| a.10  |
+-------+
10 rows in set (0.000 sec)

ORDER BY both, probably what most would expect as natural sort

MariaDB [db]> select a from t order by natural_sort_key(a), a;
+-------+
| a     |
+-------+
| a.001 |
| a.001 |
| a.01  |
| a.01  |
| a.1   |
| a.1   |
| a.01b |
| a.01b |
| a.10  |
| a.10  |
+-------+
10 rows in set (0.000 sec)

Comment by Vladislav Vaintroub [ 2021-10-11 ]

The order looks strange, because you expect numbers to be sorted as fractions. The fractions are not handled, so the order is just not deterministic, in presence of leading zeros.
0.07 > 0.5 would be probably not an expected order, but this will sort like this, even if you break the ties by adding the column itself. Having said this, in existing implementations of "natural sort", both Windows' StrCompareLogicalW, and PHP natsort would sort more leading zeros before less leading zeros. ICU would not. Python would not either

Comment by Elena Stepanova [ 2021-10-11 ]

The order looks strange, because you expect numbers to be sorted as fractions

No, it is not that at all. I have a problem with natural sort and fractions, but it doesn't apply here. I put a period because it's easier for me to look at, but it's not important for the subject. Let's have it this way:

MariaDB [test]> select a from t order by a;
+------+
| a    |
+------+
| a001 |
| a001 |
| a01  |
| a01  |
| a01b |
| a01b |
| a1   |
| a1   |
| a10  |
| a10  |
+------+
10 rows in set (0.001 sec)

The result above is our usual "unnatural" sort. There is nothing special about it.

MariaDB [test]> select a from t order by natural_sort_key(a);
+------+
| a    |
+------+
| a01  |
| a001 |
| a1   |
| a01  |
| a001 |
| a1   |
| a01b |
| a01b |
| a10  |
| a10  |
+------+
10 rows in set (0.002 sec)

The result above, technically "natural sort", still looks strange. It is strange not because it resembles fractions, but because it is not ordered at all within each equality group, e.g. the sequence a01 a001 a1 is not ordered. It is by design, because for all of them the functon value is a01, so the order can be random.

This is what I suggest as a workaround for those who needs it ordered.

MariaDB [test]> select a from t order by natural_sort_key(a), a;
+------+
| a    |
+------+
| a001 |
| a001 |
| a01  |
| a01  |
| a1   |
| a1   |
| a01b |
| a01b |
| a10  |
| a10  |
+------+
10 rows in set (0.002 sec)

Comment by Ian Gilfillan [ 2021-10-19 ]

elenst I don't think you mean MDEV-2458 in the description?

Comment by Elena Stepanova [ 2021-10-19 ]

Sorry, typo, fixed. MDEV-24582 .

Generated at Thu Feb 08 09:47:56 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.