Details
-
Task
-
Status: Closed (View Workflow)
-
Critical
-
Resolution: Fixed
Description
Hi guys, could be nice a natural sorting inside mariadb
today some guys use SQL procedures/functions to execute this cpu intensive task
php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?
functions:
1) natural sort, used in ORDER BY / GROUP BY
2) natural sort compare, could be used as an operator, like the "SOUNDS LIKE" operator, in other words, the function that return a canonical form of the string could be used to compare, something that could be rewrite "field NATURAL LIKE value" to "natual(field)=natural(value)"
i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly
--------------
example of (2):
http://www.php.net/manual/en/function.strnatcmp.php
https://github.com/php/php-src/blob/master/ext/standard/strnatcmp.c
--------------
example of (1)
Here a example in STORED PROCEDURE/FUNCTIONS for ORDER BY, but not exaust tested:
source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql
DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
|
DELIMITER ;;
|
CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
|
RETURNS int
|
LANGUAGE SQL
|
DETERMINISTIC
|
NO SQL
|
SQL SECURITY INVOKER
|
BEGIN
|
DECLARE position int;
|
DECLARE tmp_position int;
|
SET position = 5000;
|
SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
|
|
IF (position = 5000) THEN RETURN 0; END IF;
|
RETURN position;
|
END
|
;;
|
|
DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
|
DELIMITER ;;
|
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
|
RETURNS varchar(4000)
|
LANGUAGE SQL
|
DETERMINISTIC
|
NO SQL
|
SQL SECURITY INVOKER
|
BEGIN
|
DECLARE sortString varchar(4000);
|
DECLARE numStartIndex int;
|
DECLARE numEndIndex int;
|
DECLARE padLength int;
|
DECLARE totalPadLength int;
|
DECLARE i int;
|
DECLARE sameOrderCharsLen int;
|
|
SET totalPadLength = 0;
|
SET instring = TRIM(instring);
|
SET sortString = instring;
|
SET numStartIndex = udf_FirstNumberPos(instring);
|
SET numEndIndex = 0;
|
SET i = 1;
|
SET sameOrderCharsLen = LENGTH(sameOrderChars);
|
|
WHILE (i <= sameOrderCharsLen) DO
|
SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
|
SET i = i + 1;
|
END WHILE;
|
|
WHILE (numStartIndex <> 0) DO
|
SET numStartIndex = numStartIndex + numEndIndex;
|
SET numEndIndex = numStartIndex;
|
|
WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
|
SET numEndIndex = numEndIndex + 1;
|
END WHILE;
|
|
SET numEndIndex = numEndIndex - 1;
|
|
SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
|
|
IF padLength < 0 THEN
|
SET padLength = 0;
|
END IF;
|
|
SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
|
|
SET totalPadLength = totalPadLength + padLength;
|
SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
|
END WHILE;
|
|
RETURN sortString;
|
END
|
;;
|
Attachments
- mariadb_nat_comp.c
- 13 kB
- num.01.diff
- 7 kB
Issue Links
- causes
-
MDEV-26786 Inserting NULL into base column breaks NATURAL_SORT_KEY column
-
- Closed
-
-
MDEV-26787 Document that NATURAL_SORT_KEY virtual column should be longer than the base column (and other notes)
-
- Closed
-
-
MDEV-26796 Natural sort does not work for utf32/utf16/ucs2
-
- Closed
-
-
MDEV-26806 Server crash in Charset::charset / Item_func_natural_sort_key::val_str
-
- Closed
-
- relates to
-
MDEV-23400 Add UCA case sensitive accent sensitive collations for Unicode character sets
-
- Closed
-
- links to
Activity
i done some job, as a start point...
natstring = canonical string (can use in order by natstring(some_string))
what need to be changed: the number is rewrite as a string with zeropad, but it's a fixed size (20), the size could be a optional parameter example: natstring("aasdfasdf",10), when number is bigger than 10, it's just copied not rewrite with zeropad)
what this function should do:
natstring("hello 1 person in 2 tables",13);
return:
"hello 0000000000001 person in 0000000000002 tables"
—
how to rewrite numbers?
1) 1 vs 01 (today this function do)
-> the first idea is use bigint precision 20 digits => "18446744073709551615"/"+-9223372036854775807"
"1" and "01" = "00000000000000000001"
2) decimal: 0.1 (today this function don't do in the right way)
using the same idea of (1):
0.1 = "0"."1" => "00000000000000000000.00000000000000000001" (today it do, but it's not right...)
or maybe understand that's as a fraction:
0.1 = "00000000000000000000.10000000000000000000" (today it don't do, that's the right way)
3) 10e-1 vs 0.1
that's a problem since 10e-1 and 0.1 is the same value,
but when you read 10e-1 you don't think about 0.1, i'm right?
maybe we can use (2) here
-> rewrite to 00000000000000000010e-00000000000000000001
00000000000000000010e-00000000000000000001 vs 00000000000000000000.00000000000000000001
strnatcasecmp('10e-1','0.1') => return "1" (right)
strnatcmp('10e-1','0.1') => return "1" (right)
4) what about numbers with more than 20 digits? example:
PI = "3.1415926535897932384626433832795028841971693993751"
rewrite with length = ceil(size/20)*20 ?
no! just copy! (today this function do this part)
5)leading zeros (today it do)
"0000000000000000000000000000000000000000000000000000000001"
must be rewrite to "1" => "00000000000000000001"
6) special numbers (version and others kinds), example: (it's part of (2) problem)
'1.1.1' should be understood as:
integer.integer.integer instead of
integer.fraction.integer
1.1.1 (especial) = 00000000000000000001.00000000000000000001.00000000000000000001
1.1 (not special) = 00000000000000000001.10000000000000000000
check that if second or third number is bigger than 20 (max number size) it's just copied, but it's not a fraction, it's just a integer
fraction = number+'.'+number + (next pos!='.' and next next pos is not a digit)
especial = number+'.'number'.'+number... (end when don't find "."+digit)
examples
6.1)"1.1.1" => number+'.'number'.'+number + end of string (!='.'+digit) = special
6.2)"1.1.1." => "1.1.1" (special) + "."
number+'.'number'.'number (!='.'+digit) [non special part= '.']
6.3)"1.1.1.a" => "1.1.1" (special) + ".a"
number+'.'number'.'number (!='.'+digit) [non special part = '.a']
i'm re thinking about the source code of canonical form (natural string), the first scratch was very ugly
the easier way is:
find first digit, find where the number ends
- number= start with [0-9], end with [0-9], may have [0-9] in middle, may have one "." before and after two digits
copy previous string without double space
write the number from start digit to end rewriting with the zero padding, and espacial cases / fraction
find next number or copy the end string removing the double spaces
end
here's a more compact "sorting string" encoding approach:
- split the source string into sequences of digits and non-digits. Like perl's split /(\d+)/.
- every non-digit chunk is put into the encoded string as is
- every number (a sequence of digits) is stored as a 64-bit number, bigendian
- alternatively, could be some variable-length encoding, that can be memcmp-ed
- put one zero byte between numbers and non-numbers.
this won't work correctly for strings with embedded zero bytes. We can say that there's nothing natural in these strings
and this won't work if numbers won't fit in 64 bit. variable-length encoding will solve it, though
example of a variable-length encoding for numbers:
- first byte: number of following bytes.
- then the number, bigendian, without leading zero bytes.
for example:
- 1234567890 (that is 0x00000000499602D2) will be 0x4499602D2
- 5 will be 0x105
- 987654321987654321987654321 (0x330F7F01403F94EDB1812B1) will be 0x0C0330F7F01403F94EDB1812B1
this can handle numbers that are longer than 64 bit
Another simplification, to avoid arbitrary length numbers (like 0x0C0330F7F01403F94EDB1812B1 above), it'd be easier to use some variant of BCD encoding. For example, 987654321987654321987654321 could be 0x0E0987654321987654321987654321, which is 2 bytes longer that a true binary representation, but much easier to calculate
not exactly clear how this mix of binary and text data will work with different charsets
we could normalize the charset to be utf8mb4 for the input string, before creating a key first.
However, on things that bugs me is that sorting on strings is ASCIIbetical in Martin Pool implementation,
I thought if we wanted natural sort talking about strings parts, it should be something like collating key, produced according to the UCA algorithm.
Numeric sorting is a part of the Unicode collation algorithm:
Search for the numericOrdering string in https://unicode.org/reports/tr35/tr35-collation.html
Here's a relevant quote:
If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UAX44]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123".
I think we should extend our Unicode collation algorithm implementation to support the [numericOrdering on] collation option.
I think implementing a collation option should be better than implementing a function:
- It should provide a better performance.
- It's good to implement more of the Unicode standard.
- It should be doable with low efforts.
- It will natively (i.e. without virtual columns) support all advantages of a collation: indexes, partitioning, hash based HEAP tables, etc.
The attached patch num.01.diff contains a preliminary working prototype.
It currently has some limitations, but it should be relatively easy to address them:
- For now it supports only ASCII digits. It should support all Unicode digits from the "Nd" category, e.g.:
1041;MYANMAR DIGIT ONE;Nd;0;L;;1;1;1;N;;;;;
- For now it works fine only for numbers fitting into uint32. It should support digit sequences with arbitrary lengths.
- For now it's hard coded into utf8_myanmar_ci, just for test purposes. It should allow to add collations with the enabled numericOrdering attribute through share/charsets/Index.xml (and in long terms through the CREATE COLLATION statement - when it's implemented).
Here's an example query output:
SET NAMES utf8 COLLATE utf8_myanmar_ci; |
SELECT * FROM (VALUES ('a1'),('a2'),('a100'),('a20000'),('a9'),('a99')) t ORDER BY 1; |
+--------+
|
| a1 |
|
+--------+
|
| a1 |
|
| a2 |
|
| a9 |
|
| a99 |
|
| a100 |
|
| a20000 |
|
+--------+
|
6 rows in set (0.001 sec)
|
Another example, with the underlying sorting weights:
SET NAMES utf8 COLLATE utf8_myanmar_ci; |
SELECT a1, HEX(WEIGHT_STRING(a1)) FROM (VALUES ('a1'),('a2'),('a100'),('a20000'),('a9'),('a99')) t ORDER BY 1; |
+--------+--------------------------+
|
| a1 | HEX(WEIGHT_STRING(a1)) |
|
+--------+--------------------------+
|
| a1 | 120F12050201020002000201 |
|
| a2 | 120F12050201020002000202 |
|
| a9 | 120F12050201020002000209 |
|
| a99 | 120F12050202020002000263 |
|
| a100 | 120F12050203020002000264 |
|
| a20000 | 120F12050205020002090820 |
|
+--------+--------------------------+
|
Realistically, it could be utf8_numeric_ci or utf8_generic_ci_numeric that does it. Using utf8_generic_ci rules for non-digits.
Could use prefix encoding as above to support arbitrary-length numbers.
Benefits:
- avoids materializing of the weight string for sorting, so might be faster than a function.
- logically string comparison belongs in a collation, so it's a more natural approach
- can be used in indexes easier (if natural_sort() is a function, it can be indexed too, but needs a virtual column)
Drawbacks:
- _numeric variant has to be created per collation. Here we'll do only utf8_generic_ci_numeric, but users might later request utf8_czech_ci_numeric, etc.
- every collation takes a collation ID out of the pool
- materializing the string and then memcmp-ing it might be faster than strnncoll-ing every time.
Effort-wise I think both approaches are about the same.
I think I found I relatively good scheme to encode numbers, so that a valid string remains a valid string, i.e without BCD, and with a relatively small overhead. Moreover, numeric strings remain numeric strings, so that it is not possible to mix up numbers and strings, that contain no digits
I imagine it like this
- numeric strings up to the length of 10 will get a single digit prefix, '0' + len -1
- numeric strings that are larger, will be split in chunks of 10, or smaller than 10, each encoded like above, with single digit prefix.
Examples
Calculate key('9') . prefix = '0' + len('9') -1 = '0' , thus key('9') = '09'
Calculate key('19') prefix = '1', key = '19'
Note , while '9' > '19 lexicographically ', the key('9') < key('19'), which is he desired property.
Calculate key('12345678901') , gives 2 chunks, of length 10 and 1
partial key('1234567890') = '91234567890' (adds '9' as prefix), partial key('1') = '01',
full key = '9123456789001'
I'd think that for realistic numbers, this encoding is compact enough - it only adds 1 extra char
for numbers < 10 billions, and encoding form is still a string in the original charset (which can be converted to UTF8, and also given COLLATE to sort whatever user wants)
serg, any comments on this schema
Oh, I missed the bar's reply to that, and @serg's. A unicode collation is kinda attractive too, although I would only implement that for case-sensitive, accent-sensitive UCA collation that does not yet exist, because primary application for this feature in my understanding is sorting, rather than anything else (e.g searching). And sorting is best when it is unambiguous, while currently everything Unicode, except _bin collation is ambiguous (and we know that _bin is anything but natural)
I agree with wlad, we should add accent-sensitive case-sensitive UCA collations.
For now, we have only an accent-sensitive case-INsensitive UCA collation: utf8mb3_thai_520_w2
Note, it can be used to provide unambiguous sorting order in combination with numericOrdering-enabled collations, like this:
SET NAMES utf8; |
CREATE OR REPLACE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_myanmar_ci); |
INSERT INTO t1 VALUES ('a1'),('ä1'),('Ä1'),('Å1'),('å1'),('a2'),('a100'),('a20000'),('a9'),('a99'); |
SELECT a, HEX(WEIGHT_STRING(a)) FROM t1 |
ORDER BY |
a,
|
WEIGHT_STRING(a COLLATE utf8mb3_thai_520_w2 LEVEL 2), |
BINARY a; |
+--------+--------------------------+
|
| a | HEX(WEIGHT_STRING(a)) |
|
+--------+--------------------------+
|
| a1 | 120F12050201020002000201 |
|
| Ã…1 | 120F12050201020002000201 |
|
| å1 | 120F12050201020002000201 |
|
| Ä1 | 120F12050201020002000201 |
|
| ä1 | 120F12050201020002000201 |
|
| a2 | 120F12050201020002000202 |
|
| a9 | 120F12050201020002000209 |
|
| a99 | 120F12050202020002000263 |
|
| a100 | 120F12050203020002000264 |
|
| a20000 | 120F12050205020002090820 |
|
+--------+--------------------------+
|
Here the ORDER BY has three parts:
- The a part orders records according to the base letter (and handles numbers at the same time)
- The WEIGHT_STRING(a COLLATE utf8mb3_thai_520_w2 LEVEL 2) part makes accent order unambiguous.
- The BINARY a part makes case order unambiguous.
Hello serg. I agree, creating built-in numericOrderding=on variants for all existing regular collations (and thus spending a lot of available IDs in the pool) is a problem.
I propose we don't do that. Some ideas how to avoid this:
- We can allow to create custom collations with the numericOrdering=on attribute in share/charsets/Index.xml. So every user can decide which collation needs a numericOrdering-enabled counterpart.
- We can also add the SQL standard CREATE COLLATE statement support. It should not be too complex.
- We can allow customizing collations in the context where we don't really need fixed collations IDs. For example:
SELECT * FROM t1 ORDER BY a COLLATE `utf8_czech_ci@numericOrdering=on`;
In this statement, we can create a temporary customized collation instance (based on Czech, by with numericOrdering enabled) which life cycle will end as soon as the query ends.
- Also we can store customized collations in FRM files in text format (like we recently allowed user defined datatype names), so these statements will also work:
CREATE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE `utf8_czech_ci@numericOrdering=on`);
Some changes in replication will also be needed.
So I really prefer the numeric sorting to be implemented as a UCA collation option.
It will be immediately usable without adding many built-in collations on one hand. And it will indirectly prioritize these tasks on the other hand:
- CREATE COLLATION
- Accent-sensitive case-sensitive collations
which I think are overdue.
It will also support this approach you mentioned:
materializing the string and then memcmp-ing it might be faster than strnncoll-ing every time.
with help of the WEIGHT_STRING() function. So one can decide which of these two equivalent queries to use:
SELECT * FROM t1 ORDER BY a COLLATE utf8_czech_ci_numeric; |
SELECT * FROM t1 ORDER BY WEIGHT_STRING(a COLLATE utf8_czech_ci_numeric); |
The former works without materializing, the latter works with materializing.
To tell the truth, the numericOrdering in Unicode is described a little bit sketchy,UTS#10, the collation paper, does not mention how to do it
Is there anything yet that implements it? The i18 "golden standard", ICU, do they allow that? bar ?
To be honest, I think a string to string transformation with a function should be OK, an I also think that natural_sort is rather exotic, so people probably can live with virtual functions by now. It will be immediately usable, at the tiny cost of virtual functions, if one wants an index to be used in ORDER BY
Not saying that supporting that in collations is too bad, but I think the overhead of doing that is perhaps too much given the scope of that MDEV.
People won't be willing to create their own collations, I doubt that very much.
Storing collation attributes (locale, weight level or, accent and case sensitivity, pad-nopad) in the FRM is a good thing, fully independent from numericOrdering.
Also, a function can be adopted and given parameter to recognize and handle negatives, whether to remove leading zeroes or not, or handle decimal fractions (with a user-defined decimal point and thousands separator), and this will be too much to put into numericOrdering only.
wlad, yes ICU does support numeric sorting.
Note, PostgreSQL can use ICU collations, including collation customization:
DROP COLLATION IF EXISTS c1; |
CREATE COLLATION c1 (provider = icu, locale = 'en@colNumeric=yes;colAlternate=shifted'); |
DROP TABLE IF EXISTS t1; |
CREATE TABLE t1 (a VARCHAR(10) COLLATE c1); |
INSERT INTO t1 VALUES ('a1'),('a10'),('a9'); |
SELECT * FROM t1 ORDER BY a; |
a
|
-----
|
a1
|
a9
|
a10
|
(3 rows)
|
I think I found I relatively good scheme to encode numbers, so that a valid string remains a valid string, i.e without BCD, and with a relatively small overhead. Moreover, numeric strings remain numeric strings, so that it is not possible to mix up numbers and strings, that contain no digits
...
Sergei Golubchik, any comments on this schema
I wanted to avoid comparing numbers and non-numbers. That is, I tried to get the same effect as for comparing tuples.
Consider, for example, "abc-123.4.5alpha2" → ("abc-", 123, ".", 4, ".", 5, "alpha", 2). And comparing this "naturally" with, say, "abc-123.4.5.alphabeta3", first six elements in a tuple would be equal, but "alpha" < "alphabeta", so the first tuple and the first string is "naturally" smaller.
That's why I used "\0" as a separator. "alpha\0...2" is smaller than "alphabeta\0...3" no matter what bytes follow the zero byte.
I think binary data isn't much of a problem. The NATURAL_SORT_KEY() function could convert numbers to binary data, for example, BCD with a length prefix, and non-numbers — to their weights, like WEIGHT_STRING() function. And its result can then be binary, just a set of bytes that can be memcmp-ed.
serg, this is what I planned originally, but now I believe encoding string to string has its own advantages, it is compact (much more than WEIGHT_STRING + BCD), reversible (if we want to, we can provide NATURAL_SORT_DECODE() to ease the index-only access), and can be used with this or that COLLATE order
WEIGHT_STRING(NATURAL_SORT_KEY()) can be stored in index, too, if one really wants
serg, I do not think a string->string requires separators for digits.
Binary sortkeys, lke in your example, doe need separators. Though, you'll have to take care that actual sortkeys do not contain binary zeros, which is not given for either classical BCD or collation weights.
point is that natural sort solves the problem where group of digits at the same offset inside string are sorted alpha, rather then numeric. I do not think anyone complained about relative order or digits and non-digits (this is what separators solve), and I'm sure UCA did not put digits in the middle of any script, and into some useful position and this will already ensure the natural order of things. The relative order of digits vs scripts, punctuation,emojis, or Dingbats is preserved, in string transformation, and it should be fine.
If there is a real need to sort numbers smaller than anything, in then preprocessing string with e.g
|
REGEXP_REPLACE(s,'([0-9]+)','\t\\1')
|
will do that trick ( use \0 nstead of \t for _bin collations, the smallest character varies depending on collation)
The patch is in bb-10.7-MDEV-4742.
It is a string->string transformation, and there is an additional parameter natural_sort_key, for decimal separator, which can be e.g "." or "," , to sort fractions correctly.
There is no designated separator between digits and non-digits (just the digit rsp non-digit characters themselves). This should not cause problems, as all characters that are less than digits 0-9 are controls, punctuation, letter modifiers and similar non-letter stuff (that's about UCA, but also true for all ASCII based encodings)
A generic scheme for number encoding for natural sorting is to prepend the length-1 of the number to the number, and then prepend the length-1 of the length, and repeat it a fixed number of times. The leftmost prefix must be of a fixed length and should be encoded as decimal. Remaining prefixes should be encoded using radix > 10 to make them short.
Using two levels of encoding, a single digit for the first prefix and a hexadecimal notation for the second prefix will work for digit sequences up to the length of 1TB. Example prefixes:
- 1 digit: 00
- 16 digits: 0f
- 256 digits: 1ff
- 2^40^ digits: 9ffffffffff
I think prefixes of 2 bytes for small numbers are too much. For that task, I doubt there will be a lot of numbers longer than 9-10 digits.
Looks good, the only thing that I doesn't like is:
`c` varchar(30) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
|
- `NATURAL_SORT_KEY(c)` varchar(45) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL
|
+ `NATURAL_SORT_KEY(c)` varchar(60) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL |
that first patch had 50% overhead, now you went up to 2x.
May be you don't need to encode leading zeros when there aren't any? "1" -> "01", not "010"
also
+ if (out->append(buf, natsort_encode_length(n_digits - 1, buf)))
|
+ return NATSORT_ERR::ALLOC_ERROR; |
this is usually done like
out->reserve(natsort_max_key_size(n_digits));
|
out->length(out->length() + natsort_encode_length(n_digits-1, n->ptr()));
|
I agree that for a small buffer like yours here it matters little. But better to set a good example in case someone copies it and his buffer won't be small.
serg , we have to encode lead zeros count, even if there are none. Otherwise, "1/", "01" , "1a" will sort with "01" in the middle, even though "01" > "1". I added a test for it.
So I think maybe we should not change that just yet, unless we make a decision to ignore/remove all leading zeros (this would make keys not reversible, and sort unstable).
I also changed natsort_encode_numeric_string such that it won't allocate anything, but fail, if preallocated string can't accomodate the resulting key. We preallocate at an earlier stage, only once, and do not do reallocs afterwards.
I've done a little survey what other natsort implementations do w.r.t leading zeros sort, to figure out what to do with them, and here are the results
- Martin Pools strnatcmp, which is what PHP is using, sorts leading zeros before no-lead-zeros, "001" < "01" < "1". Windows StrCmpLogicalW (used in the Explorer to sort filenames), as well as CompareStringEx with SORT_DIGITSASNUMBERS flag, sort the same
- ICU, with UCOL_NUMERIC_COLLATION, as well as Python natsort ignore leading zeros.
Ignoring leading zeros, like ICU does, somewhat surprisingly leads to more logical results.
While other libraries think that "a1b" > "a01c" (because 1 > 01), ICU begs to differ, and this kinda makes sense, because split into components , it is (a,1,b) vs (a,1,c) the difference between 1 and 01 seems less significant that than the difference between b and c.
With that, I think we should be ignoring lead zero differences, for the upcoming version. That makes keys smaller too, usually requiring only a single byte per number.
For stable sort, one could still "ORDER BY natural_sort_key(c), c" , this would give exactly "01" < "1" order.
As a side note, I do like the description of the older version of StrCompareLogicalW algorithm (now it does not work like this anymore).
A particular case is made for leading zeros in numbers. Of two numbers that have the same numerical value, the lesser for the comparison is whichever has the more leading zeros. However, this is acted on only if the two strings turn out to be otherwise equal. For example:
"a1b1" < "a01b2" < "a1b2" < "a01b3"
This one seems combine the strength of stable sort, and prioritizing significant over insignificant differences a la ICU. It seems to be inspired by Unicode collation algorithm, like a weight level for leading zeros.
Tested and published the [blog](https://mariadb.org/10-7-preview-feature-natural-sort/)
While I understand the intention behind it and thus don't want to file a bug, I think there is a distinct possibility that some users will find the natural sort of float numbers rather unnatural.
Especially when the returned values are FLOAT, and the client aligns them as such, it is quite difficult to see the result as natural.
MariaDB [test]> create or replace table t (f float); |
Query OK, 0 rows affected (0.055 sec) |
|
MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044); |
Query OK, 3 rows affected (0.013 sec) |
Records: 3 Duplicates: 0 Warnings: 0
|
|
MariaDB [test]> select * from t order by natural_sort_key(f); |
+-------+ |
| f |
|
+-------+ |
| 0.5 |
|
| 0.33 |
|
| 0.044 |
|
+-------+ |
3 rows in set (0.001 sec) |
It is slightly better when the values are actually strings, at least they are formatted as such, and with some effort one can see the logic:
MariaDB [test]> create or replace table t (f varchar(16)); |
Query OK, 0 rows affected (0.053 sec) |
|
MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044); |
Query OK, 3 rows affected (0.012 sec) |
Records: 3 Duplicates: 0 Warnings: 0
|
|
MariaDB [test]> select * from t order by natural_sort_key(f); |
+-------+ |
| f |
|
+-------+ |
| 0.5 |
|
| 0.33 |
|
| 0.044 |
|
+-------+ |
3 rows in set (0.001 sec) |
Same for negative numbers.
I wonder what existing best practices say about it, whether the natural sort, when applied to numbers, should consider the numeric order being natural.
It is certainly possible to design a transformation scheme for fractions, and negatives, it is just not done currently. Some of that, sorting leading zeros in predictable way, and fractions( with either dot or comma as separators) were in some earlier versions of the patch, and were taken out, for the sake of simplicity.
We currently sort like ICU - positive numbers only, no fractions, ignore leading zeros, no other customizations. We sort slightly better than ICU in one aspect, because they have some in-built limitations about how long a numeric string can be, 256 digits or something like that, I don't recall exactly. We sort slightly worse than ICU in i18n department, because we only consider 0-9 to be digits, and they handle more decimal digits from different scripts.
What you can be interested in , Python's natsort has some customizations to sort, and allegedly handles fractions, and negatives , when told to handle it, via parameter https://natsort.readthedocs.io/en/master/api.html#the-ns-enum
But there is no best practice, as far as I understand, and no common definition of natural sort, except that "we intermix strings and numbers, and expect numbers to be sorted as numbers, not ASCII".
There are a couple of aspects when people want to sort strings as numbers
- i18n decimal numbers : 0-9 vs decimal numbers in non-latin scripts
- fractions separators- dot vs comma vs whatever is used in some arabic scripts. Sometimes both dot and comma are used, e.g in Switzerland.
- fractions - whether to support scientific notation with exponent ?
- thousands separators - 20'000'000 - is this the same as 20000000? Is 20 000 000 the same as 20000000,00? Those are tricky, as they are can be locale dependent, and alternating, with comma and dot, but do not have to, and people would just separate with space.
- negatives
- decimal numbers - should they sort as-is ( bigger than punctuation, less that letters), or less than anything else, including punctuations, or larger than anything else? This is related to script ordering in UCA algorithm.
And of course, natural sort sorts strings. Anything can be converted to string, but it does not mean people should be trying to convert numbers to string, and to natural_sort_key, to sort numbers.
As far as testing is concerned, it can be pushed into 10.7 (as of preview-10.7-MDEV-4742-natural-sort cfec49f1dd7241582f03cbcd395e65a9c2ecf4ce).
Documentation in the KB needs to be updated/enhanced, but it shouldn't be an obstacle to pushing into main.
As explained earlier in the comments, some decisions remain a subject to further consideration based on users' feedback.
I wrote a nat-sort function for MariaDB/MySQL a few years ago, documented here: https://stackoverflow.com/a/58154535/999120
It produces a key strings based on an input string, such that sorting by the key string results in a natural sort. They key strings can be stored for re-use, making the process efficient.
I think it might have more "features" than the one you describe above: It handles negatives (though this could be improved), leading zeros, leading +, version numbers or IP addresses, decimals, and use of commas as thousand separators, and also preserves total sort ordering (i.e. no two different strings nat-sort as equivalent).
Feel free to borrow it, if you like.
a partial patch, todo:
natstring need fraction implementation
use collations
rewrite, it's too ugly now