[MDEV-4742] A function for native natural sorting / natural compare Created: 2013-07-01 Updated: 2023-03-21 Resolved: 2021-10-14 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Server |
| Fix Version/s: | 10.7.1 |
| Type: | Task | Priority: | Critical |
| Reporter: | roberto spadim | Assignee: | Vladislav Vaintroub |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | Preview_10.7 | ||
| Attachments: |
|
||||||||||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Description |
|
Hi guys, could be nice a natural sorting inside mariadb functions: i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly example of (1)
|
| Comments |
| Comment by roberto spadim [ 2013-07-01 ] | ||||||||||||||||||||||||||||||||
|
a partial patch, todo: rewrite, it's too ugly now | ||||||||||||||||||||||||||||||||
| Comment by roberto spadim [ 2013-07-01 ] | ||||||||||||||||||||||||||||||||
|
i done some job, as a start point... natstring = canonical string (can use in order by natstring(some_string)) what this function should do: natstring("hello 1 person in 2 tables",13); — 00000000000000000010e-00000000000000000001 vs 00000000000000000000.00000000000000000001 fraction = number+'.'+number + (next pos!='.' and next next pos is not a digit) | ||||||||||||||||||||||||||||||||
| Comment by roberto spadim [ 2013-07-02 ] | ||||||||||||||||||||||||||||||||
|
i'm re thinking about the source code of canonical form (natural string), the first scratch was very ugly the easier way is:
| ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-06-29 ] | ||||||||||||||||||||||||||||||||
|
here's a more compact "sorting string" encoding approach:
this won't work correctly for strings with embedded zero bytes. We can say that there's nothing natural in these strings | ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-06-29 ] | ||||||||||||||||||||||||||||||||
|
example of a variable-length encoding for numbers:
for example:
this can handle numbers that are longer than 64 bit | ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-06-30 ] | ||||||||||||||||||||||||||||||||
|
Another simplification, to avoid arbitrary length numbers (like 0x0C0330F7F01403F94EDB1812B1 above), it'd be easier to use some variant of BCD encoding. For example, 987654321987654321987654321 could be 0x0E0987654321987654321987654321, which is 2 bytes longer that a true binary representation, but much easier to calculate | ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-07-29 ] | ||||||||||||||||||||||||||||||||
|
not exactly clear how this mix of binary and text data will work with different charsets | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-07-29 ] | ||||||||||||||||||||||||||||||||
|
we could normalize the charset to be utf8mb4 for the input string, before creating a key first. | ||||||||||||||||||||||||||||||||
| Comment by Alexander Barkov [ 2021-07-30 ] | ||||||||||||||||||||||||||||||||
|
Numeric sorting is a part of the Unicode collation algorithm:
I think we should extend our Unicode collation algorithm implementation to support the [numericOrdering on] collation option. I think implementing a collation option should be better than implementing a function:
The attached patch num.01.diff contains a preliminary working prototype. It currently has some limitations, but it should be relatively easy to address them:
Here's an example query output:
| ||||||||||||||||||||||||||||||||
| Comment by Alexander Barkov [ 2021-07-30 ] | ||||||||||||||||||||||||||||||||
|
Another example, with the underlying sorting weights:
| ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-07-30 ] | ||||||||||||||||||||||||||||||||
|
Realistically, it could be utf8_numeric_ci or utf8_generic_ci_numeric that does it. Using utf8_generic_ci rules for non-digits. Could use prefix encoding as above to support arbitrary-length numbers. Benefits:
Drawbacks:
Effort-wise I think both approaches are about the same. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-07-30 ] | ||||||||||||||||||||||||||||||||
|
I think I found I relatively good scheme to encode numbers, so that a valid string remains a valid string, i.e without BCD, and with a relatively small overhead. Moreover, numeric strings remain numeric strings, so that it is not possible to mix up numbers and strings, that contain no digits I imagine it like this
Examples Note , while '9' > '19 lexicographically ', the key('9') < key('19'), which is he desired property. Calculate key('12345678901') , gives 2 chunks, of length 10 and 1 I'd think that for realistic numbers, this encoding is compact enough - it only adds 1 extra char serg, any comments on this schema | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-07-30 ] | ||||||||||||||||||||||||||||||||
|
Oh, I missed the bar's reply to that, and @serg's. A unicode collation is kinda attractive too, although I would only implement that for case-sensitive, accent-sensitive UCA collation that does not yet exist, because primary application for this feature in my understanding is sorting, rather than anything else (e.g searching). And sorting is best when it is unambiguous, while currently everything Unicode, except _bin collation is ambiguous (and we know that _bin is anything but natural) | ||||||||||||||||||||||||||||||||
| Comment by Alexander Barkov [ 2021-07-31 ] | ||||||||||||||||||||||||||||||||
|
I agree with wlad, we should add accent-sensitive case-sensitive UCA collations. For now, we have only an accent-sensitive case-INsensitive UCA collation: utf8mb3_thai_520_w2 Note, it can be used to provide unambiguous sorting order in combination with numericOrdering-enabled collations, like this:
Here the ORDER BY has three parts:
| ||||||||||||||||||||||||||||||||
| Comment by Alexander Barkov [ 2021-07-31 ] | ||||||||||||||||||||||||||||||||
|
Hello serg. I agree, creating built-in numericOrderding=on variants for all existing regular collations (and thus spending a lot of available IDs in the pool) is a problem. I propose we don't do that. Some ideas how to avoid this:
So I really prefer the numeric sorting to be implemented as a UCA collation option.
which I think are overdue. It will also support this approach you mentioned:
with help of the WEIGHT_STRING() function. So one can decide which of these two equivalent queries to use:
The former works without materializing, the latter works with materializing. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-07-31 ] | ||||||||||||||||||||||||||||||||
|
To tell the truth, the numericOrdering in Unicode is described a little bit sketchy,UTS#10, the collation paper, does not mention how to do it Is there anything yet that implements it? The i18 "golden standard", ICU, do they allow that? bar ? To be honest, I think a string to string transformation with a function should be OK, an I also think that natural_sort is rather exotic, so people probably can live with virtual functions by now. It will be immediately usable, at the tiny cost of virtual functions, if one wants an index to be used in ORDER BY Not saying that supporting that in collations is too bad, but I think the overhead of doing that is perhaps too much given the scope of that MDEV. People won't be willing to create their own collations, I doubt that very much. Also, a function can be adopted and given parameter to recognize and handle negatives, whether to remove leading zeroes or not, or handle decimal fractions (with a user-defined decimal point and thousands separator), and this will be too much to put into numericOrdering only. | ||||||||||||||||||||||||||||||||
| Comment by Alexander Barkov [ 2021-07-31 ] | ||||||||||||||||||||||||||||||||
|
wlad, yes ICU does support numeric sorting. Note, PostgreSQL can use ICU collations, including collation customization:
| ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-08-03 ] | ||||||||||||||||||||||||||||||||
I wanted to avoid comparing numbers and non-numbers. That is, I tried to get the same effect as for comparing tuples. Consider, for example, "abc-123.4.5alpha2" → ("abc-", 123, ".", 4, ".", 5, "alpha", 2). And comparing this "naturally" with, say, "abc-123.4.5.alphabeta3", first six elements in a tuple would be equal, but "alpha" < "alphabeta", so the first tuple and the first string is "naturally" smaller. That's why I used "\0" as a separator. "alpha\0...2" is smaller than "alphabeta\0...3" no matter what bytes follow the zero byte. | ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-08-03 ] | ||||||||||||||||||||||||||||||||
|
I think binary data isn't much of a problem. The NATURAL_SORT_KEY() function could convert numbers to binary data, for example, BCD with a length prefix, and non-numbers — to their weights, like WEIGHT_STRING() function. And its result can then be binary, just a set of bytes that can be memcmp-ed. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-08-03 ] | ||||||||||||||||||||||||||||||||
|
serg, this is what I planned originally, but now I believe encoding string to string has its own advantages, it is compact (much more than WEIGHT_STRING + BCD), reversible (if we want to, we can provide NATURAL_SORT_DECODE() to ease the index-only access), and can be used with this or that COLLATE order WEIGHT_STRING(NATURAL_SORT_KEY()) can be stored in index, too, if one really wants | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-08-14 ] | ||||||||||||||||||||||||||||||||
|
serg, I do not think a string->string requires separators for digits. Binary sortkeys, lke in your example, doe need separators. Though, you'll have to take care that actual sortkeys do not contain binary zeros, which is not given for either classical BCD or collation weights. point is that natural sort solves the problem where group of digits at the same offset inside string are sorted alpha, rather then numeric. I do not think anyone complained about relative order or digits and non-digits (this is what separators solve), and I'm sure UCA did not put digits in the middle of any script, and into some useful position and this will already ensure the natural order of things. The relative order of digits vs scripts, punctuation,emojis, or Dingbats is preserved, in string transformation, and it should be fine. If there is a real need to sort numbers smaller than anything, in then preprocessing string with e.g
will do that trick ( use \0 nstead of \t for _bin collations, the smallest character varies depending on collation) | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-08-16 ] | ||||||||||||||||||||||||||||||||
|
The patch is in bb-10.7- It is a string->string transformation, and there is an additional parameter natural_sort_key, for decimal separator, which can be e.g "." or "," , to sort fractions correctly. There is no designated separator between digits and non-digits (just the digit rsp non-digit characters themselves). This should not cause problems, as all characters that are less than digits 0-9 are controls, punctuation, letter modifiers and similar non-letter stuff (that's about UCA, but also true for all ASCII based encodings) | ||||||||||||||||||||||||||||||||
| Comment by Artur ZaprzaĆa [ 2021-08-24 ] | ||||||||||||||||||||||||||||||||
|
A generic scheme for number encoding for natural sorting is to prepend the length-1 of the number to the number, and then prepend the length-1 of the length, and repeat it a fixed number of times. The leftmost prefix must be of a fixed length and should be encoded as decimal. Remaining prefixes should be encoded using radix > 10 to make them short. Using two levels of encoding, a single digit for the first prefix and a hexadecimal notation for the second prefix will work for digit sequences up to the length of 1TB. Example prefixes:
| ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-08-24 ] | ||||||||||||||||||||||||||||||||
|
I think prefixes of 2 bytes for small numbers are too much. For that task, I doubt there will be a lot of numbers longer than 9-10 digits. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-08-26 ] | ||||||||||||||||||||||||||||||||
|
serg, bb-10.7- | ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-08-27 ] | ||||||||||||||||||||||||||||||||
|
Looks good, the only thing that I doesn't like is:
that first patch had 50% overhead, now you went up to 2x. May be you don't need to encode leading zeros when there aren't any? "1" -> "01", not "010" also
this is usually done like
I agree that for a small buffer like yours here it matters little. But better to set a good example in case someone copies it and his buffer won't be small. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-08-30 ] | ||||||||||||||||||||||||||||||||
|
serg , we have to encode lead zeros count, even if there are none. Otherwise, "1/", "01" , "1a" will sort with "01" in the middle, even though "01" > "1". I added a test for it. I also changed natsort_encode_numeric_string such that it won't allocate anything, but fail, if preallocated string can't accomodate the resulting key. We preallocate at an earlier stage, only once, and do not do reallocs afterwards. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-09-01 ] | ||||||||||||||||||||||||||||||||
|
I've done a little survey what other natsort implementations do w.r.t leading zeros sort, to figure out what to do with them, and here are the results
Ignoring leading zeros, like ICU does, somewhat surprisingly leads to more logical results. While other libraries think that "a1b" > "a01c" (because 1 > 01), ICU begs to differ, and this kinda makes sense, because split into components , it is (a,1,b) vs (a,1,c) the difference between 1 and 01 seems less significant that than the difference between b and c. With that, I think we should be ignoring lead zero differences, for the upcoming version. That makes keys smaller too, usually requiring only a single byte per number. For stable sort, one could still "ORDER BY natural_sort_key(c), c" , this would give exactly "01" < "1" order. As a side note, I do like the description of the older version of StrCompareLogicalW algorithm (now it does not work like this anymore).
This one seems combine the strength of stable sort, and prioritizing significant over insignificant differences a la ICU. It seems to be inspired by Unicode collation algorithm, like a weight level for leading zeros. | ||||||||||||||||||||||||||||||||
| Comment by Sergei Golubchik [ 2021-09-02 ] | ||||||||||||||||||||||||||||||||
|
the branch as of 64206db61af looks good, thanks! | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-09-02 ] | ||||||||||||||||||||||||||||||||
|
elenst, could you please test? | ||||||||||||||||||||||||||||||||
| Comment by Anel Husakovic [ 2021-09-22 ] | ||||||||||||||||||||||||||||||||
|
Tested and published the [blog](https://mariadb.org/10-7-preview-feature-natural-sort/) | ||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2021-10-07 ] | ||||||||||||||||||||||||||||||||
|
While I understand the intention behind it and thus don't want to file a bug, I think there is a distinct possibility that some users will find the natural sort of float numbers rather unnatural.
It is slightly better when the values are actually strings, at least they are formatted as such, and with some effort one can see the logic:
Same for negative numbers. I wonder what existing best practices say about it, whether the natural sort, when applied to numbers, should consider the numeric order being natural. | ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-10-07 ] | ||||||||||||||||||||||||||||||||
|
It is certainly possible to design a transformation scheme for fractions, and negatives, it is just not done currently. Some of that, sorting leading zeros in predictable way, and fractions( with either dot or comma as separators) were in some earlier versions of the patch, and were taken out, for the sake of simplicity. We currently sort like ICU - positive numbers only, no fractions, ignore leading zeros, no other customizations. We sort slightly better than ICU in one aspect, because they have some in-built limitations about how long a numeric string can be, 256 digits or something like that, I don't recall exactly. We sort slightly worse than ICU in i18n department, because we only consider 0-9 to be digits, and they handle more decimal digits from different scripts. What you can be interested in , Python's natsort has some customizations to sort, and allegedly handles fractions, and negatives , when told to handle it, via parameter https://natsort.readthedocs.io/en/master/api.html#the-ns-enum But there is no best practice, as far as I understand, and no common definition of natural sort, except that "we intermix strings and numbers, and expect numbers to be sorted as numbers, not ASCII". There are a couple of aspects when people want to sort strings as numbers
| ||||||||||||||||||||||||||||||||
| Comment by Vladislav Vaintroub [ 2021-10-07 ] | ||||||||||||||||||||||||||||||||
|
And of course, natural sort sorts strings. Anything can be converted to string, but it does not mean people should be trying to convert numbers to string, and to natural_sort_key, to sort numbers. | ||||||||||||||||||||||||||||||||
| Comment by Elena Stepanova [ 2021-10-13 ] | ||||||||||||||||||||||||||||||||
|
As far as testing is concerned, it can be pushed into 10.7 (as of preview-10.7- Documentation in the KB needs to be updated/enhanced, but it shouldn't be an obstacle to pushing into main. As explained earlier in the comments, some decisions remain a subject to further consideration based on users' feedback. | ||||||||||||||||||||||||||||||||
| Comment by Dean Trower [ 2022-03-30 ] | ||||||||||||||||||||||||||||||||
|
I wrote a nat-sort function for MariaDB/MySQL a few years ago, documented here: https://stackoverflow.com/a/58154535/999120 It produces a key strings based on an input string, such that sorting by the key string results in a natural sort. They key strings can be stored for re-use, making the process efficient. I think it might have more "features" than the one you describe above: It handles negatives (though this could be improved), leading zeros, leading +, version numbers or IP addresses, decimals, and use of commas as thousand separators, and also preserves total sort ordering (i.e. no two different strings nat-sort as equivalent). Feel free to borrow it, if you like. |