Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4742

A function for native natural sorting / natural compare

Details

    Description

      Hi guys, could be nice a natural sorting inside mariadb
      today some guys use SQL procedures/functions to execute this cpu intensive task
      php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?

      functions:
      1) natural sort, used in ORDER BY / GROUP BY
      2) natural sort compare, could be used as an operator, like the "SOUNDS LIKE" operator, in other words, the function that return a canonical form of the string could be used to compare, something that could be rewrite "field NATURAL LIKE value" to "natual(field)=natural(value)"

      i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly
      --------------
      example of (2):
      http://www.php.net/manual/en/function.strnatcmp.php
      https://github.com/php/php-src/blob/master/ext/standard/strnatcmp.c
      --------------

      example of (1)
      Here a example in STORED PROCEDURE/FUNCTIONS for ORDER BY, but not exaust tested:
      source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql

      DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
      DELIMITER ;;
      CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000)) 
      RETURNS int
      LANGUAGE SQL
      DETERMINISTIC
      NO SQL
      SQL SECURITY INVOKER
      BEGIN
          DECLARE position int;
          DECLARE tmp_position int;
          SET position = 5000;
          SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF; 
          SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
       
          IF (position = 5000) THEN RETURN 0; END IF;
          RETURN position;
      END
      ;;
       
      DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
      DELIMITER ;;
      CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
      RETURNS varchar(4000)
      LANGUAGE SQL
      DETERMINISTIC
      NO SQL
      SQL SECURITY INVOKER
      BEGIN
          DECLARE sortString varchar(4000);
          DECLARE numStartIndex int;
          DECLARE numEndIndex int;
          DECLARE padLength int;
          DECLARE totalPadLength int;
          DECLARE i int;
          DECLARE sameOrderCharsLen int;
       
          SET totalPadLength = 0;
          SET instring = TRIM(instring);
          SET sortString = instring;
          SET numStartIndex = udf_FirstNumberPos(instring);
          SET numEndIndex = 0;
          SET i = 1;
          SET sameOrderCharsLen = LENGTH(sameOrderChars);
       
          WHILE (i <= sameOrderCharsLen) DO
              SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
              SET i = i + 1;
          END WHILE;
       
          WHILE (numStartIndex <> 0) DO
              SET numStartIndex = numStartIndex + numEndIndex;
              SET numEndIndex = numStartIndex;
       
              WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                  SET numEndIndex = numEndIndex + 1;
              END WHILE;
       
              SET numEndIndex = numEndIndex - 1;
       
              SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
       
              IF padLength < 0 THEN
                  SET padLength = 0;
              END IF;
       
              SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
       
              SET totalPadLength = totalPadLength + padLength;
              SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
          END WHILE;
       
          RETURN sortString;
      END
      ;;

      Attachments

        Issue Links

          Activity

            While I understand the intention behind it and thus don't want to file a bug, I think there is a distinct possibility that some users will find the natural sort of float numbers rather unnatural.
            Especially when the returned values are FLOAT, and the client aligns them as such, it is quite difficult to see the result as natural.

            MariaDB [test]> create or replace table t (f float);
            Query OK, 0 rows affected (0.055 sec)
             
            MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044);
            Query OK, 3 rows affected (0.013 sec)
            Records: 3  Duplicates: 0  Warnings: 0
             
            MariaDB [test]> select * from t order by natural_sort_key(f);
            +-------+
            | f     |
            +-------+
            |   0.5 |
            |  0.33 |
            | 0.044 |
            +-------+
            3 rows in set (0.001 sec)
            

            It is slightly better when the values are actually strings, at least they are formatted as such, and with some effort one can see the logic:

            MariaDB [test]> create or replace table t (f varchar(16));
            Query OK, 0 rows affected (0.053 sec)
             
            MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044);
            Query OK, 3 rows affected (0.012 sec)
            Records: 3  Duplicates: 0  Warnings: 0
             
            MariaDB [test]> select * from t order by natural_sort_key(f);
            +-------+
            | f     |
            +-------+
            | 0.5   |
            | 0.33  |
            | 0.044 |
            +-------+
            3 rows in set (0.001 sec)
            

            Same for negative numbers.

            I wonder what existing best practices say about it, whether the natural sort, when applied to numbers, should consider the numeric order being natural.

            elenst Elena Stepanova added a comment - While I understand the intention behind it and thus don't want to file a bug, I think there is a distinct possibility that some users will find the natural sort of float numbers rather unnatural. Especially when the returned values are FLOAT, and the client aligns them as such, it is quite difficult to see the result as natural. MariaDB [test]> create or replace table t (f float ); Query OK, 0 rows affected (0.055 sec)   MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044); Query OK, 3 rows affected (0.013 sec) Records: 3 Duplicates: 0 Warnings: 0   MariaDB [test]> select * from t order by natural_sort_key(f); + -------+ | f | + -------+ | 0.5 | | 0.33 | | 0.044 | + -------+ 3 rows in set (0.001 sec) It is slightly better when the values are actually strings, at least they are formatted as such, and with some effort one can see the logic: MariaDB [test]> create or replace table t (f varchar (16)); Query OK, 0 rows affected (0.053 sec)   MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044); Query OK, 3 rows affected (0.012 sec) Records: 3 Duplicates: 0 Warnings: 0   MariaDB [test]> select * from t order by natural_sort_key(f); + -------+ | f | + -------+ | 0.5 | | 0.33 | | 0.044 | + -------+ 3 rows in set (0.001 sec) Same for negative numbers. I wonder what existing best practices say about it, whether the natural sort, when applied to numbers , should consider the numeric order being natural.
            wlad Vladislav Vaintroub added a comment - - edited

            It is certainly possible to design a transformation scheme for fractions, and negatives, it is just not done currently. Some of that, sorting leading zeros in predictable way, and fractions( with either dot or comma as separators) were in some earlier versions of the patch, and were taken out, for the sake of simplicity.

            We currently sort like ICU - positive numbers only, no fractions, ignore leading zeros, no other customizations. We sort slightly better than ICU in one aspect, because they have some in-built limitations about how long a numeric string can be, 256 digits or something like that, I don't recall exactly. We sort slightly worse than ICU in i18n department, because we only consider 0-9 to be digits, and they handle more decimal digits from different scripts.

            What you can be interested in , Python's natsort has some customizations to sort, and allegedly handles fractions, and negatives , when told to handle it, via parameter https://natsort.readthedocs.io/en/master/api.html#the-ns-enum

            But there is no best practice, as far as I understand, and no common definition of natural sort, except that "we intermix strings and numbers, and expect numbers to be sorted as numbers, not ASCII".

            There are a couple of aspects when people want to sort strings as numbers

            • i18n decimal numbers : 0-9 vs decimal numbers in non-latin scripts
            • fractions separators- dot vs comma vs whatever is used in some arabic scripts. Sometimes both dot and comma are used, e.g in Switzerland.
            • fractions - whether to support scientific notation with exponent ?
            • thousands separators - 20'000'000 - is this the same as 20000000? Is 20 000 000 the same as 20000000,00? Those are tricky, as they are can be locale dependent, and alternating, with comma and dot, but do not have to, and people would just separate with space.
            • negatives
            • decimal numbers - should they sort as-is ( bigger than punctuation, less that letters), or less than anything else, including punctuations, or larger than anything else? This is related to script ordering in UCA algorithm.
            wlad Vladislav Vaintroub added a comment - - edited It is certainly possible to design a transformation scheme for fractions, and negatives, it is just not done currently. Some of that, sorting leading zeros in predictable way, and fractions( with either dot or comma as separators) were in some earlier versions of the patch, and were taken out, for the sake of simplicity. We currently sort like ICU - positive numbers only, no fractions, ignore leading zeros, no other customizations. We sort slightly better than ICU in one aspect, because they have some in-built limitations about how long a numeric string can be, 256 digits or something like that, I don't recall exactly. We sort slightly worse than ICU in i18n department, because we only consider 0-9 to be digits, and they handle more decimal digits from different scripts. What you can be interested in , Python's natsort has some customizations to sort, and allegedly handles fractions, and negatives , when told to handle it, via parameter https://natsort.readthedocs.io/en/master/api.html#the-ns-enum But there is no best practice, as far as I understand, and no common definition of natural sort, except that "we intermix strings and numbers, and expect numbers to be sorted as numbers, not ASCII". There are a couple of aspects when people want to sort strings as numbers i18n decimal numbers : 0-9 vs decimal numbers in non-latin scripts fractions separators- dot vs comma vs whatever is used in some arabic scripts. Sometimes both dot and comma are used, e.g in Switzerland. fractions - whether to support scientific notation with exponent ? thousands separators - 20'000'000 - is this the same as 20000000? Is 20 000 000 the same as 20000000,00? Those are tricky, as they are can be locale dependent, and alternating, with comma and dot, but do not have to, and people would just separate with space. negatives decimal numbers - should they sort as-is ( bigger than punctuation, less that letters), or less than anything else, including punctuations, or larger than anything else? This is related to script ordering in UCA algorithm.

            And of course, natural sort sorts strings. Anything can be converted to string, but it does not mean people should be trying to convert numbers to string, and to natural_sort_key, to sort numbers.

            wlad Vladislav Vaintroub added a comment - And of course, natural sort sorts strings. Anything can be converted to string, but it does not mean people should be trying to convert numbers to string, and to natural_sort_key, to sort numbers.
            elenst Elena Stepanova added a comment - - edited

            As far as testing is concerned, it can be pushed into 10.7 (as of preview-10.7-MDEV-4742-natural-sort cfec49f1dd7241582f03cbcd395e65a9c2ecf4ce).

            Documentation in the KB needs to be updated/enhanced, but it shouldn't be an obstacle to pushing into main.

            As explained earlier in the comments, some decisions remain a subject to further consideration based on users' feedback.

            elenst Elena Stepanova added a comment - - edited As far as testing is concerned, it can be pushed into 10.7 (as of preview-10.7- MDEV-4742 -natural-sort cfec49f1dd7241582f03cbcd395e65a9c2ecf4ce). Documentation in the KB needs to be updated/enhanced, but it shouldn't be an obstacle to pushing into main. As explained earlier in the comments, some decisions remain a subject to further consideration based on users' feedback.
            Dean T Dean Trower added a comment - - edited

            I wrote a nat-sort function for MariaDB/MySQL a few years ago, documented here: https://stackoverflow.com/a/58154535/999120

            It produces a key strings based on an input string, such that sorting by the key string results in a natural sort. They key strings can be stored for re-use, making the process efficient.

            I think it might have more "features" than the one you describe above: It handles negatives (though this could be improved), leading zeros, leading +, version numbers or IP addresses, decimals, and use of commas as thousand separators, and also preserves total sort ordering (i.e. no two different strings nat-sort as equivalent).

            Feel free to borrow it, if you like.

            Dean T Dean Trower added a comment - - edited I wrote a nat-sort function for MariaDB/MySQL a few years ago, documented here: https://stackoverflow.com/a/58154535/999120 It produces a key strings based on an input string, such that sorting by the key string results in a natural sort. They key strings can be stored for re-use, making the process efficient. I think it might have more "features" than the one you describe above: It handles negatives (though this could be improved), leading zeros, leading +, version numbers or IP addresses, decimals, and use of commas as thousand separators, and also preserves total sort ordering (i.e. no two different strings nat-sort as equivalent). Feel free to borrow it, if you like.

            People

              wlad Vladislav Vaintroub
              rspadim roberto spadim
              Votes:
              2 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.