Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-4742

A function for native natural sorting / natural compare

Details

    Description

      Hi guys, could be nice a natural sorting inside mariadb
      today some guys use SQL procedures/functions to execute this cpu intensive task
      php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?

      functions:
      1) natural sort, used in ORDER BY / GROUP BY
      2) natural sort compare, could be used as an operator, like the "SOUNDS LIKE" operator, in other words, the function that return a canonical form of the string could be used to compare, something that could be rewrite "field NATURAL LIKE value" to "natual(field)=natural(value)"

      i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly
      --------------
      example of (2):
      http://www.php.net/manual/en/function.strnatcmp.php
      https://github.com/php/php-src/blob/master/ext/standard/strnatcmp.c
      --------------

      example of (1)
      Here a example in STORED PROCEDURE/FUNCTIONS for ORDER BY, but not exaust tested:
      source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql

      DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
      DELIMITER ;;
      CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000)) 
      RETURNS int
      LANGUAGE SQL
      DETERMINISTIC
      NO SQL
      SQL SECURITY INVOKER
      BEGIN
          DECLARE position int;
          DECLARE tmp_position int;
          SET position = 5000;
          SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF; 
          SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
          SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
       
          IF (position = 5000) THEN RETURN 0; END IF;
          RETURN position;
      END
      ;;
       
      DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
      DELIMITER ;;
      CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
      RETURNS varchar(4000)
      LANGUAGE SQL
      DETERMINISTIC
      NO SQL
      SQL SECURITY INVOKER
      BEGIN
          DECLARE sortString varchar(4000);
          DECLARE numStartIndex int;
          DECLARE numEndIndex int;
          DECLARE padLength int;
          DECLARE totalPadLength int;
          DECLARE i int;
          DECLARE sameOrderCharsLen int;
       
          SET totalPadLength = 0;
          SET instring = TRIM(instring);
          SET sortString = instring;
          SET numStartIndex = udf_FirstNumberPos(instring);
          SET numEndIndex = 0;
          SET i = 1;
          SET sameOrderCharsLen = LENGTH(sameOrderChars);
       
          WHILE (i <= sameOrderCharsLen) DO
              SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
              SET i = i + 1;
          END WHILE;
       
          WHILE (numStartIndex <> 0) DO
              SET numStartIndex = numStartIndex + numEndIndex;
              SET numEndIndex = numStartIndex;
       
              WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                  SET numEndIndex = numEndIndex + 1;
              END WHILE;
       
              SET numEndIndex = numEndIndex - 1;
       
              SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);
       
              IF padLength < 0 THEN
                  SET padLength = 0;
              END IF;
       
              SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));
       
              SET totalPadLength = totalPadLength + padLength;
              SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
          END WHILE;
       
          RETURN sortString;
      END
      ;;

      Attachments

        Issue Links

          Activity

            rspadim roberto spadim created issue -
            rspadim roberto spadim made changes -
            Field Original Value New Value
            rspadim roberto spadim made changes -
            Description Hi guys, could be nice a natural sorting inside mariadb
            today some guys use SQL procedures/functions to execute this cpu intensive task
            php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?
            Hi guys, could be nice a natural sorting inside mariadb
            today some guys use SQL procedures/functions to execute this cpu intensive task
            php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?


            Here a example in STORED PROCEDURE/FUNCTIONS, but not exaust tested:
            source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql
            {code}
            DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
            DELIMITER ;;
            CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
            RETURNS int
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE position int;
                DECLARE tmp_position int;
                SET position = 5000;
                SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;

                IF (position = 5000) THEN RETURN 0; END IF;
                RETURN position;
            END
            ;;

            DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
            DELIMITER ;;
            CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
            RETURNS varchar(4000)
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE sortString varchar(4000);
                DECLARE numStartIndex int;
                DECLARE numEndIndex int;
                DECLARE padLength int;
                DECLARE totalPadLength int;
                DECLARE i int;
                DECLARE sameOrderCharsLen int;

                SET totalPadLength = 0;
                SET instring = TRIM(instring);
                SET sortString = instring;
                SET numStartIndex = udf_FirstNumberPos(instring);
                SET numEndIndex = 0;
                SET i = 1;
                SET sameOrderCharsLen = LENGTH(sameOrderChars);

                WHILE (i <= sameOrderCharsLen) DO
                    SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
                    SET i = i + 1;
                END WHILE;

                WHILE (numStartIndex <> 0) DO
                    SET numStartIndex = numStartIndex + numEndIndex;
                    SET numEndIndex = numStartIndex;

                    WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                        SET numEndIndex = numEndIndex + 1;
                    END WHILE;

                    SET numEndIndex = numEndIndex - 1;

                    SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

                    IF padLength < 0 THEN
                        SET padLength = 0;
                    END IF;

                    SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

                    SET totalPadLength = totalPadLength + padLength;
                    SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
                END WHILE;

                RETURN sortString;
            END
            ;;
            {code}
            rspadim roberto spadim made changes -
            rspadim roberto spadim made changes -
            Summary UDF - Native natural sorting UDF - Native natural sorting / natural compare
            rspadim roberto spadim made changes -
            Description Hi guys, could be nice a natural sorting inside mariadb
            today some guys use SQL procedures/functions to execute this cpu intensive task
            php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?


            Here a example in STORED PROCEDURE/FUNCTIONS, but not exaust tested:
            source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql
            {code}
            DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
            DELIMITER ;;
            CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
            RETURNS int
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE position int;
                DECLARE tmp_position int;
                SET position = 5000;
                SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;

                IF (position = 5000) THEN RETURN 0; END IF;
                RETURN position;
            END
            ;;

            DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
            DELIMITER ;;
            CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
            RETURNS varchar(4000)
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE sortString varchar(4000);
                DECLARE numStartIndex int;
                DECLARE numEndIndex int;
                DECLARE padLength int;
                DECLARE totalPadLength int;
                DECLARE i int;
                DECLARE sameOrderCharsLen int;

                SET totalPadLength = 0;
                SET instring = TRIM(instring);
                SET sortString = instring;
                SET numStartIndex = udf_FirstNumberPos(instring);
                SET numEndIndex = 0;
                SET i = 1;
                SET sameOrderCharsLen = LENGTH(sameOrderChars);

                WHILE (i <= sameOrderCharsLen) DO
                    SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
                    SET i = i + 1;
                END WHILE;

                WHILE (numStartIndex <> 0) DO
                    SET numStartIndex = numStartIndex + numEndIndex;
                    SET numEndIndex = numStartIndex;

                    WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                        SET numEndIndex = numEndIndex + 1;
                    END WHILE;

                    SET numEndIndex = numEndIndex - 1;

                    SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

                    IF padLength < 0 THEN
                        SET padLength = 0;
                    END IF;

                    SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

                    SET totalPadLength = totalPadLength + padLength;
                    SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
                END WHILE;

                RETURN sortString;
            END
            ;;
            {code}
            Hi guys, could be nice a natural sorting inside mariadb
            today some guys use SQL procedures/functions to execute this cpu intensive task
            php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?

            functions:
            1) natural sort, used in ORDER BY / GROUP BY
            2) natural sort compare, could be used as an operator, like the "SOUNDS LIKE" operator, in other words, the function that return a canonical form of the string could be used to compare, something that could be rewrite "field NATURAL LIKE value" to "natual(field)=natural(value)"

            i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly
            example of (2): http://www.php.net/manual/en/function.strnatcmp.php
            --------------


            1)
            Here a example in STORED PROCEDURE/FUNCTIONS for ORDER BY, but not exaust tested:
            source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql
            {code}
            DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
            DELIMITER ;;
            CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
            RETURNS int
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE position int;
                DECLARE tmp_position int;
                SET position = 5000;
                SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;

                IF (position = 5000) THEN RETURN 0; END IF;
                RETURN position;
            END
            ;;

            DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
            DELIMITER ;;
            CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
            RETURNS varchar(4000)
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE sortString varchar(4000);
                DECLARE numStartIndex int;
                DECLARE numEndIndex int;
                DECLARE padLength int;
                DECLARE totalPadLength int;
                DECLARE i int;
                DECLARE sameOrderCharsLen int;

                SET totalPadLength = 0;
                SET instring = TRIM(instring);
                SET sortString = instring;
                SET numStartIndex = udf_FirstNumberPos(instring);
                SET numEndIndex = 0;
                SET i = 1;
                SET sameOrderCharsLen = LENGTH(sameOrderChars);

                WHILE (i <= sameOrderCharsLen) DO
                    SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
                    SET i = i + 1;
                END WHILE;

                WHILE (numStartIndex <> 0) DO
                    SET numStartIndex = numStartIndex + numEndIndex;
                    SET numEndIndex = numStartIndex;

                    WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                        SET numEndIndex = numEndIndex + 1;
                    END WHILE;

                    SET numEndIndex = numEndIndex - 1;

                    SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

                    IF padLength < 0 THEN
                        SET padLength = 0;
                    END IF;

                    SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

                    SET totalPadLength = totalPadLength + padLength;
                    SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
                END WHILE;

                RETURN sortString;
            END
            ;;
            {code}
            rspadim roberto spadim made changes -
            Description Hi guys, could be nice a natural sorting inside mariadb
            today some guys use SQL procedures/functions to execute this cpu intensive task
            php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?

            functions:
            1) natural sort, used in ORDER BY / GROUP BY
            2) natural sort compare, could be used as an operator, like the "SOUNDS LIKE" operator, in other words, the function that return a canonical form of the string could be used to compare, something that could be rewrite "field NATURAL LIKE value" to "natual(field)=natural(value)"

            i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly
            example of (2): http://www.php.net/manual/en/function.strnatcmp.php
            --------------


            1)
            Here a example in STORED PROCEDURE/FUNCTIONS for ORDER BY, but not exaust tested:
            source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql
            {code}
            DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
            DELIMITER ;;
            CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
            RETURNS int
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE position int;
                DECLARE tmp_position int;
                SET position = 5000;
                SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;

                IF (position = 5000) THEN RETURN 0; END IF;
                RETURN position;
            END
            ;;

            DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
            DELIMITER ;;
            CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
            RETURNS varchar(4000)
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE sortString varchar(4000);
                DECLARE numStartIndex int;
                DECLARE numEndIndex int;
                DECLARE padLength int;
                DECLARE totalPadLength int;
                DECLARE i int;
                DECLARE sameOrderCharsLen int;

                SET totalPadLength = 0;
                SET instring = TRIM(instring);
                SET sortString = instring;
                SET numStartIndex = udf_FirstNumberPos(instring);
                SET numEndIndex = 0;
                SET i = 1;
                SET sameOrderCharsLen = LENGTH(sameOrderChars);

                WHILE (i <= sameOrderCharsLen) DO
                    SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
                    SET i = i + 1;
                END WHILE;

                WHILE (numStartIndex <> 0) DO
                    SET numStartIndex = numStartIndex + numEndIndex;
                    SET numEndIndex = numStartIndex;

                    WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                        SET numEndIndex = numEndIndex + 1;
                    END WHILE;

                    SET numEndIndex = numEndIndex - 1;

                    SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

                    IF padLength < 0 THEN
                        SET padLength = 0;
                    END IF;

                    SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

                    SET totalPadLength = totalPadLength + padLength;
                    SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
                END WHILE;

                RETURN sortString;
            END
            ;;
            {code}
            Hi guys, could be nice a natural sorting inside mariadb
            today some guys use SQL procedures/functions to execute this cpu intensive task
            php released a nice function to do this job, maybe we could port part of it and implement in udf function, and release as a default udf function?

            functions:
            1) natural sort, used in ORDER BY / GROUP BY
            2) natural sort compare, could be used as an operator, like the "SOUNDS LIKE" operator, in other words, the function that return a canonical form of the string could be used to compare, something that could be rewrite "field NATURAL LIKE value" to "natual(field)=natural(value)"

            i didn't found (2) in internet, but it's used in php, and the (1) should use it in some place to order/group correctly
            --------------
            example of (2):
            http://www.php.net/manual/en/function.strnatcmp.php
            https://github.com/php/php-src/blob/master/ext/standard/strnatcmp.c
            --------------


            example of (1)
            Here a example in STORED PROCEDURE/FUNCTIONS for ORDER BY, but not exaust tested:
            source: http://stackoverflow.com/questions/153633/natural-sort-in-mysql
            {code}
            DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
            DELIMITER ;;
            CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000))
            RETURNS int
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE position int;
                DECLARE tmp_position int;
                SET position = 5000;
                SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
                SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;

                IF (position = 5000) THEN RETURN 0; END IF;
                RETURN position;
            END
            ;;

            DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
            DELIMITER ;;
            CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50))
            RETURNS varchar(4000)
            LANGUAGE SQL
            DETERMINISTIC
            NO SQL
            SQL SECURITY INVOKER
            BEGIN
                DECLARE sortString varchar(4000);
                DECLARE numStartIndex int;
                DECLARE numEndIndex int;
                DECLARE padLength int;
                DECLARE totalPadLength int;
                DECLARE i int;
                DECLARE sameOrderCharsLen int;

                SET totalPadLength = 0;
                SET instring = TRIM(instring);
                SET sortString = instring;
                SET numStartIndex = udf_FirstNumberPos(instring);
                SET numEndIndex = 0;
                SET i = 1;
                SET sameOrderCharsLen = LENGTH(sameOrderChars);

                WHILE (i <= sameOrderCharsLen) DO
                    SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
                    SET i = i + 1;
                END WHILE;

                WHILE (numStartIndex <> 0) DO
                    SET numStartIndex = numStartIndex + numEndIndex;
                    SET numEndIndex = numStartIndex;

                    WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
                        SET numEndIndex = numEndIndex + 1;
                    END WHILE;

                    SET numEndIndex = numEndIndex - 1;

                    SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

                    IF padLength < 0 THEN
                        SET padLength = 0;
                    END IF;

                    SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

                    SET totalPadLength = totalPadLength + padLength;
                    SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, LENGTH(instring) - numEndIndex));
                END WHILE;

                RETURN sortString;
            END
            ;;
            {code}
            serg Sergei Golubchik made changes -
            serg Sergei Golubchik made changes -
            Labels upstream

            a partial patch, todo:
            natstring need fraction implementation
            use collations

            rewrite, it's too ugly now

            rspadim roberto spadim added a comment - a partial patch, todo: natstring need fraction implementation use collations rewrite, it's too ugly now
            rspadim roberto spadim made changes -
            Attachment mariadb_nat_comp.c [ 22501 ]

            i done some job, as a start point...

            natstring = canonical string (can use in order by natstring(some_string))
            what need to be changed: the number is rewrite as a string with zeropad, but it's a fixed size (20), the size could be a optional parameter example: natstring("aasdfasdf",10), when number is bigger than 10, it's just copied not rewrite with zeropad)

            what this function should do:

            natstring("hello 1 person in 2 tables",13);
            return:
            "hello 0000000000001 person in 0000000000002 tables"

            —
            how to rewrite numbers?
            1) 1 vs 01 (today this function do)
            -> the first idea is use bigint precision 20 digits => "18446744073709551615"/"+-9223372036854775807"
            "1" and "01" = "00000000000000000001"
            2) decimal: 0.1 (today this function don't do in the right way)
            using the same idea of (1):
            0.1 = "0"."1" => "00000000000000000000.00000000000000000001" (today it do, but it's not right...)
            or maybe understand that's as a fraction:
            0.1 = "00000000000000000000.10000000000000000000" (today it don't do, that's the right way)
            3) 10e-1 vs 0.1
            that's a problem since 10e-1 and 0.1 is the same value,
            but when you read 10e-1 you don't think about 0.1, i'm right?
            maybe we can use (2) here
            -> rewrite to 00000000000000000010e-00000000000000000001

            00000000000000000010e-00000000000000000001 vs 00000000000000000000.00000000000000000001
            strnatcasecmp('10e-1','0.1') => return "1" (right)
            strnatcmp('10e-1','0.1') => return "1" (right)
            4) what about numbers with more than 20 digits? example:
            PI = "3.1415926535897932384626433832795028841971693993751"
            rewrite with length = ceil(size/20)*20 ?
            no! just copy! (today this function do this part)
            5)leading zeros (today it do)
            "0000000000000000000000000000000000000000000000000000000001"
            must be rewrite to "1" => "00000000000000000001"
            6) special numbers (version and others kinds), example: (it's part of (2) problem)
            '1.1.1' should be understood as:
            integer.integer.integer instead of
            integer.fraction.integer
            1.1.1 (especial) = 00000000000000000001.00000000000000000001.00000000000000000001
            1.1 (not special) = 00000000000000000001.10000000000000000000
            check that if second or third number is bigger than 20 (max number size) it's just copied, but it's not a fraction, it's just a integer

            fraction = number+'.'+number + (next pos!='.' and next next pos is not a digit)
            especial = number+'.'number'.'+number... (end when don't find "."+digit)
            examples
            6.1)"1.1.1" => number+'.'number'.'+number + end of string (!='.'+digit) = special
            6.2)"1.1.1." => "1.1.1" (special) + "."
            number+'.'number'.'number (!='.'+digit) [non special part= '.']
            6.3)"1.1.1.a" => "1.1.1" (special) + ".a"
            number+'.'number'.'number (!='.'+digit) [non special part = '.a']

            rspadim roberto spadim added a comment - i done some job, as a start point... natstring = canonical string (can use in order by natstring(some_string)) what need to be changed: the number is rewrite as a string with zeropad, but it's a fixed size (20), the size could be a optional parameter example: natstring("aasdfasdf",10), when number is bigger than 10, it's just copied not rewrite with zeropad) what this function should do: natstring("hello 1 person in 2 tables",13); return: "hello 0000000000001 person in 0000000000002 tables" — how to rewrite numbers? 1) 1 vs 01 (today this function do) -> the first idea is use bigint precision 20 digits => "18446744073709551615"/"+-9223372036854775807" "1" and "01" = "00000000000000000001" 2) decimal: 0.1 (today this function don't do in the right way) using the same idea of (1): 0.1 = "0"."1" => "00000000000000000000.00000000000000000001" (today it do, but it's not right...) or maybe understand that's as a fraction: 0.1 = "00000000000000000000.10000000000000000000" (today it don't do, that's the right way) 3) 10e-1 vs 0.1 that's a problem since 10e-1 and 0.1 is the same value, but when you read 10e-1 you don't think about 0.1, i'm right? maybe we can use (2) here -> rewrite to 00000000000000000010e-00000000000000000001 00000000000000000010e-00000000000000000001 vs 00000000000000000000.00000000000000000001 strnatcasecmp('10e-1','0.1') => return "1" (right) strnatcmp('10e-1','0.1') => return "1" (right) 4) what about numbers with more than 20 digits? example: PI = "3.1415926535897932384626433832795028841971693993751" rewrite with length = ceil(size/20)*20 ? no! just copy! (today this function do this part) 5)leading zeros (today it do) "0000000000000000000000000000000000000000000000000000000001" must be rewrite to "1" => "00000000000000000001" 6) special numbers (version and others kinds), example: (it's part of (2) problem) '1.1.1' should be understood as: integer.integer.integer instead of integer.fraction.integer 1.1.1 (especial) = 00000000000000000001.00000000000000000001.00000000000000000001 1.1 (not special) = 00000000000000000001.10000000000000000000 check that if second or third number is bigger than 20 (max number size) it's just copied, but it's not a fraction, it's just a integer fraction = number+'.'+number + (next pos!='.' and next next pos is not a digit) especial = number+'.' number '.'+number... (end when don't find "."+digit) examples 6.1)"1.1.1" => number+'.' number '.'+number + end of string (!='.'+digit) = special 6.2)"1.1.1." => "1.1.1" (special) + "." number+'.' number '.' number (!='.'+digit) [non special part= '.'] 6.3)"1.1.1.a" => "1.1.1" (special) + ".a" number+'.' number '.' number (!='.'+digit) [non special part = '.a']

            i'm re thinking about the source code of canonical form (natural string), the first scratch was very ugly

            the easier way is:
            find first digit, find where the number ends

            • number= start with [0-9], end with [0-9], may have [0-9] in middle, may have one "." before and after two digits
              copy previous string without double space
              write the number from start digit to end rewriting with the zero padding, and espacial cases / fraction
              find next number or copy the end string removing the double spaces
              end
            rspadim roberto spadim added a comment - i'm re thinking about the source code of canonical form (natural string), the first scratch was very ugly the easier way is: find first digit, find where the number ends number= start with [0-9] , end with [0-9] , may have [0-9] in middle, may have one "." before and after two digits copy previous string without double space write the number from start digit to end rewriting with the zero padding, and espacial cases / fraction find next number or copy the end string removing the double spaces end
            serg Sergei Golubchik made changes -
            Workflow defaullt [ 27800 ] MariaDB v2 [ 46652 ]
            ratzpo Rasmus Johansson (Inactive) made changes -
            Workflow MariaDB v2 [ 46652 ] MariaDB v3 [ 61595 ]
            serg Sergei Golubchik made changes -
            Priority Trivial [ 5 ] Major [ 3 ]
            serg Sergei Golubchik made changes -
            Summary UDF - Native natural sorting / natural compare A function for native natural sorting / natural compare
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels upstream
            serg Sergei Golubchik made changes -
            Fix Version/s 10.7 [ 24805 ]

            here's a more compact "sorting string" encoding approach:

            • split the source string into sequences of digits and non-digits. Like perl's split /(\d+)/.
            • every non-digit chunk is put into the encoded string as is
            • every number (a sequence of digits) is stored as a 64-bit number, bigendian
              • alternatively, could be some variable-length encoding, that can be memcmp-ed
            • put one zero byte between numbers and non-numbers.

            this won't work correctly for strings with embedded zero bytes. We can say that there's nothing natural in these strings
            and this won't work if numbers won't fit in 64 bit. variable-length encoding will solve it, though

            serg Sergei Golubchik added a comment - here's a more compact "sorting string" encoding approach: split the source string into sequences of digits and non-digits. Like perl's split /(\d+)/ . every non-digit chunk is put into the encoded string as is every number (a sequence of digits) is stored as a 64-bit number, bigendian alternatively, could be some variable-length encoding, that can be memcmp-ed put one zero byte between numbers and non-numbers. this won't work correctly for strings with embedded zero bytes. We can say that there's nothing natural in these strings and this won't work if numbers won't fit in 64 bit. variable-length encoding will solve it, though
            serg Sergei Golubchik added a comment - - edited

            example of a variable-length encoding for numbers:

            • first byte: number of following bytes.
            • then the number, bigendian, without leading zero bytes.

            for example:

            • 1234567890 (that is 0x00000000499602D2) will be 0x4499602D2
            • 5 will be 0x105
            • 987654321987654321987654321 (0x330F7F01403F94EDB1812B1) will be 0x0C0330F7F01403F94EDB1812B1

            this can handle numbers that are longer than 64 bit

            serg Sergei Golubchik added a comment - - edited example of a variable-length encoding for numbers: first byte: number of following bytes. then the number, bigendian, without leading zero bytes. for example: 1234567890 (that is 0x00000000499602D2) will be 0x4499602D2 5 will be 0x105 987654321987654321987654321 (0x330F7F01403F94EDB1812B1) will be 0x0C0330F7F01403F94EDB1812B1 this can handle numbers that are longer than 64 bit

            Another simplification, to avoid arbitrary length numbers (like 0x0C0330F7F01403F94EDB1812B1 above), it'd be easier to use some variant of BCD encoding. For example, 987654321987654321987654321 could be 0x0E0987654321987654321987654321, which is 2 bytes longer that a true binary representation, but much easier to calculate

            serg Sergei Golubchik added a comment - Another simplification, to avoid arbitrary length numbers (like 0x0C0330F7F01403F94EDB1812B1 above), it'd be easier to use some variant of BCD encoding. For example, 987654321987654321987654321 could be 0x0E0987654321987654321987654321, which is 2 bytes longer that a true binary representation, but much easier to calculate
            serg Sergei Golubchik made changes -
            Assignee Oleksandr Byelkin [ sanja ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Priority Major [ 3 ] Critical [ 2 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Due Date 2021-09-14
            sanja Oleksandr Byelkin made changes -
            Assignee Oleksandr Byelkin [ sanja ] Vladislav Vaintroub [ wlad ]

            not exactly clear how this mix of binary and text data will work with different charsets

            serg Sergei Golubchik added a comment - not exactly clear how this mix of binary and text data will work with different charsets

            we could normalize the charset to be utf8mb4 for the input string, before creating a key first.
            However, on things that bugs me is that sorting on strings is ASCIIbetical in Martin Pool implementation,
            I thought if we wanted natural sort talking about strings parts, it should be something like collating key, produced according to the UCA algorithm.

            wlad Vladislav Vaintroub added a comment - we could normalize the charset to be utf8mb4 for the input string, before creating a key first. However, on things that bugs me is that sorting on strings is ASCIIbetical in Martin Pool implementation, I thought if we wanted natural sort talking about strings parts, it should be something like collating key, produced according to the UCA algorithm.
            bar Alexander Barkov added a comment - - edited

            Numeric sorting is a part of the Unicode collation algorithm:
            Search for the numericOrdering string in https://unicode.org/reports/tr35/tr35-collation.html
            Here's a relevant quote:

            If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UAX44]) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123".

            I think we should extend our Unicode collation algorithm implementation to support the [numericOrdering on] collation option.

            I think implementing a collation option should be better than implementing a function:

            • It should provide a better performance.
            • It's good to implement more of the Unicode standard.
            • It should be doable with low efforts.
            • It will natively (i.e. without virtual columns) support all advantages of a collation: indexes, partitioning, hash based HEAP tables, etc.

            The attached patch num.01.diff contains a preliminary working prototype.

            It currently has some limitations, but it should be relatively easy to address them:

            • For now it supports only ASCII digits. It should support all Unicode digits from the "Nd" category, e.g.:

              1041;MYANMAR DIGIT ONE;Nd;0;L;;1;1;1;N;;;;;
              

            • For now it works fine only for numbers fitting into uint32. It should support digit sequences with arbitrary lengths.
            • For now it's hard coded into utf8_myanmar_ci, just for test purposes. It should allow to add collations with the enabled numericOrdering attribute through share/charsets/Index.xml (and in long terms through the CREATE COLLATION statement - when it's implemented).

            Here's an example query output:

            SET NAMES utf8 COLLATE utf8_myanmar_ci;
            SELECT * FROM (VALUES ('a1'),('a2'),('a100'),('a20000'),('a9'),('a99')) t ORDER BY 1;
            

            +--------+
            | a1     |
            +--------+
            | a1     |
            | a2     |
            | a9     |
            | a99    |
            | a100   |
            | a20000 |
            +--------+
            6 rows in set (0.001 sec)
            

            bar Alexander Barkov added a comment - - edited Numeric sorting is a part of the Unicode collation algorithm: Search for the numericOrdering string in https://unicode.org/reports/tr35/tr35-collation.html Here's a relevant quote: If set to on, any sequence of Decimal Digits (General_Category = Nd in the [UAX44] ) is sorted at a primary level with its numeric value. For example, "A-21" < "A-123". I think we should extend our Unicode collation algorithm implementation to support the [numericOrdering on] collation option. I think implementing a collation option should be better than implementing a function: It should provide a better performance. It's good to implement more of the Unicode standard. It should be doable with low efforts. It will natively (i.e. without virtual columns) support all advantages of a collation: indexes, partitioning, hash based HEAP tables, etc. The attached patch num.01.diff contains a preliminary working prototype. It currently has some limitations, but it should be relatively easy to address them: For now it supports only ASCII digits. It should support all Unicode digits from the "Nd" category, e.g.: 1041;MYANMAR DIGIT ONE;Nd;0;L;;1;1;1;N;;;;; For now it works fine only for numbers fitting into uint32. It should support digit sequences with arbitrary lengths. For now it's hard coded into utf8_myanmar_ci, just for test purposes. It should allow to add collations with the enabled numericOrdering attribute through share/charsets/Index.xml (and in long terms through the CREATE COLLATION statement - when it's implemented). Here's an example query output: SET NAMES utf8 COLLATE utf8_myanmar_ci; SELECT * FROM ( VALUES ( 'a1' ),( 'a2' ),( 'a100' ),( 'a20000' ),( 'a9' ),( 'a99' )) t ORDER BY 1; +--------+ | a1 | +--------+ | a1 | | a2 | | a9 | | a99 | | a100 | | a20000 | +--------+ 6 rows in set (0.001 sec)
            bar Alexander Barkov made changes -
            Attachment num.01.diff [ 58407 ]

            Another example, with the underlying sorting weights:

            SET NAMES utf8 COLLATE utf8_myanmar_ci;
            SELECT a1, HEX(WEIGHT_STRING(a1)) FROM (VALUES ('a1'),('a2'),('a100'),('a20000'),('a9'),('a99')) t ORDER BY 1;
            

            +--------+--------------------------+
            | a1     | HEX(WEIGHT_STRING(a1))   |
            +--------+--------------------------+
            | a1     | 120F12050201020002000201 |
            | a2     | 120F12050201020002000202 |
            | a9     | 120F12050201020002000209 |
            | a99    | 120F12050202020002000263 |
            | a100   | 120F12050203020002000264 |
            | a20000 | 120F12050205020002090820 |
            +--------+--------------------------+
            

            bar Alexander Barkov added a comment - Another example, with the underlying sorting weights: SET NAMES utf8 COLLATE utf8_myanmar_ci; SELECT a1, HEX(WEIGHT_STRING(a1)) FROM ( VALUES ( 'a1' ),( 'a2' ),( 'a100' ),( 'a20000' ),( 'a9' ),( 'a99' )) t ORDER BY 1; +--------+--------------------------+ | a1 | HEX(WEIGHT_STRING(a1)) | +--------+--------------------------+ | a1 | 120F12050201020002000201 | | a2 | 120F12050201020002000202 | | a9 | 120F12050201020002000209 | | a99 | 120F12050202020002000263 | | a100 | 120F12050203020002000264 | | a20000 | 120F12050205020002090820 | +--------+--------------------------+
            bar Alexander Barkov made changes -
            Attachment num.01.diff [ 58407 ]
            bar Alexander Barkov made changes -
            Attachment num.01.diff [ 58408 ]

            Realistically, it could be utf8_numeric_ci or utf8_generic_ci_numeric that does it. Using utf8_generic_ci rules for non-digits.

            Could use prefix encoding as above to support arbitrary-length numbers.

            Benefits:

            • avoids materializing of the weight string for sorting, so might be faster than a function.
            • logically string comparison belongs in a collation, so it's a more natural approach
            • can be used in indexes easier (if natural_sort() is a function, it can be indexed too, but needs a virtual column)

            Drawbacks:

            • _numeric variant has to be created per collation. Here we'll do only utf8_generic_ci_numeric, but users might later request utf8_czech_ci_numeric, etc.
            • every collation takes a collation ID out of the pool
            • materializing the string and then memcmp-ing it might be faster than strnncoll-ing every time.

            Effort-wise I think both approaches are about the same.

            serg Sergei Golubchik added a comment - Realistically, it could be utf8_numeric_ci or utf8_generic_ci_numeric that does it. Using utf8_generic_ci rules for non-digits. Could use prefix encoding as above to support arbitrary-length numbers. Benefits: avoids materializing of the weight string for sorting, so might be faster than a function. logically string comparison belongs in a collation, so it's a more natural approach can be used in indexes easier (if natural_sort() is a function, it can be indexed too, but needs a virtual column) Drawbacks: _numeric variant has to be created per collation. Here we'll do only utf8_generic_ci_numeric, but users might later request utf8_czech_ci_numeric, etc. every collation takes a collation ID out of the pool materializing the string and then memcmp-ing it might be faster than strnncoll-ing every time. Effort-wise I think both approaches are about the same.
            wlad Vladislav Vaintroub made changes -
            wlad Vladislav Vaintroub added a comment - - edited

            I think I found I relatively good scheme to encode numbers, so that a valid string remains a valid string, i.e without BCD, and with a relatively small overhead. Moreover, numeric strings remain numeric strings, so that it is not possible to mix up numbers and strings, that contain no digits

            I imagine it like this

            • numeric strings up to the length of 10 will get a single digit prefix, '0' + len -1
            • numeric strings that are larger, will be split in chunks of 10, or smaller than 10, each encoded like above, with single digit prefix.

            Examples
            Calculate key('9') . prefix = '0' + len('9') -1 = '0' , thus key('9') = '09'
            Calculate key('19') prefix = '1', key = '19'

            Note , while '9' > '19 lexicographically ', the key('9') < key('19'), which is he desired property.

            Calculate key('12345678901') , gives 2 chunks, of length 10 and 1
            partial key('1234567890') = '91234567890' (adds '9' as prefix), partial key('1') = '01',
            full key = '9123456789001'

            I'd think that for realistic numbers, this encoding is compact enough - it only adds 1 extra char
            for numbers < 10 billions, and encoding form is still a string in the original charset (which can be converted to UTF8, and also given COLLATE to sort whatever user wants)

            serg, any comments on this schema

            wlad Vladislav Vaintroub added a comment - - edited I think I found I relatively good scheme to encode numbers, so that a valid string remains a valid string, i.e without BCD, and with a relatively small overhead. Moreover, numeric strings remain numeric strings, so that it is not possible to mix up numbers and strings, that contain no digits I imagine it like this numeric strings up to the length of 10 will get a single digit prefix, '0' + len -1 numeric strings that are larger, will be split in chunks of 10, or smaller than 10, each encoded like above, with single digit prefix. Examples Calculate key('9') . prefix = '0' + len('9') -1 = '0' , thus key('9') = '09' Calculate key('19') prefix = '1', key = '19' Note , while '9' > '19 lexicographically ', the key('9') < key('19'), which is he desired property. Calculate key('12345678901') , gives 2 chunks, of length 10 and 1 partial key('1234567890') = '91234567890' (adds '9' as prefix), partial key('1') = '01', full key = '9123456789001' I'd think that for realistic numbers, this encoding is compact enough - it only adds 1 extra char for numbers < 10 billions, and encoding form is still a string in the original charset (which can be converted to UTF8, and also given COLLATE to sort whatever user wants) serg , any comments on this schema
            wlad Vladislav Vaintroub added a comment - - edited

            Oh, I missed the bar's reply to that, and @serg's. A unicode collation is kinda attractive too, although I would only implement that for case-sensitive, accent-sensitive UCA collation that does not yet exist, because primary application for this feature in my understanding is sorting, rather than anything else (e.g searching). And sorting is best when it is unambiguous, while currently everything Unicode, except _bin collation is ambiguous (and we know that _bin is anything but natural)

            wlad Vladislav Vaintroub added a comment - - edited Oh, I missed the bar 's reply to that, and @serg's. A unicode collation is kinda attractive too, although I would only implement that for case-sensitive, accent-sensitive UCA collation that does not yet exist, because primary application for this feature in my understanding is sorting, rather than anything else (e.g searching). And sorting is best when it is unambiguous, while currently everything Unicode, except _bin collation is ambiguous (and we know that _bin is anything but natural)
            bar Alexander Barkov added a comment - - edited

            I agree with wlad, we should add accent-sensitive case-sensitive UCA collations.

            For now, we have only an accent-sensitive case-INsensitive UCA collation: utf8mb3_thai_520_w2

            Note, it can be used to provide unambiguous sorting order in combination with numericOrdering-enabled collations, like this:

            SET NAMES utf8;
            CREATE OR REPLACE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE utf8_myanmar_ci);
            INSERT INTO t1 VALUES ('a1'),('ä1'),('Ä1'),('Å1'),('å1'),('a2'),('a100'),('a20000'),('a9'),('a99');
            SELECT a, HEX(WEIGHT_STRING(a)) FROM t1
            ORDER BY
              a,
              WEIGHT_STRING(a COLLATE utf8mb3_thai_520_w2 LEVEL 2),
              BINARY a;
            

            +--------+--------------------------+
            | a      | HEX(WEIGHT_STRING(a))    |
            +--------+--------------------------+
            | a1     | 120F12050201020002000201 |
            | Ã…1     | 120F12050201020002000201 |
            | å1     | 120F12050201020002000201 |
            | Ä1     | 120F12050201020002000201 |
            | ä1     | 120F12050201020002000201 |
            | a2     | 120F12050201020002000202 |
            | a9     | 120F12050201020002000209 |
            | a99    | 120F12050202020002000263 |
            | a100   | 120F12050203020002000264 |
            | a20000 | 120F12050205020002090820 |
            +--------+--------------------------+
            

            Here the ORDER BY has three parts:

            • The a part orders records according to the base letter (and handles numbers at the same time)
            • The WEIGHT_STRING(a COLLATE utf8mb3_thai_520_w2 LEVEL 2) part makes accent order unambiguous.
            • The BINARY a part makes case order unambiguous.
            bar Alexander Barkov added a comment - - edited I agree with wlad , we should add accent-sensitive case-sensitive UCA collations. For now, we have only an accent-sensitive case-INsensitive UCA collation: utf8mb3_thai_520_w2 Note, it can be used to provide unambiguous sorting order in combination with numericOrdering-enabled collations, like this: SET NAMES utf8; CREATE OR REPLACE TABLE t1 (a VARCHAR (32) CHARACTER SET utf8 COLLATE utf8_myanmar_ci); INSERT INTO t1 VALUES ( 'a1' ),( 'ä1' ),( 'Ä1' ),( 'Å1' ),( 'å1' ),( 'a2' ),( 'a100' ),( 'a20000' ),( 'a9' ),( 'a99' ); SELECT a, HEX(WEIGHT_STRING(a)) FROM t1 ORDER BY a, WEIGHT_STRING(a COLLATE utf8mb3_thai_520_w2 LEVEL 2), BINARY a; +--------+--------------------------+ | a | HEX(WEIGHT_STRING(a)) | +--------+--------------------------+ | a1 | 120F12050201020002000201 | | Å1 | 120F12050201020002000201 | | å1 | 120F12050201020002000201 | | Ä1 | 120F12050201020002000201 | | ä1 | 120F12050201020002000201 | | a2 | 120F12050201020002000202 | | a9 | 120F12050201020002000209 | | a99 | 120F12050202020002000263 | | a100 | 120F12050203020002000264 | | a20000 | 120F12050205020002090820 | +--------+--------------------------+ Here the ORDER BY has three parts: The a part orders records according to the base letter (and handles numbers at the same time) The WEIGHT_STRING(a COLLATE utf8mb3_thai_520_w2 LEVEL 2) part makes accent order unambiguous. The BINARY a part makes case order unambiguous.
            bar Alexander Barkov added a comment - - edited

            Hello serg. I agree, creating built-in numericOrderding=on variants for all existing regular collations (and thus spending a lot of available IDs in the pool) is a problem.

            I propose we don't do that. Some ideas how to avoid this:

            • We can allow to create custom collations with the numericOrdering=on attribute in share/charsets/Index.xml. So every user can decide which collation needs a numericOrdering-enabled counterpart.
            • We can also add the SQL standard CREATE COLLATE statement support. It should not be too complex.
            • We can allow customizing collations in the context where we don't really need fixed collations IDs. For example:

              SELECT * FROM t1 ORDER BY a COLLATE `utf8_czech_ci@numericOrdering=on`;
              

              In this statement, we can create a temporary customized collation instance (based on Czech, by with numericOrdering enabled) which life cycle will end as soon as the query ends.

            • Also we can store customized collations in FRM files in text format (like we recently allowed user defined datatype names), so these statements will also work:

              CREATE TABLE t1 (a VARCHAR(32) CHARACTER SET utf8 COLLATE `utf8_czech_ci@numericOrdering=on`);
              

              Some changes in replication will also be needed.

            So I really prefer the numeric sorting to be implemented as a UCA collation option.
            It will be immediately usable without adding many built-in collations on one hand. And it will indirectly prioritize these tasks on the other hand:

            • CREATE COLLATION
            • Accent-sensitive case-sensitive collations

            which I think are overdue.

            It will also support this approach you mentioned:

            materializing the string and then memcmp-ing it might be faster than strnncoll-ing every time.

            with help of the WEIGHT_STRING() function. So one can decide which of these two equivalent queries to use:

            SELECT * FROM t1 ORDER BY a COLLATE utf8_czech_ci_numeric;
            SELECT * FROM t1 ORDER BY WEIGHT_STRING(a COLLATE utf8_czech_ci_numeric);
            

            The former works without materializing, the latter works with materializing.

            bar Alexander Barkov added a comment - - edited Hello serg . I agree, creating built-in numericOrderding=on variants for all existing regular collations (and thus spending a lot of available IDs in the pool) is a problem. I propose we don't do that. Some ideas how to avoid this: We can allow to create custom collations with the numericOrdering=on attribute in share/charsets/Index.xml . So every user can decide which collation needs a numericOrdering-enabled counterpart. We can also add the SQL standard CREATE COLLATE statement support. It should not be too complex. We can allow customizing collations in the context where we don't really need fixed collations IDs. For example: SELECT * FROM t1 ORDER BY a COLLATE `utf8_czech_ci@numericOrdering= on `; In this statement, we can create a temporary customized collation instance (based on Czech, by with numericOrdering enabled) which life cycle will end as soon as the query ends. Also we can store customized collations in FRM files in text format (like we recently allowed user defined datatype names), so these statements will also work: CREATE TABLE t1 (a VARCHAR (32) CHARACTER SET utf8 COLLATE `utf8_czech_ci@numericOrdering= on `); Some changes in replication will also be needed. So I really prefer the numeric sorting to be implemented as a UCA collation option. It will be immediately usable without adding many built-in collations on one hand. And it will indirectly prioritize these tasks on the other hand: CREATE COLLATION Accent-sensitive case-sensitive collations which I think are overdue. It will also support this approach you mentioned: materializing the string and then memcmp-ing it might be faster than strnncoll-ing every time. with help of the WEIGHT_STRING() function. So one can decide which of these two equivalent queries to use: SELECT * FROM t1 ORDER BY a COLLATE utf8_czech_ci_numeric; SELECT * FROM t1 ORDER BY WEIGHT_STRING(a COLLATE utf8_czech_ci_numeric); The former works without materializing, the latter works with materializing.
            wlad Vladislav Vaintroub added a comment - - edited

            To tell the truth, the numericOrdering in Unicode is described a little bit sketchy,UTS#10, the collation paper, does not mention how to do it

            Is there anything yet that implements it? The i18 "golden standard", ICU, do they allow that? bar ?

            To be honest, I think a string to string transformation with a function should be OK, an I also think that natural_sort is rather exotic, so people probably can live with virtual functions by now. It will be immediately usable, at the tiny cost of virtual functions, if one wants an index to be used in ORDER BY

            Not saying that supporting that in collations is too bad, but I think the overhead of doing that is perhaps too much given the scope of that MDEV.

            People won't be willing to create their own collations, I doubt that very much.
            Storing collation attributes (locale, weight level or, accent and case sensitivity, pad-nopad) in the FRM is a good thing, fully independent from numericOrdering.

            Also, a function can be adopted and given parameter to recognize and handle negatives, whether to remove leading zeroes or not, or handle decimal fractions (with a user-defined decimal point and thousands separator), and this will be too much to put into numericOrdering only.

            wlad Vladislav Vaintroub added a comment - - edited To tell the truth, the numericOrdering in Unicode is described a little bit sketchy,UTS#10, the collation paper, does not mention how to do it Is there anything yet that implements it? The i18 "golden standard", ICU, do they allow that? bar ? To be honest, I think a string to string transformation with a function should be OK, an I also think that natural_sort is rather exotic, so people probably can live with virtual functions by now. It will be immediately usable, at the tiny cost of virtual functions, if one wants an index to be used in ORDER BY Not saying that supporting that in collations is too bad, but I think the overhead of doing that is perhaps too much given the scope of that MDEV. People won't be willing to create their own collations, I doubt that very much. Storing collation attributes (locale, weight level or, accent and case sensitivity, pad-nopad) in the FRM is a good thing, fully independent from numericOrdering. Also, a function can be adopted and given parameter to recognize and handle negatives, whether to remove leading zeroes or not, or handle decimal fractions (with a user-defined decimal point and thousands separator), and this will be too much to put into numericOrdering only.
            bar Alexander Barkov added a comment - - edited

            wlad, yes ICU does support numeric sorting.

            Note, PostgreSQL can use ICU collations, including collation customization:

            DROP COLLATION IF EXISTS c1;
            CREATE COLLATION c1 (provider = icu, locale = 'en@colNumeric=yes;colAlternate=shifted');
            DROP TABLE IF EXISTS t1;
            CREATE TABLE t1 (a VARCHAR(10) COLLATE c1);
            INSERT INTO t1 VALUES ('a1'),('a10'),('a9');
            SELECT * FROM t1 ORDER BY a;
            

              a  
            -----
             a1
             a9
             a10
            (3 rows)
            

            bar Alexander Barkov added a comment - - edited wlad , yes ICU does support numeric sorting. Note, PostgreSQL can use ICU collations, including collation customization: DROP COLLATION IF EXISTS c1; CREATE COLLATION c1 (provider = icu, locale = 'en@colNumeric=yes;colAlternate=shifted' ); DROP TABLE IF EXISTS t1; CREATE TABLE t1 (a VARCHAR (10) COLLATE c1); INSERT INTO t1 VALUES ( 'a1' ),( 'a10' ),( 'a9' ); SELECT * FROM t1 ORDER BY a; a ----- a1 a9 a10 (3 rows)
            serg Sergei Golubchik made changes -
            Status Open [ 1 ] In Progress [ 3 ]

            I think I found I relatively good scheme to encode numbers, so that a valid string remains a valid string, i.e without BCD, and with a relatively small overhead. Moreover, numeric strings remain numeric strings, so that it is not possible to mix up numbers and strings, that contain no digits
            ...
            Sergei Golubchik, any comments on this schema

            I wanted to avoid comparing numbers and non-numbers. That is, I tried to get the same effect as for comparing tuples.

            Consider, for example, "abc-123.4.5alpha2" → ("abc-", 123, ".", 4, ".", 5, "alpha", 2). And comparing this "naturally" with, say, "abc-123.4.5.alphabeta3", first six elements in a tuple would be equal, but "alpha" < "alphabeta", so the first tuple and the first string is "naturally" smaller.

            That's why I used "\0" as a separator. "alpha\0...2" is smaller than "alphabeta\0...3" no matter what bytes follow the zero byte.

            serg Sergei Golubchik added a comment - I think I found I relatively good scheme to encode numbers, so that a valid string remains a valid string, i.e without BCD, and with a relatively small overhead. Moreover, numeric strings remain numeric strings, so that it is not possible to mix up numbers and strings, that contain no digits ... Sergei Golubchik, any comments on this schema I wanted to avoid comparing numbers and non-numbers. That is, I tried to get the same effect as for comparing tuples. Consider, for example, "abc-123.4.5alpha2" → ("abc-", 123, ".", 4, ".", 5, "alpha", 2) . And comparing this "naturally" with, say, "abc-123.4.5.alphabeta3" , first six elements in a tuple would be equal, but "alpha" < "alphabeta" , so the first tuple and the first string is "naturally" smaller. That's why I used "\0" as a separator. "alpha\0...2" is smaller than "alphabeta\0...3" no matter what bytes follow the zero byte.

            I think binary data isn't much of a problem. The NATURAL_SORT_KEY() function could convert numbers to binary data, for example, BCD with a length prefix, and non-numbers — to their weights, like WEIGHT_STRING() function. And its result can then be binary, just a set of bytes that can be memcmp-ed.

            serg Sergei Golubchik added a comment - I think binary data isn't much of a problem. The NATURAL_SORT_KEY() function could convert numbers to binary data, for example, BCD with a length prefix, and non-numbers — to their weights, like WEIGHT_STRING() function. And its result can then be binary, just a set of bytes that can be memcmp-ed.
            wlad Vladislav Vaintroub added a comment - - edited

            serg, this is what I planned originally, but now I believe encoding string to string has its own advantages, it is compact (much more than WEIGHT_STRING + BCD), reversible (if we want to, we can provide NATURAL_SORT_DECODE() to ease the index-only access), and can be used with this or that COLLATE order

            WEIGHT_STRING(NATURAL_SORT_KEY()) can be stored in index, too, if one really wants

            wlad Vladislav Vaintroub added a comment - - edited serg , this is what I planned originally, but now I believe encoding string to string has its own advantages, it is compact (much more than WEIGHT_STRING + BCD), reversible (if we want to, we can provide NATURAL_SORT_DECODE() to ease the index-only access), and can be used with this or that COLLATE order WEIGHT_STRING(NATURAL_SORT_KEY()) can be stored in index, too, if one really wants
            wlad Vladislav Vaintroub added a comment - - edited

            serg, I do not think a string->string requires separators for digits.

            Binary sortkeys, lke in your example, doe need separators. Though, you'll have to take care that actual sortkeys do not contain binary zeros, which is not given for either classical BCD or collation weights.

            point is that natural sort solves the problem where group of digits at the same offset inside string are sorted alpha, rather then numeric. I do not think anyone complained about relative order or digits and non-digits (this is what separators solve), and I'm sure UCA did not put digits in the middle of any script, and into some useful position and this will already ensure the natural order of things. The relative order of digits vs scripts, punctuation,emojis, or Dingbats is preserved, in string transformation, and it should be fine.

            If there is a real need to sort numbers smaller than anything, in then preprocessing string with e.g

             
            REGEXP_REPLACE(s,'([0-9]+)','\t\\1')
            

            will do that trick ( use \0 nstead of \t for _bin collations, the smallest character varies depending on collation)

            wlad Vladislav Vaintroub added a comment - - edited serg , I do not think a string->string requires separators for digits. Binary sortkeys, lke in your example, doe need separators. Though, you'll have to take care that actual sortkeys do not contain binary zeros, which is not given for either classical BCD or collation weights. point is that natural sort solves the problem where group of digits at the same offset inside string are sorted alpha, rather then numeric. I do not think anyone complained about relative order or digits and non-digits (this is what separators solve), and I'm sure UCA did not put digits in the middle of any script, and into some useful position and this will already ensure the natural order of things. The relative order of digits vs scripts, punctuation,emojis, or Dingbats is preserved, in string transformation, and it should be fine. If there is a real need to sort numbers smaller than anything, in then preprocessing string with e.g REGEXP_REPLACE(s,'([0-9]+)','\t\\1') will do that trick ( use \0 nstead of \t for _bin collations, the smallest character varies depending on collation)

            The patch is in bb-10.7-MDEV-4742.

            It is a string->string transformation, and there is an additional parameter natural_sort_key, for decimal separator, which can be e.g "." or "," , to sort fractions correctly.

            There is no designated separator between digits and non-digits (just the digit rsp non-digit characters themselves). This should not cause problems, as all characters that are less than digits 0-9 are controls, punctuation, letter modifiers and similar non-letter stuff (that's about UCA, but also true for all ASCII based encodings)

            wlad Vladislav Vaintroub added a comment - The patch is in bb-10.7- MDEV-4742 . It is a string->string transformation, and there is an additional parameter natural_sort_key, for decimal separator, which can be e.g "." or "," , to sort fractions correctly. There is no designated separator between digits and non-digits (just the digit rsp non-digit characters themselves). This should not cause problems, as all characters that are less than digits 0-9 are controls, punctuation, letter modifiers and similar non-letter stuff (that's about UCA, but also true for all ASCII based encodings)
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Sergei Golubchik [ serg ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Vladislav Vaintroub [ wlad ]
            Status In Review [ 10002 ] Stalled [ 10000 ]

            A generic scheme for number encoding for natural sorting is to prepend the length-1 of the number to the number, and then prepend the length-1 of the length, and repeat it a fixed number of times. The leftmost prefix must be of a fixed length and should be encoded as decimal. Remaining prefixes should be encoded using radix > 10 to make them short.

            Using two levels of encoding, a single digit for the first prefix and a hexadecimal notation for the second prefix will work for digit sequences up to the length of 1TB. Example prefixes:

            • 1 digit: 00
            • 16 digits: 0f
            • 256 digits: 1ff
            • 2^40^ digits: 9ffffffffff
            arturz Artur Zaprzała added a comment - A generic scheme for number encoding for natural sorting is to prepend the length-1 of the number to the number, and then prepend the length-1 of the length, and repeat it a fixed number of times. The leftmost prefix must be of a fixed length and should be encoded as decimal. Remaining prefixes should be encoded using radix > 10 to make them short. Using two levels of encoding, a single digit for the first prefix and a hexadecimal notation for the second prefix will work for digit sequences up to the length of 1TB. Example prefixes: 1 digit: 00 16 digits: 0f 256 digits: 1ff 2^40^ digits: 9ffffffffff

            I think prefixes of 2 bytes for small numbers are too much. For that task, I doubt there will be a lot of numbers longer than 9-10 digits.

            wlad Vladislav Vaintroub added a comment - I think prefixes of 2 bytes for small numbers are too much. For that task, I doubt there will be a lot of numbers longer than 9-10 digits.
            wlad Vladislav Vaintroub made changes -
            Status Stalled [ 10000 ] In Progress [ 3 ]

            serg, bb-10.7-MDEV-4742 contains updated patch , modified past review.

            wlad Vladislav Vaintroub added a comment - serg , bb-10.7- MDEV-4742 contains updated patch , modified past review.
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Sergei Golubchik [ serg ]
            Status In Progress [ 3 ] In Review [ 10002 ]
            serg Sergei Golubchik added a comment - - edited

            Looks good, the only thing that I doesn't like is:

               `c` varchar(30) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL,
            -  `NATURAL_SORT_KEY(c)` varchar(45) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL
            +  `NATURAL_SORT_KEY(c)` varchar(60) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL
            

            that first patch had 50% overhead, now you went up to 2x.

            May be you don't need to encode leading zeros when there aren't any? "1" -> "01", not "010"


            also

            +  if (out->append(buf, natsort_encode_length(n_digits - 1, buf)))
            +    return NATSORT_ERR::ALLOC_ERROR;
            

            this is usually done like

              out->reserve(natsort_max_key_size(n_digits));
              out->length(out->length() + natsort_encode_length(n_digits-1, n->ptr()));
            

            I agree that for a small buffer like yours here it matters little. But better to set a good example in case someone copies it and his buffer won't be small.

            serg Sergei Golubchik added a comment - - edited Looks good, the only thing that I doesn't like is: `c` varchar(30) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL, - `NATURAL_SORT_KEY(c)` varchar(45) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL + `NATURAL_SORT_KEY(c)` varchar(60) CHARACTER SET latin1 COLLATE latin1_bin DEFAULT NULL that first patch had 50% overhead, now you went up to 2x. May be you don't need to encode leading zeros when there aren't any? "1" -> "01", not "010" also + if (out->append(buf, natsort_encode_length(n_digits - 1, buf))) + return NATSORT_ERR::ALLOC_ERROR; this is usually done like out->reserve(natsort_max_key_size(n_digits)); out->length(out->length() + natsort_encode_length(n_digits-1, n->ptr())); I agree that for a small buffer like yours here it matters little. But better to set a good example in case someone copies it and his buffer won't be small.
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Vladislav Vaintroub [ wlad ]
            Status In Review [ 10002 ] Stalled [ 10000 ]
            wlad Vladislav Vaintroub added a comment - - edited

            serg , we have to encode lead zeros count, even if there are none. Otherwise, "1/", "01" , "1a" will sort with "01" in the middle, even though "01" > "1". I added a test for it.
            So I think maybe we should not change that just yet, unless we make a decision to ignore/remove all leading zeros (this would make keys not reversible, and sort unstable).

            I also changed natsort_encode_numeric_string such that it won't allocate anything, but fail, if preallocated string can't accomodate the resulting key. We preallocate at an earlier stage, only once, and do not do reallocs afterwards.

            wlad Vladislav Vaintroub added a comment - - edited serg , we have to encode lead zeros count, even if there are none. Otherwise, "1/", "01" , "1a" will sort with "01" in the middle, even though "01" > "1". I added a test for it. So I think maybe we should not change that just yet, unless we make a decision to ignore/remove all leading zeros (this would make keys not reversible, and sort unstable). I also changed natsort_encode_numeric_string such that it won't allocate anything, but fail, if preallocated string can't accomodate the resulting key. We preallocate at an earlier stage, only once, and do not do reallocs afterwards.
            wlad Vladislav Vaintroub added a comment - - edited

            I've done a little survey what other natsort implementations do w.r.t leading zeros sort, to figure out what to do with them, and here are the results

            1. Martin Pools strnatcmp, which is what PHP is using, sorts leading zeros before no-lead-zeros, "001" < "01" < "1". Windows StrCmpLogicalW (used in the Explorer to sort filenames), as well as CompareStringEx with SORT_DIGITSASNUMBERS flag, sort the same
            2. ICU, with UCOL_NUMERIC_COLLATION, as well as Python natsort ignore leading zeros.

            Ignoring leading zeros, like ICU does, somewhat surprisingly leads to more logical results.

            While other libraries think that "a1b" > "a01c" (because 1 > 01), ICU begs to differ, and this kinda makes sense, because split into components , it is (a,1,b) vs (a,1,c) the difference between 1 and 01 seems less significant that than the difference between b and c.

            With that, I think we should be ignoring lead zero differences, for the upcoming version. That makes keys smaller too, usually requiring only a single byte per number.

            For stable sort, one could still "ORDER BY natural_sort_key(c), c" , this would give exactly "01" < "1" order.

            As a side note, I do like the description of the older version of StrCompareLogicalW algorithm (now it does not work like this anymore).

            A particular case is made for leading zeros in numbers. Of two numbers that have the same numerical value, the lesser for the comparison is whichever has the more leading zeros. However, this is acted on only if the two strings turn out to be otherwise equal. For example:

            "a1b1" < "a01b2" < "a1b2" < "a01b3"

            This one seems combine the strength of stable sort, and prioritizing significant over insignificant differences a la ICU. It seems to be inspired by Unicode collation algorithm, like a weight level for leading zeros.

            wlad Vladislav Vaintroub added a comment - - edited I've done a little survey what other natsort implementations do w.r.t leading zeros sort, to figure out what to do with them, and here are the results Martin Pools strnatcmp , which is what PHP is using, sorts leading zeros before no-lead-zeros, "001" < "01" < "1". Windows StrCmpLogicalW (used in the Explorer to sort filenames), as well as CompareStringEx with SORT_DIGITSASNUMBERS flag, sort the same ICU, with UCOL_NUMERIC_COLLATION, as well as Python natsort ignore leading zeros. Ignoring leading zeros, like ICU does, somewhat surprisingly leads to more logical results. While other libraries think that "a1b" > "a01c" (because 1 > 01), ICU begs to differ, and this kinda makes sense, because split into components , it is (a,1,b) vs (a,1,c) the difference between 1 and 01 seems less significant that than the difference between b and c. With that, I think we should be ignoring lead zero differences, for the upcoming version. That makes keys smaller too, usually requiring only a single byte per number. For stable sort, one could still "ORDER BY natural_sort_key(c), c" , this would give exactly "01" < "1" order. As a side note, I do like the description of the older version of StrCompareLogicalW algorithm (now it does not work like this anymore). A particular case is made for leading zeros in numbers. Of two numbers that have the same numerical value, the lesser for the comparison is whichever has the more leading zeros. However, this is acted on only if the two strings turn out to be otherwise equal. For example: "a1b1" < "a01b2" < "a1b2" < "a01b3" This one seems combine the strength of stable sort, and prioritizing significant over insignificant differences a la ICU. It seems to be inspired by Unicode collation algorithm, like a weight level for leading zeros.

            the branch as of 64206db61af looks good, thanks!

            serg Sergei Golubchik added a comment - the branch as of 64206db61af looks good, thanks!

            elenst, could you please test?

            wlad Vladislav Vaintroub added a comment - elenst , could you please test?
            wlad Vladislav Vaintroub made changes -
            Assignee Vladislav Vaintroub [ wlad ] Elena Stepanova [ elenst ]
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            Assignee Elena Stepanova [ elenst ] Anel Husakovic [ anel ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Due Date 2021-09-14
            anel Anel Husakovic added a comment - Tested and published the [blog] ( https://mariadb.org/10-7-preview-feature-natural-sort/ )
            anel Anel Husakovic made changes -
            Component/s Server [ 13907 ]
            Fix Version/s 10.7.0 [ 26072 ]
            Fix Version/s 10.7 [ 24805 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            anel Anel Husakovic made changes -
            Assignee Anel Husakovic [ anel ] Vladislav Vaintroub [ wlad ]
            Resolution Fixed [ 1 ]
            Status Closed [ 6 ] Stalled [ 10000 ]
            ralf.gebhardt Ralf Gebhardt made changes -
            Fix Version/s 10.7 [ 24805 ]
            Fix Version/s 10.7.0 [ 26072 ]
            serg Sergei Golubchik made changes -
            Assignee Vladislav Vaintroub [ wlad ] Elena Stepanova [ elenst ]
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -

            While I understand the intention behind it and thus don't want to file a bug, I think there is a distinct possibility that some users will find the natural sort of float numbers rather unnatural.
            Especially when the returned values are FLOAT, and the client aligns them as such, it is quite difficult to see the result as natural.

            MariaDB [test]> create or replace table t (f float);
            Query OK, 0 rows affected (0.055 sec)
             
            MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044);
            Query OK, 3 rows affected (0.013 sec)
            Records: 3  Duplicates: 0  Warnings: 0
             
            MariaDB [test]> select * from t order by natural_sort_key(f);
            +-------+
            | f     |
            +-------+
            |   0.5 |
            |  0.33 |
            | 0.044 |
            +-------+
            3 rows in set (0.001 sec)
            

            It is slightly better when the values are actually strings, at least they are formatted as such, and with some effort one can see the logic:

            MariaDB [test]> create or replace table t (f varchar(16));
            Query OK, 0 rows affected (0.053 sec)
             
            MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044);
            Query OK, 3 rows affected (0.012 sec)
            Records: 3  Duplicates: 0  Warnings: 0
             
            MariaDB [test]> select * from t order by natural_sort_key(f);
            +-------+
            | f     |
            +-------+
            | 0.5   |
            | 0.33  |
            | 0.044 |
            +-------+
            3 rows in set (0.001 sec)
            

            Same for negative numbers.

            I wonder what existing best practices say about it, whether the natural sort, when applied to numbers, should consider the numeric order being natural.

            elenst Elena Stepanova added a comment - While I understand the intention behind it and thus don't want to file a bug, I think there is a distinct possibility that some users will find the natural sort of float numbers rather unnatural. Especially when the returned values are FLOAT, and the client aligns them as such, it is quite difficult to see the result as natural. MariaDB [test]> create or replace table t (f float ); Query OK, 0 rows affected (0.055 sec)   MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044); Query OK, 3 rows affected (0.013 sec) Records: 3 Duplicates: 0 Warnings: 0   MariaDB [test]> select * from t order by natural_sort_key(f); + -------+ | f | + -------+ | 0.5 | | 0.33 | | 0.044 | + -------+ 3 rows in set (0.001 sec) It is slightly better when the values are actually strings, at least they are formatted as such, and with some effort one can see the logic: MariaDB [test]> create or replace table t (f varchar (16)); Query OK, 0 rows affected (0.053 sec)   MariaDB [test]> insert into t (f) values (0.5), (0.33), (0.044); Query OK, 3 rows affected (0.012 sec) Records: 3 Duplicates: 0 Warnings: 0   MariaDB [test]> select * from t order by natural_sort_key(f); + -------+ | f | + -------+ | 0.5 | | 0.33 | | 0.044 | + -------+ 3 rows in set (0.001 sec) Same for negative numbers. I wonder what existing best practices say about it, whether the natural sort, when applied to numbers , should consider the numeric order being natural.
            wlad Vladislav Vaintroub added a comment - - edited

            It is certainly possible to design a transformation scheme for fractions, and negatives, it is just not done currently. Some of that, sorting leading zeros in predictable way, and fractions( with either dot or comma as separators) were in some earlier versions of the patch, and were taken out, for the sake of simplicity.

            We currently sort like ICU - positive numbers only, no fractions, ignore leading zeros, no other customizations. We sort slightly better than ICU in one aspect, because they have some in-built limitations about how long a numeric string can be, 256 digits or something like that, I don't recall exactly. We sort slightly worse than ICU in i18n department, because we only consider 0-9 to be digits, and they handle more decimal digits from different scripts.

            What you can be interested in , Python's natsort has some customizations to sort, and allegedly handles fractions, and negatives , when told to handle it, via parameter https://natsort.readthedocs.io/en/master/api.html#the-ns-enum

            But there is no best practice, as far as I understand, and no common definition of natural sort, except that "we intermix strings and numbers, and expect numbers to be sorted as numbers, not ASCII".

            There are a couple of aspects when people want to sort strings as numbers

            • i18n decimal numbers : 0-9 vs decimal numbers in non-latin scripts
            • fractions separators- dot vs comma vs whatever is used in some arabic scripts. Sometimes both dot and comma are used, e.g in Switzerland.
            • fractions - whether to support scientific notation with exponent ?
            • thousands separators - 20'000'000 - is this the same as 20000000? Is 20 000 000 the same as 20000000,00? Those are tricky, as they are can be locale dependent, and alternating, with comma and dot, but do not have to, and people would just separate with space.
            • negatives
            • decimal numbers - should they sort as-is ( bigger than punctuation, less that letters), or less than anything else, including punctuations, or larger than anything else? This is related to script ordering in UCA algorithm.
            wlad Vladislav Vaintroub added a comment - - edited It is certainly possible to design a transformation scheme for fractions, and negatives, it is just not done currently. Some of that, sorting leading zeros in predictable way, and fractions( with either dot or comma as separators) were in some earlier versions of the patch, and were taken out, for the sake of simplicity. We currently sort like ICU - positive numbers only, no fractions, ignore leading zeros, no other customizations. We sort slightly better than ICU in one aspect, because they have some in-built limitations about how long a numeric string can be, 256 digits or something like that, I don't recall exactly. We sort slightly worse than ICU in i18n department, because we only consider 0-9 to be digits, and they handle more decimal digits from different scripts. What you can be interested in , Python's natsort has some customizations to sort, and allegedly handles fractions, and negatives , when told to handle it, via parameter https://natsort.readthedocs.io/en/master/api.html#the-ns-enum But there is no best practice, as far as I understand, and no common definition of natural sort, except that "we intermix strings and numbers, and expect numbers to be sorted as numbers, not ASCII". There are a couple of aspects when people want to sort strings as numbers i18n decimal numbers : 0-9 vs decimal numbers in non-latin scripts fractions separators- dot vs comma vs whatever is used in some arabic scripts. Sometimes both dot and comma are used, e.g in Switzerland. fractions - whether to support scientific notation with exponent ? thousands separators - 20'000'000 - is this the same as 20000000? Is 20 000 000 the same as 20000000,00? Those are tricky, as they are can be locale dependent, and alternating, with comma and dot, but do not have to, and people would just separate with space. negatives decimal numbers - should they sort as-is ( bigger than punctuation, less that letters), or less than anything else, including punctuations, or larger than anything else? This is related to script ordering in UCA algorithm.

            And of course, natural sort sorts strings. Anything can be converted to string, but it does not mean people should be trying to convert numbers to string, and to natural_sort_key, to sort numbers.

            wlad Vladislav Vaintroub added a comment - And of course, natural sort sorts strings. Anything can be converted to string, but it does not mean people should be trying to convert numbers to string, and to natural_sort_key, to sort numbers.
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova made changes -
            elenst Elena Stepanova added a comment - - edited

            As far as testing is concerned, it can be pushed into 10.7 (as of preview-10.7-MDEV-4742-natural-sort cfec49f1dd7241582f03cbcd395e65a9c2ecf4ce).

            Documentation in the KB needs to be updated/enhanced, but it shouldn't be an obstacle to pushing into main.

            As explained earlier in the comments, some decisions remain a subject to further consideration based on users' feedback.

            elenst Elena Stepanova added a comment - - edited As far as testing is concerned, it can be pushed into 10.7 (as of preview-10.7- MDEV-4742 -natural-sort cfec49f1dd7241582f03cbcd395e65a9c2ecf4ce). Documentation in the KB needs to be updated/enhanced, but it shouldn't be an obstacle to pushing into main. As explained earlier in the comments, some decisions remain a subject to further consideration based on users' feedback.
            elenst Elena Stepanova made changes -
            Assignee Elena Stepanova [ elenst ] Sergei Golubchik [ serg ]
            serg Sergei Golubchik made changes -
            Assignee Sergei Golubchik [ serg ] Vladislav Vaintroub [ wlad ]
            wlad Vladislav Vaintroub made changes -
            Fix Version/s 10.7.1 [ 26120 ]
            Fix Version/s 10.7 [ 24805 ]
            Resolution Fixed [ 1 ]
            Status Stalled [ 10000 ] Closed [ 6 ]
            serg Sergei Golubchik made changes -
            Workflow MariaDB v3 [ 61595 ] MariaDB v4 [ 132162 ]
            Dean T Dean Trower added a comment - - edited

            I wrote a nat-sort function for MariaDB/MySQL a few years ago, documented here: https://stackoverflow.com/a/58154535/999120

            It produces a key strings based on an input string, such that sorting by the key string results in a natural sort. They key strings can be stored for re-use, making the process efficient.

            I think it might have more "features" than the one you describe above: It handles negatives (though this could be improved), leading zeros, leading +, version numbers or IP addresses, decimals, and use of commas as thousand separators, and also preserves total sort ordering (i.e. no two different strings nat-sort as equivalent).

            Feel free to borrow it, if you like.

            Dean T Dean Trower added a comment - - edited I wrote a nat-sort function for MariaDB/MySQL a few years ago, documented here: https://stackoverflow.com/a/58154535/999120 It produces a key strings based on an input string, such that sorting by the key string results in a natural sort. They key strings can be stored for re-use, making the process efficient. I think it might have more "features" than the one you describe above: It handles negatives (though this could be improved), leading zeros, leading +, version numbers or IP addresses, decimals, and use of commas as thousand separators, and also preserves total sort ordering (i.e. no two different strings nat-sort as equivalent). Feel free to borrow it, if you like.
            ralf.gebhardt Ralf Gebhardt made changes -
            Labels Preview_10.7

            People

              wlad Vladislav Vaintroub
              rspadim roberto spadim
              Votes:
              2 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.