[MDEV-24250] Sorting behavior changed from 10.1 to 10.3 Created: 2020-11-19  Updated: 2020-11-23  Resolved: 2020-11-20

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.3.25, 10.2, 10.3, 10.4, 10.5
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Torben Hansen Assignee: Sergei Golubchik
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Ubuntu 20.04


Issue Links:
Problem/Incident
is caused by MDEV-11320 MariaRocks: rocksdb.type_text_indexes... Closed

 Description   

The sorting behavior for columns with CRLF (\n\r) values seem to have changed from MariaDB 10.1 to 10.3

Simple example table

CREATE TABLE `test` (
  `a` int(11) NOT NULL AUTO_INCREMENT,
  `b` varchar(255) NOT NULL,
  `c` text NOT NULL,
  `d` varchar(255) NOT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB AUTO_INCREMENT=6 DEFAULT CHARSET=utf8;
 
INSERT INTO `test` VALUES (1,'A','\r\n','CRLF'),(2,'A','',''),(3,'A','',''),(4,'A','\r\n','CRLF'),(5,'A','','');

Results on MariaDB 10.1 (Ubuntu 18.04 LTS)

SELECT * FROM test ORDER BY b ASC, c ASC, a ASC;
 
+---+---+----+------+
| a | b | c  | d    |
+---+---+----+------+
| 2 | A |    |      |
| 3 | A |    |      |
| 5 | A |    |      |
| 1 | A |    | CRLF |
| 4 | A |    | CRLF |
+---+---+----+------+

Results on MariaDB 10.3 (Ubuntu 20.04 LTS)

SELECT * FROM test ORDER BY b ASC, c ASC, a ASC;
 
+---+---+----+------+
| a | b | c  | d    |
+---+---+----+------+
| 1 | A |    | CRLF |
| 4 | A |    | CRLF |
| 2 | A |    |      |
| 3 | A |    |      |
| 5 | A |    |      |
+---+---+----+------+

The ordering for column c has obviously changed, since rows with a CRLF value are now shown before rows with an empty value.

I did not find any documentation describing this change in sorting behavior. Is this a bug or can this behavior be changed "back" by a setting?



 Comments   
Comment by Alice Sherepa [ 2020-11-20 ]

Thanks! I repeated on 10.2-10.5:

MariaDB [test]> create table t1 ( a int, c text);
Query OK, 0 rows affected (0.043 sec)
 
MariaDB [test]> insert into t1 values (1,'\r\n'),(2,'');
Query OK, 2 rows affected (0.007 sec)
Records: 2  Duplicates: 0  Warnings: 0
 
MariaDB [test]> select * from t1 order by c;
+------+------+
| a    | c    |
+------+------+
|    1 | 
   |
|    2 |      |
+------+------+
2 rows in set (0.003 sec)
 
MariaDB [test]> select '\r\n' < '';
+-------------+
| '\r\n' < '' |
+-------------+
|           1 |
+-------------+
1 row in set (0.001 sec)

Comment by Sergei Golubchik [ 2020-11-20 ]

This is correct behavior. In MariaDB most collations have what SQL standard calls "PAD" behavior, that is when two strings are compared they are padded with spaces to the max string length. And as '\r' < ' ' you also get '\r' < ''.

You can use the NOPAD collation if you want strings not to be padded on comparison:

MariaDB [test]> SELECT a,b,hex(c),d FROM test ORDER BY b ASC, c COLLATE utf8_general_ci ASC, a ASC;
+---+---+--------+------+
| a | b | hex(c) | d    |
+---+---+--------+------+
| 1 | A | 0D0A   | CRLF |
| 4 | A | 0D0A   | CRLF |
| 2 | A |        |      |
| 3 | A |        |      |
| 5 | A |        |      |
+---+---+--------+------+
5 rows in set (0.00 sec)
 
MariaDB [test]> SELECT a,b,hex(c),d FROM test ORDER BY b ASC, c COLLATE utf8_general_nopad_ci ASC, a ASC;
+---+---+--------+------+
| a | b | hex(c) | d    |
+---+---+--------+------+
| 2 | A |        |      |
| 3 | A |        |      |
| 5 | A |        |      |
| 1 | A | 0D0A   | CRLF |
| 4 | A | 0D0A   | CRLF |
+---+---+--------+------+
5 rows in set (0.00 sec)

Comment by Torben Hansen [ 2020-11-22 ]

Thanks for your feedback. Since I'm using a DBAL, I have no real influence on how queries are created, so I can not easily add

c COLLATE utf8_general_nopad_ci ASC

as suggested.

My initial question is/was, if this behavior somehow changed between MariaDB 10.1 and 10.3? If so, this should at least be documented somewhere.

Comment by Sergei Golubchik [ 2020-11-23 ]

It was changed in commit https://github.com/MariaDB/server/commit/6fb94c3e430d as a fix for MDEV-11320

My previous comment described the correct string comparison behavior. It was implemented before 10.1. That is, 10.1 also pads strings with spaces and '\n' is sorted before '' there. What I omitted — in binary character set strings are padded with '\0', not with spaces. In that case _binary '\n' > _binary ''.

Now 10.1 had a bug where BLOB and TEXT columns were always padding with '\0', even though TEXT columns have a proper character set. It was a bug in comparison and sorting of TEXT columns and it was fixed in 10.2.5

Comment by Torben Hansen [ 2020-11-23 ]

Thanks a lot for your explanation.

Generated at Thu Feb 08 09:28:33 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.