[MDEV-24263] REPLACE() a string with NULL leads to unexpected result for non UTF-8 char sets Created: 2020-11-21  Updated: 2023-04-27

Status: Confirmed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.1, 10.2, 10.3, 10.4, 10.5
Fix Version/s: 10.4, 10.5

Type: Bug Priority: Major
Reporter: Lau Assignee: Oleksandr Byelkin
Resolution: Unresolved Votes: 0
Labels: None


 Description   

I expect both fields of the following query to be NULL, however only the utf8mb4_expected field actually is. Instead, the latin1_unexpected field returns the original string of foo, which is unexpected.

SELECT
    REPLACE(convert('foo' using utf8mb4), convert('bar' using utf8mb4), null) as `utf8mb4_expected`,
    REPLACE(convert('foo' using latin1), convert('bar' using latin1), null) as `latin1_unexpected`;

+------------------+------------------+
| utf8mb4_expected | latin_unexpected |
+------------------+------------------+
| NULL             | foo              |
+------------------+------------------+
1 row in set (0.01 sec)

This seems to be in issue in all versions of MariaDB and all versions of MySQL below 8.0.x (works on MySQL 8.0.21).



 Comments   
Comment by Sergei Golubchik [ 2020-11-21 ]

on the contrary, latin1 result is correct, there is no bar in foo, so the replacement doesn't take place, it does not matter what the value of the replacement string is. In the utf8mb4 case the code returns NULL too early, before making sure that the replacement value should be used.

Comment by Lau [ 2020-11-21 ]

That is incorrect. REPLACE() should return NULL if any of its arguments is NULL.

From Changes in MySQL 8.0.0 (2016-09-12, Development Milestone):

REPLACE('a', BINARY 'b', NULL) returned 'a' rather than NULL. (Bug #79912, Bug #22523836)

I checked the behavior of other string methods again, and they all work as expected and return NULL if any argument is NULL.

 select LOCATE(null, convert('foo' using latin1)),
        SUBSTRING(null, 5),
        INSERT(convert('foo' using latin1), 42, 2, null);

The closest one to REPLACE() of the functions I check is INSERT(), which will only insert a new string into an existing string, if the start position argument is within the length of the existing string. It also returns NULL if the new string to insert is NULL and the start position is larger than the existing string (meaning even if no string would be inserted). Therefore, the function behaves exactly as expected, by returning NULL if any argument is NULL, independent of whether the argument is actually used or not.

This is also stated in the MySQL docs:

Returns the string str, with the substring beginning at position pos and len characters long replaced by the string newstr. Returns the original string if pos is not within the length of the string. Replaces the rest of the string from position pos if len is not within the length of the rest of the string. Returns NULL if any argument is NULL.

Therefore, the correct order is to first check all arguments for NULL and return NULL if any arguments is actually NULL. If no argument is NULL, then execute the actual logic of the function. And this needs to be consistent across all possible char sets/collations.

Comment by Sergei Golubchik [ 2020-11-21 ]

Technically REPLACE() should behave the way it is documented, which can be anything. Alas, this detail is not documented (yet).
There is no REPLACE() in the SQL Standard, so we cannot refer to the standard to know what todo.
But generally there is no rule in the standard that "an expression returns NULL if any argument is NULL".
Consider:

MariaDB [test]> SELECT 0 AND NULL, NULL OR 1, IF(TRUE, 1, NULL);
+------------+-----------+-------------------+
| 0 AND NULL | NULL OR 1 | IF(TRUE, 1, NULL) |
+------------+-----------+-------------------+
|          0 |         1 |                 1 |
+------------+-----------+-------------------+
 
MariaDB [test]> SELECT 1 + NULL, 2 = NULL, CONCAT('a', NULL);
+----------+----------+-------------------+
| 1 + NULL | 2 = NULL | CONCAT('a', NULL) |
+----------+----------+-------------------+
|     NULL |     NULL | NULL              |
+----------+----------+-------------------+
1 row in set (0.00 sec)

in the first case NULL is not needed to determine the value of the expression and it does not make the result NULL. In the second case it is and it does.

You can think of NULL as of "unknown value", it's neither true, nor false, it's unknown.

MariaDB [test]> select NULL IS TRUE, NULL IS FALSE, NULL IS UNKNOWN;
+--------------+---------------+-----------------+
| NULL IS TRUE | NULL IS FALSE | NULL IS UNKNOWN |
+--------------+---------------+-----------------+
|            0 |             0 |               1 |
+--------------+---------------+-----------------+
1 row in set (0.00 sec)

In the +, =, CONCAT(), SUBSTRING(), and LOCATE() case the result is clearly unknown if one of the arguments is unknown. INSERT() can be argued either way, but it is explicitly documented to return NULL if any argument is NULL.

In the REPLACE() case the unknown value surely does not affect the result if the replacement never takes place. This is also the clear intention of the code — it does not evaluate the third argument until it knows that the replacement, indeed, will take place.

Comment by Lau [ 2020-11-21 ]

Of course any issue can be argued in at least two directions.

However, the point I am making here is, that the current behavior is unexpected. Not just from the standpoint, that there are different results for UTF-8 and other charsets, but also from the standpoint that if this string function returns a non-null value even if a parameter is NULL, than this is the only string function I know of that does that (if you know others, that conditionally return NULL for a NULL parameter, please state them).

I recently took inventory over all functions our provider supports and documented them internally: from the 99 SQL functions we currently translate, all that take arguments return NULL if any argument is NULL, with the exception of course for the special case of non-UTF-8 charsets in the REPLACE() function that I outlined and reported as a bug.

Its true, that there is no official standard here. However, I looked at how SQL Server and PostgreSQL handle REPLACE():

SQL Server:

Returns NULL if any one of the arguments is NULL.

PostgreSQL:
I didn't find any official docs about the NULL handling, but there is an indirect prove, that it works as I expect by seeing that Npgsql does not override the Null_semantics_applied_when_comparing_two_functions_with_multiple_nullable_arguments() test in their NullSemanticsQueryNpgsqlTest implementation, which means it works out-of-the box as it should be (this is the test that fails for MariaDB and latin1).

As for MySQL, I made my case that the expected design can be seen in how the similar INSERT() function handles arguments. While it is true, that there is no explicit documentation yet for REPLACE() and NULL handling in the MySQL docs, the previously referenced release notes should be enough evidence for what the intended design is, and that returning a non-null value here has been categorized as a bug.

That being said, since I am the lead developer of the Pomelo.EntityFrameworkCore.MySql provider and I do my best to keep the provider not just compatible with all officially supported version of MySQL but also with all officially supported version of MariaDB, any help in reasonable cases like this one is greatly appreciated. Since we have over 18,000 tests in place, we usually find all divergences between MariaDB and MySQL, and also the ones between each individual version update, and take the time to report them and then deal with them in our provider.

Comment by Sergei Golubchik [ 2020-12-14 ]

There are string functions that don't necessarily return NULL if some arguments are NULL:

MariaDB [test]> select concat_ws('/', 'a', null, 'b');
+--------------------------------+
| concat_ws('/', 'a', null, 'b') |
+--------------------------------+
| a/b                            |
+--------------------------------+
 
MariaDB [test]> select elt(2, null, 'a');
+-------------------------+
| elt(2, null, 'a', null) |
+-------------------------+
| a                       |
+-------------------------+
 
MariaDB [test]> select field('a', null, 'a');
+-----------------------+
| field('a', null, 'a') |
+-----------------------+
|                     2 |
+-----------------------+
 
MariaDB [test]> select make_set(5, 'a', NULL, 'c');
+-----------------------------+
| make_set(5, 'a', NULL, 'c') |
+-----------------------------+
| a,c                         |
+-----------------------------+

And while CONCAT_WS is a special case, ELT clearly follows the logic "the result is not NULL unless NULL argument is used". FIELD and MAKE_SET just don't return NULL at all, so it looks. It's rather inconsistent, unfortunately. I'd rather pick one rule and stick with it.

The difference in REPLACE behavior with single-byte and multi-byte character sets is unquestionably a bug. We will fix it. The question only is, what the behavior should be. And I lean towards the logical "the result is not NULL unless NULL argument is used" rule.

Comment by Lau [ 2020-12-15 ]

The question only is, what the behavior should be. And I lean towards the logical "the result is not NULL unless NULL argument is used" rule.

While this might make sense for MariaDB as an island, I would strife for as much conformity with MySQL as reasonably possible. Since MySQL has categorized the old behavior as a bug, I think that MariaDB should implement the current MySQL behavior as well.

After MySQL, I would strife for conformity with other major database systems. As already stated previously, Microsoft SQL Server and PostgreSQL implement the behavior in the same way that MySQL now does.

Differences in details like this will inevitably lead to bugs that are very hard to track down (and therefore costly), when migrating from MySQL to MariaDB.

Comment by Sergei Golubchik [ 2020-12-16 ]

Okay, you win. I've tried SQL Server and PostreSQL, but also SQLite, DB2 and Firebase. Everybody returns NULL if the last argument is NULL. I still think it's illogical that REPLACE('a', 'b', NULL) IS UNKNOWN, as far as I can see it's perfectly known. But if every other implementation I've tried does the illogical thing, so will we.

Generated at Thu Feb 08 09:28:39 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.