[MDEV-12805] inconsistency in NULL handling, JSON type, CONNECT engine Created: 2017-05-15  Updated: 2017-07-23  Resolved: 2017-07-23

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - Connect
Affects Version/s: 10.1.23
Fix Version/s: 10.1.26, 10.0.32, 10.2.8

Type: Bug Priority: Major
Reporter: Alice Sherepa Assignee: Olivier Bertrand
Resolution: Fixed Votes: 0
Labels: None
Environment:

ubuntu 16.04



 Description   

Inconsistency in NULL handling

JSON file:

[
    {
        "i": 3,
        "k": 
          {
            "WHAT": "milk",
            "AMOUNT": null
          }
    },
    {
        "i": 4,
        "k": 
          {
            "WHAT": null,
            "AMOUNT": 5
          }
    },
      {
        "i": 5,
        "k": 
          [{
            "WHAT": null,
            "AMOUNT": 10
          },
          {
            "WHAT": "sugar",
            "AMOUNT": 20
          },
           {
            "WHAT": null,
            "AMOUNT": 30
          }
        ]
 
    }
    
]

Null value is substituted by string "???"

CREATE TABLE `t1`  ENGINE=CONNECT `TABLE_TYPE`='JSON' `FILE_NAME`='exp.json' ;
SELECT * FROM t1;
+---+----------+
| i | k        |
+---+----------+
| 3 | milk ??? |
| 4 | ??? 5    |
| 5 | ??? 10   |
+---+----------+

When adding 'level=1' in option_list:

CREATE TABLE `t2` ENGINE=CONNECT `TABLE_TYPE`='JSON' `FILE_NAME`='exp.json' 
OPTION_LIST='level=1' ;
SELECT * FROM t2;
+---+--------+----------+
| i | k_WHAT | k_AMOUNT |
+---+--------+----------+
| 3 | milk   | NULL     |
| 4 | NULL   | 5        |
| 5 | NULL   | 10       |
+---+--------+----------+

When using [" and "] Null is substituted by "". If null is in last position, then it is not represented

create table t3 (
i int(2) field_format='i:[]:NUMBER',
WHAT char(32) field_format='k:[" and "]:WHAT',
Amm char(32) field_format='k:[", "]:AMOUNT'
)engine=CONNECT table_type=JSON File_name='exp.json';
 
select * from t3;
+------+------------+------------+
| i    | WHAT       | Amm        |
+------+------------+------------+
|    3 | milk       | NULL       |
|    4 | NULL       | 5          |
|    5 |  and sugar | 10, 20, 30 |
+------+------------+------------+



 Comments   
Comment by Olivier Bertrand [ 2017-05-15 ]

All this work as designed.

The first case is the "text" of the object meaning all key values separated by blanks. The problem is how to represent a null value? NULL would seem nice but prevents to make the difference between a null value and a "NULL" string. "???" is perhaps a bad choice. Suggestions welcome.

The second example is OK. As for the first one, only the first object of the k array is displayed.

The third one is a "calculated array". Null values are replaced by CONNECT by "pseudo" null values (void string or 0 for numeric types) This is clearly necessary for numeric calculations such as SUM. Here again, how to represent nulls when concatenating strings? My feeling is that these formats are not meant for objects that can contain nulls anyway.

For instance, to have a complete representation of this file, your table could be created as:

create table t4 (
i int(2) not null,
what char(32) field_format='k:[X]:WHAT',
amount int field_format='k:[X]:AMOUNT'
) engine=CONNECT table_type=JSON File_name='exp.json';

Then the query:

select * from t4;

will return:

i what amount
3 milk NULL
4 NULL 5
5 NULL 10
5 sugar 20
5 NULL 30
Comment by Alice Sherepa [ 2017-05-15 ]

I understand why it was done like that. Sql operations with null result as null, so concat will be null and it is not, what we want her.
Maybe it should be option for arrays, where user defines how null should be handled, something like [ "---"] )
And I guess it needs to be documented that default is "???"

Also about sum function, if there was null as result, we get null or 0 (if not null field), it is confusing with 0, maybe there should be warning about null value.

And also I wondered why in the t3 table the result is not like " and sugar and ", that is why I did not expanded that field)

Comment by Olivier Bertrand [ 2017-05-15 ]

If you prefer "---" to "???" I'll do it. No problem.
About "and sugar and" it is just the way it is implemented. However I'd prefer just "sugar".

Comment by Elena Stepanova [ 2017-05-15 ]

Maybe just <null>?
Whatever we choose, obviously it will never be unique, because there can always be a string value like that.

Comment by Olivier Bertrand [ 2017-05-15 ]

<null> is Ok.

Comment by Alice Sherepa [ 2017-05-15 ]

sorry, I didn't mean "---", it was "---" some string, that user will decide, how he wants to see null appeared. And maybe to set default to <null>

Comment by Olivier Bertrand [ 2017-05-17 ]

This is an interresting topic. One question is: what are null values in a json document?
According to most people on the Internet, it depends on the requirements of application.

For CONNECT, the problem it not when a table column directly access a value, which is represented by NULL or <null> by client programs, but when representing object "text" or concatenate or calculated array values. Both are specific to CONNECT, there are no such things in the standard JSON handling by other products.

For example, supposing we have the following pres.json file:

{"id":32,"name":{"first":"Franklin","second":"Delanoe","last":"Roosevelt"},"party":"Democratic","pets":["dog",null]}
{"id":35,"name":{"first":"John","second":"Fitzgerald","last":"Kennedy"},"party":"Democratic","pets":[null,"cat"]}
{"id":37,"name":{"first":"Richard","second":null,"last":"Nixon"},"party":"Republican","pets":["dog",null,"bird"]}
{"id":40,"name":{"first":"Ronald","last":"Reagan"},"party":"Republican","pets":[]}
{"id":43,"name":{"first":"George","second":"W.","last":"Bush"},"party":"Republican","pets":["dog","cat"]}
{"id":44,"name":{"first":"Barack","second":null,"last":"Obama"},"party":"Democratic","pets":["cat"]}

One user can wish to have a simple view on it creating a table such as:

create or replace table pres2 (
id int(3) not null,
name char(50) not null,
party char(24),
pets varchar(255) field_format='pets.[", "]',
nb int(2) field_format='pets.[#]')
engine=connect table_type=json file_name='pres.json'
lrecl=350 option_list='pretty=0';

However, when executing:

select * from pres2;

The result currently obtained is:

id name party pets nb
32 Franklin Delanoe Roosevelt Democratic dog 2
35 John Fitzgerald Kennedy Democratic , cat 2
37 Richard ??? Nixon Republican dog, bird 3
40 Ronald Reagan Republican   0
43 George W. Bush Republican dog, cat 2
44 Barack ??? Obama Democratic cat 1

Even replacing "???" by "<null>" will not be satisfactory for him and the pet count is clearly wrong. The proper answer would be something such as:

id name party pets nb
32 Franklin Delanoe Roosevelt Democratic dog 1
35 John Fitzgerald Kennedy Democratic cat 1
37 Richard Nixon Republican dog, bird 2
40 Ronald Reagan Republican   0
43 George W. Bush Republican dog, cat 2
44 Barack Obama Democratic cat 1

This means that in these cases, the best handling of NULL values is probably to consider that they are equivalent to no value at all (like for Ronald Reagan).

If someone really wants to see NULL values, this is probably that he is more interrested by the file contain than by its values, then he can use the "*" format to directly see the JSON representation of it.

What do you think of it?

Comment by Alice Sherepa [ 2017-05-17 ]

https://en.wikipedia.org/wiki/Null_(SQL)#Null_propagation

so the real question is why input file has NULL values there. Maybe it was some kind of corruption? Or maybe we know that there was some pet, just it is unknown which exactly? Or maybe he had an elephant, which is formally not a pet - but it worth mentioning? Or maybe we were not able to find any information about that, but this work is expected to be performed soon? RDBMS must not guess what exactly NULL may mean. Interpretation of NULL is up to users of RDBMS, but RDBMS must follow standard way of handling NULL values.
btw most of fuctions here https://dev.mysql.com/doc/refman/5.7/en/json-modification-functions.html return NULL if argument is NULL

Comment by Olivier Bertrand [ 2017-05-17 ]

I agree that the pres.json file is sort of artificial. But nulls could be the result of DELETE that could not be done and replaced by UPDATE to null. But it also shows that it would be much better if JSON could not contain NULLs.

Now, according to your first site, NULL is somewhat ambiguous. After saying that nulls indicate missing values, the article describes what to do with nulls as having unknown values.

What MariaDB does with nulls is:

select 1 = NULL, concat('a',NULL,'b'), 4 + NULL;

1 = NULL concat('a',NULL,'b') 4 + NULL
NULL NULL NULL

Applied to my example, if CONNECT apply this, the result would be:

id name party pets nb
32 Franklin Delanoe Roosevelt Democratic NULL NULL
35 John Fitzgerald Kennedy Democratic NULL NULL
37 NULL Republican NULL NULL
40 Ronald Reagan Republican   0
43 George W. Bush Republican dog, cat 2
44 NULL Democratic cat 1

This position seems to me rather dogmatic. However, I could add an option enabling the user to specify how he wants nulls to me handled.

Comment by Elena Stepanova [ 2017-05-25 ]

In regard to the example from Olivier's comment :

as a regular user who doesn't care about internals and standards, I would actually expect the output from that example to look like this:

+----+----------------------------+------------+----------------+------+
| id | name                       | party      | pets           | nb   |
+----+----------------------------+------------+----------------+------+
| 32 | Franklin Delanoe Roosevelt | Democratic | dog, ???       |    2 |
| 35 | John Fitzgerald Kennedy    | Democratic | ???, cat       |    2 |
| 37 | Richard ??? Nixon          | Republican | dog, ???, bird |    3 |
| 40 | Ronald Reagan              | Republican |                |    0 |
| 43 | George W. Bush             | Republican | dog, cat       |    2 |
| 44 | Barack ??? Obama           | Democratic | cat            |    1 |
+----+----------------------------+------------+----------------+------+
6 rows in set (0.00 sec)

(or whatever visible representation of NULL we use instead of ???, it's not the point).

That is, I agree with Alice in the part that the presence of NULL both in "second" field and in "pets" array most likely indicates that something went wrong. For the second name, if a person doesn't have it at all, it should probably be absent, like it is for Ronald Reagan; while NULL means that it's either unknown or failed to be stored, and should be taken care of. For pets it's even more obvious, there should be no NULL or empty values, they are meaningless; if they're there, something is definitely wrong.

At the same time (again, as a user who doesn't care as much about standards as about common sense), I agree with Olivier that producing NULL as a result string for all those fileds wouldn't do me any good at all, it's investigation nightmare, even though it's a standard. Luckily, I think for now Olivier has an excuse for not following, because we can say that it is not really functions, it is just some internal magic of CONNECT JSON, so Olivier has some leeway here to decide how it works "by design".

All in all, I suggest to leave it for now, and maybe revisit it later when we have complaints or opinions from real users – after all, that's what our community testing is for.

Comment by Olivier Bertrand [ 2017-07-23 ]

I have added a new string session variable connect_json_null that can be set to the representation of nulls in object text or array list. Its default value is <null>.

It can also be set to NULL, indicating to ignore nulls in these case.

Comment by Olivier Bertrand [ 2017-07-23 ]

Concerning your example, it will look like:

+----+----------------------------+------------+-------------------+------+
| id | name                       | party      | pets              | nb   |
+----+----------------------------+------------+-------------------+------+
| 32 | Franklin Delanoe Roosevelt | Democratic | dog, <null>       |    2 |
| 35 | John Fitzgerald Kennedy    | Democratic | <null>, cat       |    2 |
| 37 | Richard <null> Nixon       | Republican | dog, <null>, bird |    3 |
| 40 | Ronald Reagan              | Republican |                   |    0 |
| 43 | George W. Bush             | Republican | dog, cat          |    2 |
| 44 | Barack <null> Obama        | Democratic | cat               |    1 |
+----+----------------------------+------------+-------------------+------+
6 rows in set (0.00 sec)

With the default value of connect_json_null and as:

+----+----------------------------+------------+-----------+------+
| id | name                       | party      | pets      | nb   |
+----+----------------------------+------------+-----------+------+
| 32 | Franklin Delanoe Roosevelt | Democratic | dog       |    2 |
| 35 | John Fitzgerald Kennedy    | Democratic | cat       |    2 |
| 37 | Richard Nixon              | Republican | dog, bird |    3 |
| 40 | Ronald Reagan              | Republican |           |    0 |
| 43 | George W. Bush             | Republican | dog, cat  |    2 |
| 44 | Barack Obama               | Democratic | cat       |    1 |
+----+----------------------------+------------+-----------+------+
6 rows in set (0.00 sec)

when ignoring nulls. Note that the nb column will still count null values. However, when making calculation on arrays or when retrieving min or max values, nulls will be always ignored.

Generated at Thu Feb 08 08:00:36 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.