[MDEV-15558] Save real data type in addition to existing data type Created: 2018-03-13  Updated: 2023-12-22

Status: Open
Project: MariaDB Server
Component/s: Data types
Fix Version/s: None

Type: New Feature Priority: Major
Reporter: Diego Dupin Assignee: Alexander Barkov
Resolution: Unresolved Votes: 3
Labels: None

Issue Links:
Blocks
is blocked by MDEV-17832 Protocol: extensions for Pluggable ty... Closed
is blocked by MDEV-20042 Implement EXTRA2_FIELD_DATA_TYPE_INFO... Closed

 Description   

Saving "real" data type

Data type are actually not the "real" data type, but "storage" data type.
Storing an additional byte in frm files to save "real" data type information would solve lots of issues.

BOOLEAN type means something compare to storage data type BIT(1) : that means Connector may return true/false and not 0/1.
JSON type means that data is not just a string, but that string can be a parse to JSON object.

Connectors actually don't know the "real" data type, and it's needed.
Example with JSON: MySQL server save JSON field differently. The result is having a dedicated MYSQL_TYPE_JSON Data type that is sent to connectors.
Connectors may then handle the data differently. The most popular javascript driver use that information to create a JSON object from the string.

connection.query("insert into user(name, additional) value (?, ?)", ['john', {game:'myGame', level:23}]);
connection.query("select from user", (err, rows) => {
	//with mysql
        //rows[0] = {name:'john', additional: {game:'myGame', level:23}}
 
	//with mariadb :
        //rows[0] = {name:'john', additional: "{game:'myGame', level:23}"}
});

MariaDB saves JSON like a string, then connector has no information that this field is a JSON object.

That additional byte would permit a lot of new functionalities: BOOLEAN, JSON ok, but that may permit new types like IPV4 / IPV6 / UUID that are store are text.

Another minor annoyance is that SHOW CREATE could return initial type :

CREATE TABLE `jsondata` (val1 JSON);
SHOW CREATE TABLE `testj`.`jsondata`;
//=> CREATE TABLE `jsondata` (
  `val1` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
//=> information about JSON is lost

Possible non breaking change of protocol

Current protocol for Column definition packet is :

string<lenenc> catalog (always 'def')
string<lenenc> schema
string<lenenc> table alias
string<lenenc> table
string<lenenc> column alias
string<lenenc> column
int<lenenc> length of fixed fields (=0xC)
int<2> character set number
int<4> max. column size
int<1> Field types
int<2> Field detail flag
int<1> decimals
int<2> - unused -

First field always send string<lenenc> catalog (always 'def') = 0x03 0x64 0x65 0x66
no driver does use this information, because catalog doesn't exist in Mariadb.

This field can be used to transmit to driver this additional metadata byte.
This would permit that existing driver will receive the "storage" data type



 Comments   
Comment by Rasmus Johansson (Inactive) [ 2018-05-25 ]

serg can you review this proposal for 10.4?

Comment by Sergei Golubchik [ 2018-05-25 ]

There is no "real" data type. There are data types, and data type aliases. Alias is, exactly, just a different name for something. It's not a new or some "real" data type.

The issue here seems to be not the "real" data type, but that the connector cannot tell JSON object from a string. It cannot do it based on the data type, because SQL:2016 (part 2, section 4.46 JSON data handling in SQL) says explicitly

NOTE 116 — There is no SQL <data type> whose value space is SQL/JSON items or SQL/JSON sequences.

It's easy to pass some flag over to connectors, if the server would know iftself what columns are supposed to contain JSON items. Perhaps the standard expected that it'll be done with a UDT?

Comment by Vladislav Vaintroub [ 2018-05-25 ]

is JSON in the create table not a pretty strong indicator that column is supposed to contain JSON items.
same BOOL etc

This information gets lost , it is not in frm and anywhere. But it should be there I think

Comment by Vladislav Vaintroub [ 2018-05-25 ]

It is a problem for at least JDBC and .NET. they have to guess what is the datatype, and they support richer type systems than MySQL or MariaDB does
tinyint(1) - is this bool? bit(1) - is this bool? CHAR(36) - is this GUID? BINARY(16) - is this GUID? BINARY(16) - is this IP address (ok, this is fictional, no connector needs IP addresses)

If server supported annotation, it would help

Comment by Alexander Barkov [ 2018-05-28 ]

The same problem exists for the GEOMETRY data type variants.
Suppose I start mysql --column-type-info test and run this script:

CREATE OR REPLACE TABLE t1 (a POINT, b POLYGON);
SELECT * FROM t1;

Field   1:  `a`
Catalog:    `def`
Database:   `test`
Table:      `t1`
Org_table:  `t1`
Type:       GEOMETRY
Collation:  binary (63)
Length:     4294967295
Max_length: 0
Decimals:   0
Flags:      BLOB BINARY 
 
Field   2:  `b`
Catalog:    `def`
Database:   `test`
Table:      `t1`
Org_table:  `t1`
Type:       GEOMETRY
Collation:  binary (63)
Length:     4294967295
Max_length: 0
Decimals:   0
Flags:      BLOB BINARY 

Notice, it reports both columns as GEOMETRY. The precise data types POINT and POLYGON are not visible.

The good news is that GEOMETRY at least exposes its additional attributes in INFORMATION_SCHEMA.GEOMETRY_COLUMNS.
https://mariadb.com/kb/en/library/information-schema-geometry_columns-table/

So for example one can use a query like this to know the exact types:

SELECT F_GEOMETRY_COLUMN, GEOMETRY_TYPE
FROM   INFORMATION_SCHEMA.GEOMETRY_COLUMNS
WHERE   G_TABLE_SCHEMA='test' AND G_TABLE_NAME='t1';

+-------------------+---------------+
| F_GEOMETRY_COLUMN | GEOMETRY_TYPE |
+-------------------+---------------+
|                   |             1 |
|                   |             3 |
+-------------------+---------------+

But I think that the exact data type should be available on the client automatically, without having to use I_S:

  • Using I_S involves an extra query, which is not good for performance
  • We cannot add I_S tables for every new data type

Note, DESC and SHOW CREATE actually report the exact data type:

DESC  t1;

+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| a     | point   | YES  |     | NULL    |       |
| b     | polygon | YES  |     | NULL    |       |
+-------+---------+------+-----+---------+-------+

But for connectors, it's a new query again, not any better than using I_S.

Comment by Vladislav Vaintroub [ 2018-05-28 ]

Bar, the problem is that we do not want to issue an I_S query for every column in resultset metadata. It should be sent with resultset metadata,

And BTW, the missing protocol-level documentation for Geometry type is probably the reason for missing support by the connectors.

Comment by Alexander Barkov [ 2018-05-28 ]

I agree, I also prefer to have the exact data type name automatically on the client side.

I'm not aware of any special protocol-level features of GEOMETRY. Perhaps there are no any. We need to ask holyfoot to make sure.

Comment by Vladislav Vaintroub [ 2018-05-28 ]

Resultset metadata should be sufficient to determine data type, or alias, or whatever people used during their CREATE TABLE. an extra I_S query is not an option, cannot be done within connector without increasing latency by the factor of 2?

Comment by Alexander Barkov [ 2018-05-28 ]

I propose we make GEOMETRY report the exact data type first. Then we can extrapolate this for all new data types.

Comment by Diego Dupin [ 2018-06-22 ]

bar for geometry, connector can still know type (data type MYSQL_TYPE_GEOMETRY=255), then parse blob to know geometry sub type (WKBPoint / WKBLineString, ...).
for JSON, connector doesn't event know that blob is JSON for example.

Comment by Vladislav Vaintroub [ 2018-06-22 ]

and for BOOLEAN , the connector does not know it is BOOLEAN, for example

Comment by Rasmus Johansson (Inactive) [ 2018-08-03 ]

How should this work? Should the application that sends the data for storage in the database inform what datatype it was using for each piece of data before converting it to the storage datatype? Isn't that what the connectors are supposed to do?

Comment by Vladislav Vaintroub [ 2018-08-03 ]

ratzpo,
No, application does not mark the data type. DDL does . We allow things like BOOL or JSON in DDL. But we do not preserve the information, and BOOL becomes TINYINT(1), and JSON becomes LONGTEXT.

The original type information is lost with that, but sometimes you want driver to know the actual type. An example is ResultSet.getObject() in JDBC. It should return a Boolean, not an Integer for a column declared as BOOL.

Comment by Sergei Golubchik [ 2018-08-08 ]

I don't think that returning the original type alias that the user used is the solution. One can store any arbitrary data in a "JSON" column, and one can store 5 in a "boolean" column. An alias is just that, an alias, it does not, by definition, carry any additional metadata, so it's wrong to make any conclusions based on what alias the user used. Not mentioning that aliases don't survive mysqldump|mysql.

If the connector needs to distinguish between a boolean and an integer, perhaps they should be two different types? How is it solved elsewhere?

Comment by Vladislav Vaintroub [ 2018-08-08 ]

All Connectors already distinguish between boolean and integer, for couple of decades. There is a lot of hacks, the extra parameter to the connector tinyInt1IsBoolean etc

And as of the alias.- maybe we should not discard "all" the information from the original DDL. And perhaps, "aliases" should survive mysqldump. This is all what this MDEV is about. User can help by adding check constraints.
A new type is a good solution in theory but not in practice - introduces protocol incompatibility with all connectors we do not own.

Comment by Martha C Henry [ 2018-08-25 ]

I don't know squat about all of the technical details for the connector that y'all are discussing.
Nor did I have any idea about the extra parameter to the connector tinyInt1IsBoolean, since unfortunately, in dbvisualizer, your connector does not display any driver properties!

Let me say this about the issue. I vote yes on somehow keeping the aliases in the DDL and mysqldump.
It is beyond irritating to lose your data type definition in COLUMNS.

If you are not going to keep the alias in the DDL, I would suggest getting rid of the alias entirely.
If it's not kept in DDL and mysqldump - why have it? No one knows that you specified it when examining the DB unless you put in a comment on a column to that effect (who reads the comments?)

Lucky me - I use ORM/Hibernate. So in my property definition for a database column I can specify boolean and json_array. That way, our programmers can't screw up the database by setting a bit or tinyint(1) column or json_array column to an invalid value.

In mysql, I usually use BIT for boolean - 0/1
I prefer to use set bit_column=false or set bit_column=true in the DB

Obviously, there is a lot of interpretation possible for Bit/TinyInt(1)/Boolean by the client if usage is boolean
strict: 0/1
interpretation: off/on, no/yes, false/true

    • It would be a lovely feature to add to connection parameter to allow interpretation of a Bit/Boolean/TinyInt(1) !

From what I can tell, mysql converts BIT to TINYINT(1), which is what you are doing in MariaDB with BOOLEAN
Yet, somehow BIT is retained as a data type in the DDL
Could you not do the same logic for BOOLEAN, as well as JSON?

2 mysql connector parameters

tinyInt1isBit Default: true
Should the driver treat the datatype TINYINT(1) as the BIT type (because the server silently converts BIT -> TINYINT(1) when creating tables)?

transformedBitIsBoolean Default: false
If the driver converts TINYINT(1) to a different type, should it use BOOLEAN instead of BIT for future compatibility with MySQL-5.0, as MySQL-5.0 has a BIT type?

Thank you to @diego.dupin for telling me about this thread

Comment by Diego Dupin [ 2018-09-13 ]

I correct my answer about what bar indicate for Geometry type :

For null value, connector doesn't know geometry type :
CREATE TEMPORARY TABLE gis_point_insert (g POINT);
INSERT INTO gis_point_insert VALUES (ST_PointFromText('POINT(10 10)'));
INSERT INTO gis_point_insert VALUES (ST_PointFromText('POINT( )'))
SELECT * FROM gis_point_insert;

Resultset :

first row :
1A 00 00 03 19 00 00 00  00 01 01 00 00 00 00 00     ................
00 00 00 00 24 40 00 00  00 00 00 00 24 40           ....$@......$@
 
second row : 
01 00 00 04 FB                                       .....
 
for first row :
1A 00 00 03 is packet header 
19 is data length
00 00 00  00 is SRID 
remaining bytes (01 01 00 00 00 00 00 ...) are in WKB format (second 01 indicate geometry type = POINT in this example)
 
for second row : 
01 00 00 04 is packet header
FB = NULL length => connector have no information about geometry type, just information about the NULL value. 

javascript standard is GeoJSON, and then expected response from driver is

{ type: "Point", coordinates: [] }

, not NULL, but connector doesn't know.

Comment by Vladislav Vaintroub [ 2018-09-14 ]

diego dupin, as for non-breaking change of the Column definition packet.
If we are very, non-breaking change should use 3 bytes for the datatype, because some connections could rely that "def" was always 3 bytes, and optimize access to schema member by skipping first 4 bytes at the start of packet. Not that I would endorse such connectors, but who knows.

Thus, "boo" == Boolean, "jso" = json

It could be renamed to type-hint from "catalog". I doubt think there will ever be both catalogs and schemas either in MySQL or MariaDB

Comment by Georg Richter [ 2018-10-23 ]

In the packet we have one byte which contains the length of integer values:

 int_LEC      length of fixed fields (=0xC)

How about to increase it's value to 0xE and send 2 bytes at the end of the packet? This would work also with older drivers, which would just ignore additional information.

Comment by Vladislav Vaintroub [ 2018-10-23 ]

@georg 2 bytes can not describe custom/user-defined datatypes ( we do not have any now, but we could in the future)

Comment by Garet Claborn [ 2022-11-16 ]

I very much agree the higher order type should be preserved as much as possible. There are numerous tools which I cannot use with MariaDB simply because JSON is stored as LONGTEXT and my own framework has no way of knowing the user stored JSON there. Very frustrating.

We resort to use a "types" table of 3 unsigned integers id, parent, pointer and 1 name

This allows us to

  • make arbitrarily complex mappings between fully namespaced types,
  • aliases of those types
  • pointer interpreted by the software system or the type itself
  • even compose a type whose children are a linked list or tree of imported types

If MariaDB were to gain user-defined types; feel free to use that haha. Adding UDT data to information schema and returning the higher-order type directly rather than the built-in, would allow the server to internally continue to use its own primitives

Comment by Vladislav Vaintroub [ 2022-11-16 ]

gclaborn, apparently, JSON already implemented, and type information is passed down.
MariaDB has an extensible type system, and server-side plugin API to add new data types (no it is not SQL)
Examples of new types are JSON, and there is UUID, and there is INET4/6. This was done in a way that preserves the protocol, and presents data types as strings or blobs to the down-level clients that are unaware of type info. If clients indicate they can read type info (i.e know this MariaDB functionality, the information is passed down to the client. Now you can complain about the drivers that are not using this information, but the datatypes are as true as it gets.

Comment by Martha C Henry [ 2022-11-16 ]

Vladislav Vaintroub - can you give a link to a driver that uses the information?

Comment by Vladislav Vaintroub [ 2022-11-17 ]

martiehen- a link to a driver that uses this information https://github.com/mariadb-corporation/mariadb-connector-nodejs . Why does this driver use this information - it wants to return JSON as JSON, and correct representations of subtypes of Geometry data (points, multilines etc). If you want to contribute to your driver, you can take a look how Capabilities.MARIADB_CLIENT_EXTENDED_TYPE_INFO is used in this driver.

The information can also be retrieved with MariaDB ConnectorC API , with the new API mariadb_field_attr() . It is exposed with the command line client, if you start it with --column-type-info switch.. In the example below . notice "BLOB(format=json)" in the output.

shell> mysql --column-type-info -uroot
...
MariaDB [test]> create table t(j json);
Query OK, 0 rows affected (0.006 sec)
 
MariaDB [test]> select * from t;
Field   1:  `j`
Org_field:  `j`
Catalog:    `def`
Database:   `test`
Table:      `t`
Org_table:  `t`
Type:       BLOB (format=json)
Collation:  utf8mb4_general_ci (45)

Comment by Martha C Henry [ 2022-11-17 ]

Awesome! Thank you! I will check it out

Generated at Thu Feb 08 08:22:17 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.