Connectors team is asking for a new way to distinguish real data types on the client side:
- BOOLEAN vs TINYINT
- JSON vs LONGEXT
- GEOMETRY sub-types (such as POINT) vs just GEOMETRY
This task is a proposal how we can implement this functionality without breaking backward compatibility much, and taking into account the SQL Standard requirements.
Under terms of this task we'll:
- Add a way to transfer additional result set metadata:
- data type name
- format name
- Make INET6 and GEOMETRY sub-types (e.g. POINT, LINESTRING, etc) send the exact data type name
- Make JSON send the format name
The BOOLEAN data type is out of scope of this task and will be addressed separately.
Clients supporting extended metadata will send a new capability during handshake:
If the capability is set, the server will send an additional chunk in the result set metadata, after the "org_col_name" field:
The data inside the new chunk will consist of sub-chunks encoded as follows:
- metatata type - 1 byte, e.g. data type name, format name
- metadata sub-chunk length - 1 or 2 bytes, in net_store_length format
- medatata sub-chunk value, according to #2
Note, the format supports multiple sub-chunks in the metadata chunk for the future extensions, though under terms of this task we won't have multiple sub-chunks.
For example, in case of INET6, the server will send the following extended metadata chunk:
- 0x07 - the length of the extended metadata chunk
- 0x01 - sub-chunk type meaning "data type name"
- 0x05 - sub-chunk length, which is length if the string "inet6"
- "inet6" - sub-chunk value
This type of encoding allows to:
- easily add new metadata sub-chunks in the future (e.g. constraints, domain names, etc)
- iterate through sub-chunks quickly (e.g. skip sub-chunks that are not needed at the moment)
In order to read the extended metadata in a C API program, let's add:
- A new enum for the metadata sub-chunk types:
- A new data type for const strings:
Note, alternatively, this can be a more generic name, say MARIADB_CSTRING.
- A new API function, to extract metadata from MYSQL_FIELD:
The new API function will return:
- a NULL string
if the sub-chunk with the specified type does not exist in the extended metadata
- a string with a non-zero pointer overwise
In the client-server version we'll add a new structure MA_FIELD_EXTENSION:
so the metadata chunk will be available as:
In the embedded version we'll add metadata members directly to MYSQL_FIELD, so the metadata chunk will be available as:
Note, the new function will hide these implementation details, so a C API program will look exactly the same for the client-server and the embedded version.
The information below is here only for history purposes. All real implementation details are given above
JSON cannot be a domain, for the following reasons:
- JSON constructor functions distinguish TEXT vs JSON arguments:
Notice, different output.
- One could create a new domain:
and this domain won't be seen as "json" on the client side any more.
- Domains belong to a schema
JSON cannot be a separate data type, because the standard has the FORMAT clause, e.g.:
We won't want to create a variety of data types: JSON_VARCHAR, JSON_TINYTEXT, JSON_TEXT, JSON_MEDIUMTEXT, JSON_LONGTEXT, etc.
SQL standard proposes two features to build new data types:
- Domains, which are simple subsets of existing data types. Domains can only have optional default values, default collations and check constraints.
- User defined data types (UDT). UTDs are more complex than domains and can specify things like storage type and cast functions (to cast from/to the supertype, as well as to cast from/to an artitrary data type using CREATE CAST statements).
The most interesting grammar part for these features looks like this:
BOOLEAN is currently translated to TINYINT(1).
However, by all means BOOLEAN looks like a separate data type, as it needs its own conversion functions, e.g.:
- CAST(true_bool AS CHAR) should return a string "true", while
- CAST(true_bool AS SIGNED) should return "1".
Domains cannot do this. Moreover, the SQL standard has a distinct predefined BOOLEAN data type. MariaDB's BOOLEAN should be changed to a separate data type, with all distinct MariaDB data type features such as its own type code MYSQL_TYPE_BOOLEAN. All clients should be fixed to understand the new type code.
JSON looks like a domain under LONGTEXT, as it does not need its own CASTs and ORDERINGs. In terms of the SQL standard it could be defined as:
GEOMETRY sub-types (such as POINT, LINESTRING, POLYGON, etc) also look like domains under GEOMETRY. They could also be defined as follows:
In long terms, we'll implement JSON and GEOMETRY sub-types as domains.
However, this needs the "CREATE DOMAIN" statement and all related infrastructure, which is a lot of work.
We want JSON and GEOMETRY sub-types to return "real type" to the client urgently. So we'll implement JSON and GEOMETRY sub-types as built-in domains for now, but in the way that they look and feel like normal SQL-standard domains.
Note, GEOMETRY sub-types already have built-in constraint checks:
We won't change this.
JSON, however, is a simple alias to LONGTEXT and it does not have constraints. We'll change JSON by adding an implicit constraint check, so this script also returns an error, approximately like this:
We'll allow CAST to BOOLEAN:
Also, the SQL standard allows to cast to a domain:
We'll allow cast to JSON and GEOMETRY sub-types, e.g.:
The SQL standard determines the return type of <concatenation> according to “Result of data type combinations”, which treats domains as their base types.
The following concatenations:
will return LONGTEXT.
The following concatenations:
will returns VARBINARY or a BLOB variant (depending on lengths).
This section applies to CASE and its all abbreviations: COALESCE, IFNULL, IF, NVL2, NULLIF, Oracle's DECODE.
The SQL standard determines the return type of <case expression> according to “Result of data type combinations”, which treats domains as their base types.
The following expressions:
should return LONGTEXT (if we treat JSON as a domain).
We can consider returning JSON instead.
The following expressions:
should return GEOMETRY (if we treat POINT as a domain).
We can consider returning POINT instead.
TODO: decide what CASE will return in MariaDB if all input arguments are of the same domain: the domain, or the domain base type.
The SQL standard does not allow using domains in the RETURNS clause:
We can probably allow using domains.
TODO: decide if we allow domains in RETURNS clause.
The server already stores geometry type in FRM files, as one byte, according to this enum definition:
We won't change this. GEOMETRY sub-types will stay in FRM as they are, for compatibility purposes.
We'll add a new way to store domain information into FRM files, as a text string representing the domain name.
In case of JSON, we'll store the type code MYSQL_TYPE_LONGTEXT into its regular place in the FRM file, and will also store the string "JSON" into an extended FRM slot.
Note, to store domain names, we can probably reuse the same FRM slot where GEOMETRY stores extended information like precision, scale, srid, storage model.
GEOMETRY sub-types will not need any special upgrade procedure.
Old tables that were created using JSON in their column definitions will stay LONGTEXT after upgrade.
To start JSON columns report themself as JSON rather than LONGTEXT, users will have to recreate or ALTER their tables, e.g.:
For columns that were created using a domain name rather than a data type, SHOW CREATE TABLE will display domain names rather than base predefined data types for these domains.
This is already true for GEOMETRY sub-types. We'll change JSON accordingly, so this script displays JSON:
Note, SHOW CREATE TABLE will not display any constraints like this:
because this constraint belongs to the domain JSON, not to the column `a`.
This will conform the SQL standard: the CREATE TABLE statement expects either a data type or a domain name in column definitions:
so displaying a domain name in any automatically generated CREATE TABLE statements is fine.
mysqldump uses SHOW CREATE TABLE. So tables with the JSON domain will automatically start dump with "JSON" rather than "LONGTEXT" in their column definitions.
As of SQL-2014, the INFORMATION_SCHEMA.COLUMNS metadata view has the following domain related columns (going after the COLLATION_NAME column):
- DOMAIN_CATALOG SQL_IDENTIFIER
- DOMAIN_SCHEMA SQL_IDENTIFIER
- DOMAIN_NAME SQL_IDENTIFIER
We won't need SQL catalogs in the near future, so we can skip DOMAIN_CATALOG for now.
We'll add only DOMAIN_SCHEMA and DOMAIN_NAME, so "SHOW CREATE TABLE COLUMNS" will display the following output:
As as SQL-2014, INFORMATION_SCHEMA has the following views that refer to domain names:
We have non of these views. We won't add these tables for now. We can add these tables later, when we'll be working on implementing the CREATE DOMAIN" statement.
Geometry sub-types do not need any changes.
We'll change DESCRIBE to display domain names when applicable (i.e. when a column was created using a doman name rather than a data type), so this script will display JSON:
The SQL standard (as of SQL-2014) does not seem to send domain information in the result set metadata.
Clients only see domain's base data types (e.g. LONGTEXT in case of JSON).
See ``Table 6 — Fields in SQL/CLI row and parameter descriptor areas''.
However, the standard provides a way to send implementation-defined item fields. We'll send domain names as such implementation defined field.
We'll add a new client capability for domain names.
Old clients we'll not see domain names, so they'll still see JSON as LONGTEXT and GEOMETRY sub-types as just GEOMETRY.
If the client during hand-shake reports that it understands domain names, the server will send:
- the domain base data type in its usual place (e.g. MYSQL_TYPE_LONGTEXT in case of JSON)
- the domain name in string format, in a new slot.
TODO: discuss (with the connector team) the exact specifications how the server will send domain names in the client server protocol.