Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-17832

Protocol: extensions for Pluggable types and JSON, GEOMETRY

    XMLWordPrintable

    Details

      Description

      Connectors team is asking for a new way to distinguish real data types on the client side:

      • BOOLEAN vs TINYINT
      • JSON vs LONGEXT
      • GEOMETRY sub-types (such as POINT) vs just GEOMETRY

      This task is a proposal how we can implement this functionality without breaking backward compatibility much, and taking into account the SQL Standard requirements.

      Under terms of this task we'll:

      • Add a way to transfer additional result set metadata:
        • data type name
        • format name
      • Make INET6 and GEOMETRY sub-types (e.g. POINT, LINESTRING, etc) send the exact data type name
      • Make JSON send the format name

      The BOOLEAN data type is out of scope of this task and will be addressed separately.

      Transferring new extended result set metadata in the client-server protocol

      Clients supporting extended metadata will send a new capability during handshake:

      #define MARIADB_CLIENT_EXTENDED_TYPE_INFO (1ULL << 35)
      

      If the capability is set, the server will send an additional chunk in the result set metadata, after the "org_col_name" field:

          if (store(STRING_WITH_LEN("def"), cs, thd_charset) ||
              store_str(field.db_name, cs, thd_charset) ||
              store_str(field.table_name, cs, thd_charset) ||
              store_str(field.org_table_name, cs, thd_charset) ||
              store_str(field.col_name, cs, thd_charset) ||
              store_str(field.org_col_name, cs, thd_charset))
            return true;
          if (thd->client_capabilities & MARIADB_CLIENT_EXTENDED_TYPE_INFO)
          {
            if (store_str(field.extended_type_info.lex_cstring(), cs, thd_charset))
              return true;
          }
      

      The data inside the new chunk will consist of sub-chunks encoded as follows:

      • metatata type - 1 byte, e.g. data type name, format name
      • metadata sub-chunk length - 1 or 2 bytes, in net_store_length format
      • medatata sub-chunk value, according to #2

      Note, the format supports multiple sub-chunks in the metadata chunk for the future extensions, though under terms of this task we won't have multiple sub-chunks.

      For example, in case of INET6, the server will send the following extended metadata chunk:

      • 0x07 - the length of the extended metadata chunk
      • 0x01 - sub-chunk type meaning "data type name"
      • 0x05 - sub-chunk length, which is length if the string "inet6"
      • "inet6" - sub-chunk value

      This type of encoding allows to:

      • easily add new metadata sub-chunks in the future (e.g. constraints, domain names, etc)
      • iterate through sub-chunks quickly (e.g. skip sub-chunks that are not needed at the moment)

      C API additions

      In order to read the extended metadata in a C API program, let's add:

      • A new enum for the metadata sub-chunk types:

        enum mariadb_field_metadata_attr_t
        {
          MARIADB_FIELD_METADATA_DATA_TYPE_NAME= 1,
          MARIADB_FIELD_METADATA_FORMAT_NAME= 2
          // We'll add new metadata types later
        };
        

      • A new data type for const strings:

        typedef struct st_mariadb_metadata_string
        {
          const char *str;
          size_t length;
        } MARIADB_FIELD_METADATA_STRING;
        

        Note, alternatively, this can be a more generic name, say MARIADB_CSTRING.

      • A new API function, to extract metadata from MYSQL_FIELD:

        MARIADB_FIELD_METADATA_STRING
          mariadb_field_metadata_attr(const MYSQL_FIELD *field,
                                      enum mariadb_field_metadata_attr_t type);
        

      The new API function will return:

      • a NULL string {NULL,0}

        if the sub-chunk with the specified type does not exist in the extended metadata

      • a string with a non-zero pointer overwise

      Binary representation

      In the client-server version we'll add a new structure MA_FIELD_EXTENSION:

      typedef struct st_mariadb_field_extension
      {
        const char *type_info;         /* Added in MariaDB-10.5 */
        unsigned int type_info_length; /* Added in MariaDB-10.5 */
      } MA_FIELD_EXTENSION
      

      so the metadata chunk will be available as:

      • field->extension->type_info
      • field->extension->type_info_length

      In the embedded version we'll add metadata members directly to MYSQL_FIELD, so the metadata chunk will be available as:

      • field->type_info
      • field->type_info_length

      Note, the new function will hide these implementation details, so a C API program will look exactly the same for the client-server and the embedded version.

      The information below is here only for history purposes. All real implementation details are given above

      Update 2018-12-14:

      JSON cannot be a domain, for the following reasons:

      • JSON constructor functions distinguish TEXT vs JSON arguments:

      SELECT JSON_ARRAY(JSON_ARRAY('x')), JSON_ARRAY('["x"]');
      +-----------------------------+---------------------+
      | JSON_ARRAY(JSON_ARRAY('x')) | JSON_ARRAY('["x"]') |
      +-----------------------------+---------------------+
      | [["x"]]                     | ["[\"x\"]"]         |
      +-----------------------------+---------------------+
      

      Notice, different output.

      • One could create a new domain:

        CREATE DOMAIN json2 AS LONGTEXT CHECK VALID_JSON(value) AND LENGTH(value) <30;
        

        and this domain won't be seen as "json" on the client side any more.

      • Domains belong to a schema

      Update 2018-12-14#2:

      JSON cannot be a separate data type, because the standard has the FORMAT clause, e.g.:

      SELECT JSON_ARRAY('x') RETURNING VARCHAR(30) FORMAT JSON;
      SELECT JSON_ARRAY('x') RETURNING TEXT FORMAT JSON;
      

      We won't want to create a variety of data types: JSON_VARCHAR, JSON_TINYTEXT, JSON_TEXT, JSON_MEDIUMTEXT, JSON_LONGTEXT, etc.

      Constructing new data types in the SQL standard

      SQL standard proposes two features to build new data types:

      • Domains, which are simple subsets of existing data types. Domains can only have optional default values, default collations and check constraints.
      • User defined data types (UDT). UTDs are more complex than domains and can specify things like storage type and cast functions (to cast from/to the supertype, as well as to cast from/to an artitrary data type using CREATE CAST statements).

      The most interesting grammar part for these features looks like this:

      <domain definition> ::=
        CREATE DOMAIN <domain name> [ AS ] <predefined type>
        [ <default clause> ]
        [ <domain constraint>... ]
        [ <collate clause> ]
       
      <domain constraint> ::=
        [ <constraint name definition> ]
        <check constraint definition>
        [<constraint characteristics> ]
      

      <user-defined type definition> ::= CREATE TYPE <user-defined type body>
       
      <user-defined type body> ::=
        <schema-resolved user-defined type name>
        [ <subtype clause> ]
        [ AS <representation> ]
        [ <user-defined type option list> ]
        [ <method specification list> ]
       
      <user-defined type option list> ::=
        <user-defined type option> [ <user-defined type option>... ]
       
      <user-defined type option> ::=
        <instantiable clause>
      | <finality>
      | <reference type specification>
      | <cast to ref>
      | <cast to type>
      | <cast to distinct>
      | <cast to source>
       
      <subtype clause> ::= UNDER <supertype name>
      <supertype name> ::= <path-resolved user-defined type name>
       
      <representation> ::= <predefined type> | <collection type> | <member list>
      

      <user-defined cast definition> ::=
        CREATE CAST <left paren> <source data type> AS <target data type> <right paren>
        WITH <cast function>
       
      <user-defined ordering definition> ::=
        CREATE ORDERING FOR <schema-resolved user-defined type name> <ordering form>
      
      

      MariaDB data types that need attention

      BOOLEAN

      BOOLEAN is currently translated to TINYINT(1).
      However, by all means BOOLEAN looks like a separate data type, as it needs its own conversion functions, e.g.:

      • CAST(true_bool AS CHAR) should return a string "true", while
      • CAST(true_bool AS SIGNED) should return "1".

      Domains cannot do this. Moreover, the SQL standard has a distinct predefined BOOLEAN data type. MariaDB's BOOLEAN should be changed to a separate data type, with all distinct MariaDB data type features such as its own type code MYSQL_TYPE_BOOLEAN. All clients should be fixed to understand the new type code.

      JSON

      JSON looks like a domain under LONGTEXT, as it does not need its own CASTs and ORDERINGs. In terms of the SQL standard it could be defined as:

      CREATE DOMAIN json
        AS LONGTEXT CHARACTER SET utf8mb4
        CHECK (JSON_VALID(value));
      

      GEOMETRY sub-types

      GEOMETRY sub-types (such as POINT, LINESTRING, POLYGON, etc) also look like domains under GEOMETRY. They could also be defined as follows:

      CREATE DOMAIN point
        AS GEOMETRY
        CHECK (ST_GeometryType(value)='POINT');
      

      In long terms, we'll implement JSON and GEOMETRY sub-types as domains.
      However, this needs the "CREATE DOMAIN" statement and all related infrastructure, which is a lot of work.

      We want JSON and GEOMETRY sub-types to return "real type" to the client urgently. So we'll implement JSON and GEOMETRY sub-types as built-in domains for now, but in the way that they look and feel like normal SQL-standard domains.

      Note, GEOMETRY sub-types already have built-in constraint checks:

      CREATE OR REPLACE TABLE t1 (a LINESTRING);
      INSERT INTO t1 VALUES (GeomFromText('POINT(1 1)'));
      

      ERROR 1366 (22007): Incorrect LINESTRING value: 'POINT' for column 'a' at row 1
      

      We won't change this.

      JSON, however, is a simple alias to LONGTEXT and it does not have constraints. We'll change JSON by adding an implicit constraint check, so this script also returns an error, approximately like this:

      CREATE OR REPLACE TABLE t1 (a JSON);
      INSERT INTO t1 VALUES ('garbage');
      

      ERROR 4025 (23000): CONSTRAINT `JSON_VALID` failed for `test`.`t1`
      

      CAST

      We'll allow CAST to BOOLEAN:

      SELECT CAST(int_column AS BOOLEAN) FROM t1;
      

      Also, the SQL standard allows to cast to a domain:

      <cast target> ::= 
        <domain name>
      | <data type>
      

      We'll allow cast to JSON and GEOMETRY sub-types, e.g.:

      SELECT CAST(text_column AS JSON) FROM t1;
      SELECT CAST(geom_column AS POINT) FROM t2;
      

      Concatenation

      The SQL standard determines the return type of <concatenation> according to “Result of data type combinations”, which treats domains as their base types.
      The following concatenations:

      SELECT CONCAT(json_column1, json_column2) FROM t1;
      SELECT CONCAT(json_column1, text_column1) FROM t1;
      

      will return LONGTEXT.

      The following concatenations:

      SELECT CONCAT(point_column1, point_column2) FROM t1;
      SELECT CONCAT(point_column1, binary_column1) FROM t1;
      

      will returns VARBINARY or a BLOB variant (depending on lengths).

      CASE and CASE abbreviations

      This section applies to CASE and its all abbreviations: COALESCE, IFNULL, IF, NVL2, NULLIF, Oracle's DECODE.

      The SQL standard determines the return type of <case expression> according to “Result of data type combinations”, which treats domains as their base types.

      The following expressions:

      SELECT CASE WHEN x THEN json_column1 ELSE json_column2 END FROM t1;
      SELECT COALESCE(json_column1, json_column2) FROM t1;
      

      should return LONGTEXT (if we treat JSON as a domain).
      We can consider returning JSON instead.

      The following expressions:

      SELECT CASE WHEN x THEN point_column1 ELSE point_column2 END FROM t1;
      SELECT COALESCE(point_column1, point_column2) FROM t1;
      

      should return GEOMETRY (if we treat POINT as a domain).
      We can consider returning POINT instead.

      TODO: decide what CASE will return in MariaDB if all input arguments are of the same domain: the domain, or the domain base type.

      RETURNS clause

      The SQL standard does not allow using domains in the RETURNS clause:

      <returns clause> ::= RETURNS <returns type>
       
      <returns type> ::=
        <returns data type>  [ <result cast>  ]
      | <returns table type>
       
      <returns data type> ::= <data type> [ <locator indication>  ]
      

      We can probably allow using domains.

      TODO: decide if we allow domains in RETURNS clause.

      FRM

      The server already stores geometry type in FRM files, as one byte, according to this enum definition:

        enum geometry_type
        {
          GEOM_GEOMETRY = 0, GEOM_POINT = 1, GEOM_LINESTRING = 2, GEOM_POLYGON = 3,
          GEOM_MULTIPOINT = 4, GEOM_MULTILINESTRING = 5, GEOM_MULTIPOLYGON = 6,
          GEOM_GEOMETRYCOLLECTION = 7
        };
      

      We won't change this. GEOMETRY sub-types will stay in FRM as they are, for compatibility purposes.

      We'll add a new way to store domain information into FRM files, as a text string representing the domain name.
      In case of JSON, we'll store the type code MYSQL_TYPE_LONGTEXT into its regular place in the FRM file, and will also store the string "JSON" into an extended FRM slot.

      Note, to store domain names, we can probably reuse the same FRM slot where GEOMETRY stores extended information like precision, scale, srid, storage model.

      Upgrade

      GEOMETRY sub-types will not need any special upgrade procedure.

      Old tables that were created using JSON in their column definitions will stay LONGTEXT after upgrade.
      To start JSON columns report themself as JSON rather than LONGTEXT, users will have to recreate or ALTER their tables, e.g.:

      ALTER TABLE t1 MODIFY j JSON;
      

      SHOW CREATE TABLE

      For columns that were created using a domain name rather than a data type, SHOW CREATE TABLE will display domain names rather than base predefined data types for these domains.

      This is already true for GEOMETRY sub-types. We'll change JSON accordingly, so this script displays JSON:

      CREATE TABLE t1 (a JSON);
      SHOW CREATE TABLE t1;
      

      +-------+---------------------------------------------------------------------------------------------------------------------------------------+
      | Table | Create Table                                                                                                                          |
      +-------+---------------------------------------------------------------------------------------------------------------------------------------+
      | t1    | CREATE TABLE `t1` (
        `a` JSON DEFAULT NULL,
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1 |
      +-------+---------------------------------------------------------------------------------------------------------------------------------------+
      

      Note, SHOW CREATE TABLE will not display any constraints like this:

      CHECK (VALID_JASON(a))
      

      because this constraint belongs to the domain JSON, not to the column `a`.

      This will conform the SQL standard: the CREATE TABLE statement expects either a data type or a domain name in column definitions:

      <column definition> ::= <column name> [ <data type or domain name> ]
      

      so displaying a domain name in any automatically generated CREATE TABLE statements is fine.

      mysqldump

      mysqldump uses SHOW CREATE TABLE. So tables with the JSON domain will automatically start dump with "JSON" rather than "LONGTEXT" in their column definitions.

      Domains and INFORMATION_SCHEMA.COLUMNS

      As of SQL-2014, the INFORMATION_SCHEMA.COLUMNS metadata view has the following domain related columns (going after the COLLATION_NAME column):

      • DOMAIN_CATALOG SQL_IDENTIFIER
      • DOMAIN_SCHEMA SQL_IDENTIFIER
      • DOMAIN_NAME SQL_IDENTIFIER

      We won't need SQL catalogs in the near future, so we can skip DOMAIN_CATALOG for now.

      We'll add only DOMAIN_SCHEMA and DOMAIN_NAME, so "SHOW CREATE TABLE COLUMNS" will display the following output:

       CREATE TEMPORARY TABLE `COLUMNS` (
        `TABLE_CATALOG` varchar(512) NOT NULL DEFAULT '',
        `TABLE_SCHEMA` varchar(64) NOT NULL DEFAULT '',
        `TABLE_NAME` varchar(64) NOT NULL DEFAULT '',
        `COLUMN_NAME` varchar(64) NOT NULL DEFAULT '',
        `ORDINAL_POSITION` bigint(21) unsigned NOT NULL DEFAULT 0,
        `COLUMN_DEFAULT` longtext DEFAULT NULL,
        `IS_NULLABLE` varchar(3) NOT NULL DEFAULT '',
        `DATA_TYPE` varchar(64) NOT NULL DEFAULT '',
        `CHARACTER_MAXIMUM_LENGTH` bigint(21) unsigned DEFAULT NULL,
        `CHARACTER_OCTET_LENGTH` bigint(21) unsigned DEFAULT NULL,
        `NUMERIC_PRECISION` bigint(21) unsigned DEFAULT NULL,
        `NUMERIC_SCALE` bigint(21) unsigned DEFAULT NULL,
        `DATETIME_PRECISION` bigint(21) unsigned DEFAULT NULL,
        `CHARACTER_SET_NAME` varchar(32) DEFAULT NULL,
        `COLLATION_NAME` varchar(32) DEFAULT NULL,
        `DOMAIN_SCHEMA` varchar(64) NOT NULL DEFAULT '',    <--- new column
        `DOMAIN_NAME` varchar(64) NOT NULL DEFAULT '',      <--- new column
        `COLUMN_TYPE` longtext NOT NULL DEFAULT '',
        `COLUMN_KEY` varchar(3) NOT NULL DEFAULT '',
        `EXTRA` varchar(30) NOT NULL DEFAULT '',
        `PRIVILEGES` varchar(80) NOT NULL DEFAULT '',
        `COLUMN_COMMENT` varchar(1024) NOT NULL DEFAULT '',
        `IS_GENERATED` varchar(6) NOT NULL DEFAULT '',
        `GENERATION_EXPRESSION` longtext DEFAULT NULL
      ) ENGINE=Aria DEFAULT CHARSET=utf8 PAGE_CHECKSUM=0 
      

      Other domain related INFORMATION_SCHEMA tables

      As as SQL-2014, INFORMATION_SCHEMA has the following views that refer to domain names:

      • DOMAINS
      • COLUMN_DOMAIN_USAGE
      • DATA_TYPE_PRIVILEGES
      • DOMAIN_CONSTRAINSTS
      • USAGE_PRIVILEGES

      We have non of these views. We won't add these tables for now. We can add these tables later, when we'll be working on implementing the CREATE DOMAIN" statement.

      Domains and DESCRIBE

      Geometry sub-types do not need any changes.

      We'll change DESCRIBE to display domain names when applicable (i.e. when a column was created using a doman name rather than a data type), so this script will display JSON:

      CREATE OR REPLACE TABLE t1 (a JSON);
      DESCRIBE t1;
      

      +-------+------+------+-----+---------+-------+
      | Field | Type | Null | Key | Default | Extra |
      +-------+------+------+-----+---------+-------+
      | a     | JSON | YES  |     | NULL    |       |
      +-------+------+------+-----+---------+-------+
      

      Domains and client-server protocol

      The SQL standard (as of SQL-2014) does not seem to send domain information in the result set metadata.
      Clients only see domain's base data types (e.g. LONGTEXT in case of JSON).

      See ``Table 6 — Fields in SQL/CLI row and parameter descriptor areas''.

      However, the standard provides a way to send implementation-defined item fields. We'll send domain names as such implementation defined field.

      We'll add a new client capability for domain names.
      Old clients we'll not see domain names, so they'll still see JSON as LONGTEXT and GEOMETRY sub-types as just GEOMETRY.

      If the client during hand-shake reports that it understands domain names, the server will send:

      • the domain base data type in its usual place (e.g. MYSQL_TYPE_LONGTEXT in case of JSON)
      • the domain name in string format, in a new slot.

      TODO: discuss (with the connector team) the exact specifications how the server will send domain names in the client server protocol.

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              bar Alexander Barkov
              Reporter:
              bar Alexander Barkov
              Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: