Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-25374

XML output does not include character encoding in <?xml?> declaration

Details

    • Bug
    • Status: Open (View Workflow)
    • Minor
    • Resolution: Unresolved
    • 10.4.17
    • None
    • Scripts & Clients
    • openSUSE 15.2

      libmariadb3-3.1.12-lp152.7.1.x86_64
      mariadb-errormessages-10.4.17-lp152.2.8.1.noarch
      mariadb-10.4.17-lp152.2.8.1.x86_64
      mariadb-tools-10.4.17-lp152.2.8.1.x86_64
      mariadb-client-10.4.17-lp152.2.8.1.x86_64

    Description

      When invoking mysql --xml from the command line, the first part of the output is:

      <?xml version="1.0"?>
       
      <resultset ...
      

      Note there is no encoding specified in the XML declaration.

      As long as mysql is outputting UTF-8, this is technically correct because UTF-8 is the default (discussion here).

      However, to avoid any possible confusion, perhaps the XML declaration should include an explicit declaration of the character encoding, e.g.:

      <?xml version="1.0" encoding="UTF-8"?>
       
      <resultset ...
      

      Just a thought...

      Attachments

        Activity

          haidong Haidong Ji added a comment -

          This seems to be an easy fix. I'd like to work on it.

          I have created a fork (https://github.com/haidong/server) and started digging around. It looks line 3829 under client/mysql.cc is where the problem. See snippet below.

          // Line 3829
          static void
          print_table_data_xml(MYSQL_RES *result)
          {  
            MYSQL_ROW   cur;
            MYSQL_FIELD *fields;
             
            mysql_field_seek(result,0);
             
            tee_fputs("<?xml version=\"1.0\"?>\n\n<resultset statement=\"", PAGER);          // I think this is our main problem!                                                                                                                                                                    
            xmlencode_print(glob_buffer.ptr(), (int)strlen(glob_buffer.ptr()));
            tee_fputs("\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">",
                      PAGER);
          

          Additionally, there are a few places, particularly files in tests, that could also benefit from this small change. For example, mysql-test/main/client_xml.result file has many places where this encoding info might be helpful.

          I have read the contribution guide. I'll start hacking against my own forked repo. I'd appreciate any steps you want me to follow and suggestions/tips you can give.

          Thanks!

          haidong Haidong Ji added a comment - This seems to be an easy fix. I'd like to work on it. I have created a fork ( https://github.com/haidong/server ) and started digging around. It looks line 3829 under client/mysql.cc is where the problem. See snippet below. // Line 3829 static void print_table_data_xml(MYSQL_RES *result) { MYSQL_ROW cur; MYSQL_FIELD *fields; mysql_field_seek(result,0); tee_fputs("<?xml version=\"1.0\"?>\n\n<resultset statement=\"", PAGER); // I think this is our main problem! xmlencode_print(glob_buffer.ptr(), (int)strlen(glob_buffer.ptr())); tee_fputs("\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">", PAGER); Additionally, there are a few places, particularly files in tests, that could also benefit from this small change. For example, mysql-test/main/client_xml.result file has many places where this encoding info might be helpful. I have read the contribution guide. I'll start hacking against my own forked repo. I'd appreciate any steps you want me to follow and suggestions/tips you can give. Thanks!
          haidong Haidong Ji added a comment -

          After further reading, I'm not sure if this is actually a bug. In fact, I wonder adding it may actually do more harm than good.

          1. It looks for XML 1.0, encoding info in header is not required. See this stackoverflow article:
          https://stackoverflow.com/questions/7007427/does-a-valid-xml-file-require-an-xml-declaration

          2. Checking the man page for mysql client, it says "The output when --xml is used with mysql matches that of mysqldump --xml. See mysqldump(1) for details." Looking at man page for mysqldump xml section, I suspect the encoding info is perhaps omitted on purpose. Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong.

          So I don't plan to work on this now. I'll probably pick other beginner-friendly issues to work on. If you have suggestions on what I could work on, that'd be good too. Thanks!

          haidong Haidong Ji added a comment - After further reading, I'm not sure if this is actually a bug. In fact, I wonder adding it may actually do more harm than good. 1. It looks for XML 1.0, encoding info in header is not required . See this stackoverflow article: https://stackoverflow.com/questions/7007427/does-a-valid-xml-file-require-an-xml-declaration 2. Checking the man page for mysql client, it says "The output when --xml is used with mysql matches that of mysqldump --xml. See mysqldump(1) for details." Looking at man page for mysqldump xml section, I suspect the encoding info is perhaps omitted on purpose . Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong. So I don't plan to work on this now. I'll probably pick other beginner-friendly issues to work on. If you have suggestions on what I could work on, that'd be good too. Thanks!

          Good point. I think that encoding, if at all specified, should be not hard-coded UTF-8, but should actually match the encoding of the following data. mysql client knows what encoding it uses, so, technically, it could be possible to put it into the xml declaration. But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name,

          serg Sergei Golubchik added a comment - Good point. I think that encoding, if at all specified, should be not hard-coded UTF-8, but should actually match the encoding of the following data. mysql client knows what encoding it uses, so, technically, it could be possible to put it into the xml declaration. But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name,
          archie172 Archie Cobbs added a comment -

          Some comments...

          First, The <?xml?> declaration is indeed optional, but it's always better to have it than to not have. I think we can all agree on this. Newer versions of the XML spec make it mandatory.

          Secondly, character encoding used in the file must match what is declared in the <?xml?> (or implicitly declared by omission: if there is no encoding specified, then the encoding is being implicitly declared as UTF-8 or UTF-16). More info on encoding is here.

          Remember "XML" is a well-defined thing. If you're going to claim to output "XML" then you have to follow the XML rules. Note that a "character" in XML is by definition a Unicode character.

          In particular this statement worries me:

          Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong.

          This is a reasonable thing to do but only if it's consistent with the <?xml?> declaration. For example, it would be totally wrong to output an <?xml?> declaration encoding specified as UTF-8, yet encode the characters as UTF-16.

          If you want to output UTF-16 that's fine but you must then include a byte-order mark (see reference above).

          But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name

          Yes - the encoding names must follow what is allowed for XML. What MySQL likes to call them is irrelevant.

          So to summarize this:

          • Including the <?xml?> declaration is a good thing and we should keep doing it
          • As long as the <?xml?> declaration does not specify an encoding and there's no byte-order mark, then the file must be encoded in UTF-8
          • Using a different encoding, e.g., same as the database itself, is reasonable, but if this is done, the <?xml?> declaration must explicitly specify that encoding using the proper "XML" name for it

          Questions:

          • Since currently there is no explicit encoding declared, can someone verify that MySQL currently does actually output XML using UTF-8 encoding, regardless of the encoding of the database?
          • Are there any MySQL-supported encodings that can't 100% be mapped into UTF-8? If so then they would have to be output using a different encoding to preserve all the characters.
          archie172 Archie Cobbs added a comment - Some comments... First, The <?xml?> declaration is indeed optional, but it's always better to have it than to not have. I think we can all agree on this. Newer versions of the XML spec make it mandatory. Secondly, character encoding used in the file must match what is declared in the <?xml?> (or implicitly declared by omission: if there is no encoding specified, then the encoding is being implicitly declared as UTF-8 or UTF-16). More info on encoding is here . Remember "XML" is a well-defined thing. If you're going to claim to output "XML" then you have to follow the XML rules. Note that a "character" in XML is by definition a Unicode character. In particular this statement worries me: Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong. This is a reasonable thing to do but only if it's consistent with the <?xml?> declaration. For example, it would be totally wrong to output an <?xml?> declaration encoding specified as UTF-8, yet encode the characters as UTF-16. If you want to output UTF-16 that's fine but you must then include a byte-order mark (see reference above). But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name Yes - the encoding names must follow what is allowed for XML. What MySQL likes to call them is irrelevant. So to summarize this: Including the <?xml?> declaration is a good thing and we should keep doing it As long as the <?xml?> declaration does not specify an encoding and there's no byte-order mark, then the file must be encoded in UTF-8 Using a different encoding, e.g., same as the database itself, is reasonable, but if this is done, the <?xml?> declaration must explicitly specify that encoding using the proper "XML" name for it Questions: Since currently there is no explicit encoding declared, can someone verify that MySQL currently does actually output XML using UTF-8 encoding, regardless of the encoding of the database? Are there any MySQL-supported encodings that can't 100% be mapped into UTF-8? If so then they would have to be output using a different encoding to preserve all the characters.

          Generally the dump is a mix of different character sets. Most of it is done in the --default-character-set value, which is utf8mb4 by default. But stored routines and alike (triggers, events) are dumped in their own body charsets.

          I'd think users wouldn't want stored routines and triggers in the xml, and would disable them for xml output, but it's technically possible to make mysqldump to generate an xml which has different parts in different encodings.

          serg Sergei Golubchik added a comment - Generally the dump is a mix of different character sets. Most of it is done in the --default-character-set value, which is utf8mb4 by default. But stored routines and alike (triggers, events) are dumped in their own body charsets. I'd think users wouldn't want stored routines and triggers in the xml, and would disable them for xml output, but it's technically possible to make mysqldump to generate an xml which has different parts in different encodings.
          archie172 Archie Cobbs added a comment -

          Hmm, well if it's possible for MySQL to output something that looks like XML but contains characters that are not encoded consistently with the <?xml?> declaration, then it's not outputting XML, it's outputting something else, in which case MySQL should be fixed to conform to the standard, or else stop calling the output "XML".

          I'm not familiar with the internals, but from a coding point of view there may be a missing component. E.g., each MySQL encoding is going to need to know how to convert its character data to UTF-8 or UTF-16 (or perhaps throw an error if not possible). Then this layer would be applied before writing the XML output, etc.

          archie172 Archie Cobbs added a comment - Hmm, well if it's possible for MySQL to output something that looks like XML but contains characters that are not encoded consistently with the <?xml?> declaration, then it's not outputting XML, it's outputting something else, in which case MySQL should be fixed to conform to the standard, or else stop calling the output "XML". I'm not familiar with the internals, but from a coding point of view there may be a missing component. E.g., each MySQL encoding is going to need to know how to convert its character data to UTF-8 or UTF-16 (or perhaps throw an error if not possible). Then this layer would be applied before writing the XML output, etc.

          People

            Unassigned Unassigned
            archie172 Archie Cobbs
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.