[MDEV-25374] XML output does not include character encoding in <?xml?> declaration Created: 2021-04-08  Updated: 2021-05-13

Status: Open
Project: MariaDB Server
Component/s: Scripts & Clients
Affects Version/s: 10.4.17
Fix Version/s: None

Type: Bug Priority: Minor
Reporter: Archie Cobbs Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: beginner-friendly
Environment:

openSUSE 15.2

libmariadb3-3.1.12-lp152.7.1.x86_64
mariadb-errormessages-10.4.17-lp152.2.8.1.noarch
mariadb-10.4.17-lp152.2.8.1.x86_64
mariadb-tools-10.4.17-lp152.2.8.1.x86_64
mariadb-client-10.4.17-lp152.2.8.1.x86_64



 Description   

When invoking mysql --xml from the command line, the first part of the output is:

<?xml version="1.0"?>
 
<resultset ...

Note there is no encoding specified in the XML declaration.

As long as mysql is outputting UTF-8, this is technically correct because UTF-8 is the default (discussion here).

However, to avoid any possible confusion, perhaps the XML declaration should include an explicit declaration of the character encoding, e.g.:

<?xml version="1.0" encoding="UTF-8"?>
 
<resultset ...

Just a thought...



 Comments   
Comment by Haidong Ji [ 2021-05-01 ]

This seems to be an easy fix. I'd like to work on it.

I have created a fork (https://github.com/haidong/server) and started digging around. It looks line 3829 under client/mysql.cc is where the problem. See snippet below.

// Line 3829
static void
print_table_data_xml(MYSQL_RES *result)
{  
  MYSQL_ROW   cur;
  MYSQL_FIELD *fields;
   
  mysql_field_seek(result,0);
   
  tee_fputs("<?xml version=\"1.0\"?>\n\n<resultset statement=\"", PAGER);          // I think this is our main problem!                                                                                                                                                                    
  xmlencode_print(glob_buffer.ptr(), (int)strlen(glob_buffer.ptr()));
  tee_fputs("\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">",
            PAGER);

Additionally, there are a few places, particularly files in tests, that could also benefit from this small change. For example, mysql-test/main/client_xml.result file has many places where this encoding info might be helpful.

I have read the contribution guide. I'll start hacking against my own forked repo. I'd appreciate any steps you want me to follow and suggestions/tips you can give.

Thanks!

Comment by Haidong Ji [ 2021-05-01 ]

After further reading, I'm not sure if this is actually a bug. In fact, I wonder adding it may actually do more harm than good.

1. It looks for XML 1.0, encoding info in header is not required. See this stackoverflow article:
https://stackoverflow.com/questions/7007427/does-a-valid-xml-file-require-an-xml-declaration

2. Checking the man page for mysql client, it says "The output when --xml is used with mysql matches that of mysqldump --xml. See mysqldump(1) for details." Looking at man page for mysqldump xml section, I suspect the encoding info is perhaps omitted on purpose. Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong.

So I don't plan to work on this now. I'll probably pick other beginner-friendly issues to work on. If you have suggestions on what I could work on, that'd be good too. Thanks!

Comment by Sergei Golubchik [ 2021-05-05 ]

Good point. I think that encoding, if at all specified, should be not hard-coded UTF-8, but should actually match the encoding of the following data. mysql client knows what encoding it uses, so, technically, it could be possible to put it into the xml declaration. But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name,

Comment by Archie Cobbs [ 2021-05-05 ]

Some comments...

First, The <?xml?> declaration is indeed optional, but it's always better to have it than to not have. I think we can all agree on this. Newer versions of the XML spec make it mandatory.

Secondly, character encoding used in the file must match what is declared in the <?xml?> (or implicitly declared by omission: if there is no encoding specified, then the encoding is being implicitly declared as UTF-8 or UTF-16). More info on encoding is here.

Remember "XML" is a well-defined thing. If you're going to claim to output "XML" then you have to follow the XML rules. Note that a "character" in XML is by definition a Unicode character.

In particular this statement worries me:

Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong.

This is a reasonable thing to do but only if it's consistent with the <?xml?> declaration. For example, it would be totally wrong to output an <?xml?> declaration encoding specified as UTF-8, yet encode the characters as UTF-16.

If you want to output UTF-16 that's fine but you must then include a byte-order mark (see reference above).

But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name

Yes - the encoding names must follow what is allowed for XML. What MySQL likes to call them is irrelevant.

So to summarize this:

  • Including the <?xml?> declaration is a good thing and we should keep doing it
  • As long as the <?xml?> declaration does not specify an encoding and there's no byte-order mark, then the file must be encoded in UTF-8
  • Using a different encoding, e.g., same as the database itself, is reasonable, but if this is done, the <?xml?> declaration must explicitly specify that encoding using the proper "XML" name for it

Questions:

  • Since currently there is no explicit encoding declared, can someone verify that MySQL currently does actually output XML using UTF-8 encoding, regardless of the encoding of the database?
  • Are there any MySQL-supported encodings that can't 100% be mapped into UTF-8? If so then they would have to be output using a different encoding to preserve all the characters.
Comment by Sergei Golubchik [ 2021-05-13 ]

Generally the dump is a mix of different character sets. Most of it is done in the --default-character-set value, which is utf8mb4 by default. But stored routines and alike (triggers, events) are dumped in their own body charsets.

I'd think users wouldn't want stored routines and triggers in the xml, and would disable them for xml output, but it's technically possible to make mysqldump to generate an xml which has different parts in different encodings.

Comment by Archie Cobbs [ 2021-05-13 ]

Hmm, well if it's possible for MySQL to output something that looks like XML but contains characters that are not encoded consistently with the <?xml?> declaration, then it's not outputting XML, it's outputting something else, in which case MySQL should be fixed to conform to the standard, or else stop calling the output "XML".

I'm not familiar with the internals, but from a coding point of view there may be a missing component. E.g., each MySQL encoding is going to need to know how to convert its character data to UTF-8 or UTF-16 (or perhaps throw an error if not possible). Then this layer would be applied before writing the XML output, etc.

Generated at Thu Feb 08 09:37:12 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.