Some comments...
First, The <?xml?> declaration is indeed optional, but it's always better to have it than to not have. I think we can all agree on this. Newer versions of the XML spec make it mandatory.
Secondly, character encoding used in the file must match what is declared in the <?xml?> (or implicitly declared by omission: if there is no encoding specified, then the encoding is being implicitly declared as UTF-8 or UTF-16). More info on encoding is here.
Remember "XML" is a well-defined thing. If you're going to claim to output "XML" then you have to follow the XML rules. Note that a "character" in XML is by definition a Unicode character.
In particular this statement worries me:
Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong.
This is a reasonable thing to do but only if it's consistent with the <?xml?> declaration. For example, it would be totally wrong to output an <?xml?> declaration encoding specified as UTF-8, yet encode the characters as UTF-16.
If you want to output UTF-16 that's fine but you must then include a byte-order mark (see reference above).
But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name
Yes - the encoding names must follow what is allowed for XML. What MySQL likes to call them is irrelevant.
So to summarize this:
- Including the <?xml?> declaration is a good thing and we should keep doing it
- As long as the <?xml?> declaration does not specify an encoding and there's no byte-order mark, then the file must be encoded in UTF-8
- Using a different encoding, e.g., same as the database itself, is reasonable, but if this is done, the <?xml?> declaration must explicitly specify that encoding using the proper "XML" name for it
Questions:
- Since currently there is no explicit encoding declared, can someone verify that MySQL currently does actually output XML using UTF-8 encoding, regardless of the encoding of the database?
- Are there any MySQL-supported encodings that can't 100% be mapped into UTF-8? If so then they would have to be output using a different encoding to preserve all the characters.
This seems to be an easy fix. I'd like to work on it.
I have created a fork (https://github.com/haidong/server) and started digging around. It looks line 3829 under client/mysql.cc is where the problem. See snippet below.
// Line 3829
static void
print_table_data_xml(MYSQL_RES *result)
{
MYSQL_ROW cur;
MYSQL_FIELD *fields;
mysql_field_seek(result,0);
tee_fputs("<?xml version=\"1.0\"?>\n\n<resultset statement=\"", PAGER); // I think this is our main problem!
xmlencode_print(glob_buffer.ptr(), (int)strlen(glob_buffer.ptr()));
tee_fputs("\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">",
PAGER);
Additionally, there are a few places, particularly files in tests, that could also benefit from this small change. For example, mysql-test/main/client_xml.result file has many places where this encoding info might be helpful.
I have read the contribution guide. I'll start hacking against my own forked repo. I'd appreciate any steps you want me to follow and suggestions/tips you can give.
Thanks!