[MDEV-25374] XML output does not include character encoding in <?xml?> declaration Created: 2021-04-08 Updated: 2021-05-13 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Scripts & Clients |
| Affects Version/s: | 10.4.17 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Minor |
| Reporter: | Archie Cobbs | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | beginner-friendly | ||
| Environment: |
openSUSE 15.2 libmariadb3-3.1.12-lp152.7.1.x86_64 |
||
| Description |
|
When invoking mysql --xml from the command line, the first part of the output is:
Note there is no encoding specified in the XML declaration. As long as mysql is outputting UTF-8, this is technically correct because UTF-8 is the default (discussion here). However, to avoid any possible confusion, perhaps the XML declaration should include an explicit declaration of the character encoding, e.g.:
Just a thought... |
| Comments |
| Comment by Haidong Ji [ 2021-05-01 ] | |||||||||||||
|
This seems to be an easy fix. I'd like to work on it. I have created a fork (https://github.com/haidong/server) and started digging around. It looks line 3829 under client/mysql.cc is where the problem. See snippet below.
Additionally, there are a few places, particularly files in tests, that could also benefit from this small change. For example, mysql-test/main/client_xml.result file has many places where this encoding info might be helpful. I have read the contribution guide. I'll start hacking against my own forked repo. I'd appreciate any steps you want me to follow and suggestions/tips you can give. Thanks! | |||||||||||||
| Comment by Haidong Ji [ 2021-05-01 ] | |||||||||||||
|
After further reading, I'm not sure if this is actually a bug. In fact, I wonder adding it may actually do more harm than good. 1. It looks for XML 1.0, encoding info in header is not required. See this stackoverflow article: 2. Checking the man page for mysql client, it says "The output when --xml is used with mysql matches that of mysqldump --xml. See mysqldump(1) for details." Looking at man page for mysqldump xml section, I suspect the encoding info is perhaps omitted on purpose. Shouldn't the encoding of the dump results match that of the database? If the database encoding is utf16, then the hard-coded utf8 is wrong. So I don't plan to work on this now. I'll probably pick other beginner-friendly issues to work on. If you have suggestions on what I could work on, that'd be good too. Thanks! | |||||||||||||
| Comment by Sergei Golubchik [ 2021-05-05 ] | |||||||||||||
|
Good point. I think that encoding, if at all specified, should be not hard-coded UTF-8, but should actually match the encoding of the following data. mysql client knows what encoding it uses, so, technically, it could be possible to put it into the xml declaration. But mysql client only knows MariaDB name of the character set, there must be some mapping to find a corresponding XML encoding name, | |||||||||||||
| Comment by Archie Cobbs [ 2021-05-05 ] | |||||||||||||
|
Some comments... First, The <?xml?> declaration is indeed optional, but it's always better to have it than to not have. I think we can all agree on this. Newer versions of the XML spec make it mandatory. Secondly, character encoding used in the file must match what is declared in the <?xml?> (or implicitly declared by omission: if there is no encoding specified, then the encoding is being implicitly declared as UTF-8 or UTF-16). More info on encoding is here. Remember "XML" is a well-defined thing. If you're going to claim to output "XML" then you have to follow the XML rules. Note that a "character" in XML is by definition a Unicode character. In particular this statement worries me:
This is a reasonable thing to do but only if it's consistent with the <?xml?> declaration. For example, it would be totally wrong to output an <?xml?> declaration encoding specified as UTF-8, yet encode the characters as UTF-16. If you want to output UTF-16 that's fine but you must then include a byte-order mark (see reference above).
Yes - the encoding names must follow what is allowed for XML. What MySQL likes to call them is irrelevant. So to summarize this:
Questions:
| |||||||||||||
| Comment by Sergei Golubchik [ 2021-05-13 ] | |||||||||||||
|
Generally the dump is a mix of different character sets. Most of it is done in the --default-character-set value, which is utf8mb4 by default. But stored routines and alike (triggers, events) are dumped in their own body charsets. I'd think users wouldn't want stored routines and triggers in the xml, and would disable them for xml output, but it's technically possible to make mysqldump to generate an xml which has different parts in different encodings. | |||||||||||||
| Comment by Archie Cobbs [ 2021-05-13 ] | |||||||||||||
|
Hmm, well if it's possible for MySQL to output something that looks like XML but contains characters that are not encoded consistently with the <?xml?> declaration, then it's not outputting XML, it's outputting something else, in which case MySQL should be fixed to conform to the standard, or else stop calling the output "XML". I'm not familiar with the internals, but from a coding point of view there may be a missing component. E.g., each MySQL encoding is going to need to know how to convert its character data to UTF-8 or UTF-16 (or perhaps throw an error if not possible). Then this layer would be applied before writing the XML output, etc. |