[MDEV-24732] CP1252 is not Latin-1 Created: 2021-01-29  Updated: 2021-05-17  Resolved: 2021-05-17

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.1, 10.3, 10.5
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Marc Prat Masó Assignee: Sergei Golubchik
Resolution: Incomplete Votes: 0
Labels: character-set
Environment:

Ubuntu 20.04, Python 3.7 with MariaDB 10.3 and/or mariadb:10.1 (Docker)



 Description   

In MySQL, Latin-1 is an alias for cp1252. Meanwhile, the standard (e.g. Python) Latin-1 is ISO-8859-1.
The problem gets worse when accents and apostrophes occur and you use a strict decode (the default method in the Python MySQL library).
Here is a detailed explanation: https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
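
A minimal sketch of the mismatch, using only Python's built-in codecs. The sample bytes are made up for illustration: byte 0x92 is a right single quotation mark in CP1252 but a C1 control code point under Python's "latin-1" codec.

```python
# Illustrative bytes only: 0xE9 is "é" in both charsets, 0x92 differs.
raw = b"caf\xe9 d\x92Oc"

as_cp1252 = raw.decode("cp1252")   # what the server means by latin1
as_latin1 = raw.decode("latin-1")  # what an ISO-8859-1 client assumes

print(repr(as_cp1252))  # 'café d’Oc'
print(repr(as_latin1))  # 'café d\x92Oc'
```

The accented letter survives either way; only the apostrophe in the 0x80-0x9F range is misread.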

Solution: add an ISO-8859-1 charset and maybe change the documentation to avoid confusion.
Thanks

This duplicates a MySQL issue: https://bugs.mysql.com/bug.php?id=101556



 Comments   
Comment by Sergei Golubchik [ 2021-02-17 ]

python isn't the standard.

Microsoft is the standard for CP1252, they also call it "Windows Latin 1" or "ANSI Latin 1", depending on where you look.
ISO is the standard for iso8859-1. It also appears to be known as "ISO Latin 1".

It seems the issue you're describing is that the Python connector treats MariaDB's latin1 as "ISO Latin 1", while it is, in fact, "ANSI Latin 1". Looks like a bug in the MySQL Python connector then, doesn't it?

Comment by Marc Prat Masó [ 2021-02-17 ]

As I understand, this is what you say:
ISO Latin 1 equals ISO-8859-1 (a standard from the International Organization for Standardization), and ANSI Latin 1 equals CP1252 (a standard from Microsoft).
Anyway, it's obvious that changing the MariaDB alias is too late, but adding a new charset should be easy and would avoid future problems (and even help to fix this problem).

Also, as you mention, I will report this bug to Python too.
Thanks

Comment by Sergei Golubchik [ 2021-02-17 ]

Adding iso8859-1 as a new charset is possible. But it doesn't look meaningful to me, see, for example, here or here. ISO-8859-1 is a strict subset of CP1252, that is, anything that ISO-8859-1 can represent, CP1252 can represent too.
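
The relationship between the two charsets can be checked against Python's own codec tables. This is a sketch, not an authoritative comparison of the standards: it relies on Python's "latin-1" codec mapping 0x80-0x9F to C1 control code points, and on Python's "cp1252" codec leaving five bytes undefined.

```python
# Compare how every single byte decodes under the two Python codecs.
differing = []   # bytes that decode to different characters
undefined = []   # bytes Python's cp1252 codec refuses to decode

for b in range(256):
    byte = bytes([b])
    iso = byte.decode("latin-1")
    try:
        win = byte.decode("cp1252")
    except UnicodeDecodeError:
        undefined.append(b)
        continue
    if iso != win:
        differing.append(b)

print([hex(b) for b in undefined])
print(all(0x80 <= b <= 0x9F for b in differing))
```

Every difference falls in the 0x80-0x9F range; all printable ISO-8859-1 characters (0x20-0x7E and 0xA0-0xFF) decode identically under both codecs.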

The problem is not that the string is stored in the wrong charset; the problem is that it is interpreted by the client in the wrong charset. The client interprets strings in cp1252 as strings in iso-8859-1. The fix is to stop doing that: the client should interpret strings in cp1252 as strings in cp1252.
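
On the client side, that fix could look roughly like the sketch below. The lookup table and the `decode_column` helper are hypothetical names introduced here for illustration; they are not taken from any real connector.

```python
# Hypothetical mapping from server charset names to Python codec names.
# Not the API of any real connector; shown only to illustrate the idea.
SERVER_CHARSET_TO_PYTHON_CODEC = {
    "latin1": "cp1252",   # server-side latin1 is CP1252, per this discussion
    "utf8mb4": "utf-8",
}

def decode_column(raw: bytes, server_charset: str) -> str:
    """Decode bytes received from the server using the matching Python codec."""
    codec = SERVER_CHARSET_TO_PYTHON_CODEC[server_charset]
    return raw.decode(codec)

decoded = decode_column(b"l\x92\xe9t\xe9", "latin1")
print(decoded)  # l’été
```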

Comment by Marc Prat Masó [ 2021-02-18 ]

Yes, of course, for the end user this doesn't offer anything new if they stick to the same charset.
But this affected not only the selects but also the inserts, and for quite a long time.
Also, if this happened once with Python, it could happen with other software. Why not avoid the confusion?
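
One way inserts could be affected, assuming the client encodes outgoing text with Python's strict "latin-1" codec (an assumption; the thread does not spell out the mechanism): characters that exist only in CP1252, such as the typographic apostrophe, cannot be encoded at all.

```python
# Illustrative text containing U+2019, which CP1252 has and ISO-8859-1 lacks.
text = "l\u2019\u00e9t\u00e9"  # l’été

try:
    text.encode("latin-1")     # strict error handling is the default
    iso_ok = True
except UnicodeEncodeError:
    iso_ok = False

payload = text.encode("cp1252")

print(iso_ok)    # False
print(payload)   # b'l\x92\xe9t\xe9'
```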

Comment by Sergei Golubchik [ 2021-04-19 ]

The Python connector used the incorrect "latin 1" charset: it used "ISO Latin 1" instead of "ANSI Latin 1". I agree this bug could also happen with other software; it's unfortunate that the "Latin 1" name is so overloaded.

But we cannot help it, can we? Existing charsets cannot be renamed. Adding a new iso8859_1 charset won't remove the ambiguity from latin1.

What problem does the current definition of latin1 cause? How does it affect selects and inserts?

Generated at Thu Feb 08 09:32:14 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.