[MDEV-29389] Output French error strings in UTF8 Created: 2022-08-25  Updated: 2022-08-31

Status: Open
Project: MariaDB Server
Component/s: Storage Engine - Connect
Fix Version/s: None

Type: Task Priority: Major
Reporter: Anson Chung Assignee: Unassigned
Resolution: Unresolved Votes: 0
Labels: None


 Description   

Modern software (including text editors, static analysis software,
and web-based code review interfaces) often requires source code files
to be interpretable via a consistent character encoding, with UTF-8 or
ASCII (a strict subset of UTF-8) as the default. Several of the MariaDB
source files contain bytes that are not valid in either the UTF-8 or
ASCII encodings, but instead represent strings encoded in the
ISO-8859-1/Latin-1 or ISO-8859-2/Latin-2 encodings.
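
As a rough illustration of how a tool might flag such files, the following sketch classifies a chunk of bytes as ASCII, UTF-8, or neither. The helper name `classify_encoding` and the sample line are hypothetical and are not taken from the MariaDB tree.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a chunk of source bytes."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Neither ASCII nor UTF-8; possibly Latin-1/Latin-2, but this
        # check alone cannot tell which legacy encoding was used.
        return "other"


# Hypothetical source line, stored as ISO-8859-1 bytes (0xE9 for the accented e).
latin1_line = 'case 1: p = "Mémoire insuffisante"; break;'.encode("latin-1")

print(classify_encoding(b'case 1: p = "Out of memory"; break;'))
print(classify_encoding(latin1_line.decode("latin-1").encode("utf-8")))
print(classify_encoding(latin1_line))
```

The last call reports 'other' because the lone byte 0xE9 is not a complete UTF-8 sequence.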

This ticket stems from the following PR and was opened to allow more discussion about how to handle these strings: https://github.com/MariaDB/server/pull/2224

In the PR, we use '\x' escapes to replace the characters that are not valid ASCII. This does not change the underlying encoding of these strings (ISO-8859-1). This ticket aims to foster discussion about whether it is feasible to change MariaDB to output these strings in UTF-8 instead.
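
A minimal sketch of the '\x' escaping idea, assuming the input is the raw body of a C string literal stored as ISO-8859-1 bytes. The function name `escape_non_ascii` is hypothetical and this is not the script used in the PR. One detail matters in C: a hex escape absorbs every hex digit that follows it, so the sketch splits the literal (relying on adjacent string literal concatenation) when an escaped byte is followed by a hex digit.

```python
HEX_DIGITS = b"0123456789abcdefABCDEF"


def escape_non_ascii(raw: bytes) -> str:
    """Rewrite a C string literal body so every byte >= 0x80 becomes \\xNN.

    ASCII bytes are copied through unchanged, so the compiled string
    contains exactly the same bytes as before.
    """
    out = []
    prev_was_escape = False
    for byte in raw:
        if byte >= 0x80:
            out.append(f"\\x{byte:02x}")
            prev_was_escape = True
        else:
            if prev_was_escape and byte in HEX_DIGITS:
                # Close and reopen the literal so the C compiler does not
                # read this character as part of the preceding hex escape.
                out.append('" "')
            out.append(chr(byte))
            prev_was_escape = False
    return "".join(out)


print(escape_non_ascii("Mémoire".encode("latin-1")))
print(escape_non_ascii("réel".encode("latin-1")))
```

The second call shows the split: the 'e' after the escaped byte is a hex digit, so the literal is closed and reopened before it.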



 Comments   
Comment by Vladislav Vaintroub [ 2022-08-26 ]

I do not think you have to worry about those French strings in those header files. The header files are not used in compilation; you would need to define the preprocessor constant FRENCH (-DFRENCH) for them to be used. Thus, you can change them to whatever you want, or remove them entirely, together with #if defined(FRENCH) inside the storage/connect directory.
Or you can convert them to UTF8; whatever you do to those headers has no effect.

Comment by Daniel Lenski [ 2022-08-31 ]

wlad wrote:

I do not think you have to worry about those French strings in those header files

I don't understand this idea that we "shouldn't worry about these strings". PR #2224 was opened specifically because automated tools (including both Flawfinder and the GitHub web UI itself) were getting confused by the encoding of the string literals in these source files.

Or you can convert them to UTF8; whatever you do to those headers has no effect.

Converting these string literals to UTF8 certainly will have an effect. Outputting a UTF8-encoded error string to a console with locale fr_FR.ISO-8859-1 will cause an unreadable/mangled message, just like outputting an ISO-8859-1-encoded error string to a console with locale fr_FR.UTF-8.
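
The mismatch described above can be reproduced with a short sketch. The message text is a hypothetical example, and decoding with the "wrong" codec stands in for what a terminal in that locale would display; real terminals may render invalid bytes differently.

```python
message = "Mémoire insuffisante"  # hypothetical error string

utf8_bytes = message.encode("utf-8")      # 'é' becomes 0xC3 0xA9
latin1_bytes = message.encode("latin-1")  # 'é' becomes 0xE9

# UTF-8 bytes interpreted by an ISO-8859-1 console: each byte of the
# two-byte sequence is shown as its own Latin-1 character.
utf8_on_latin1 = utf8_bytes.decode("latin-1")

# ISO-8859-1 bytes interpreted by a UTF-8 console: 0xE9 is an incomplete
# UTF-8 sequence, approximated here by the U+FFFD replacement character.
latin1_on_utf8 = latin1_bytes.decode("utf-8", errors="replace")

print(utf8_on_latin1)
print(latin1_on_utf8)
```

Neither direction shows the intended text, so the encoding of the string literals has to match what the console expects.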

Comment by Daniel Lenski [ 2022-08-31 ]

Recently, linuxjedi closed https://github.com/MariaDB/server/pull/2224. However, for the reasons I describe in this comment I still believe it's a useful first step, and should be merged:

Giant lists of case statements in C source files simply aren't a great way to represent translated strings … gettext would be a far better way to do it.

But I believe that this PR is still useful and should be merged, because it achieves the short-term goal of making the encoding of these string literals unambiguous to both human and machine readers, without changing the exact bytes that they contain.
