[MDEV-29389] Output French error strings in UTF8 - Jira

Details

Type: Task
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: None
Component/s: Storage Engine - Connect
Labels: None

Description

Modern software (including text editors, static analysis software,
and web-based code review interfaces) often requires source code files
to be interpretable via a consistent character encoding, with UTF-8 or
ASCII (a strict subset of UTF-8) as the default. Several of the MariaDB
source files contain bytes that are not valid in either the UTF-8 or
ASCII encodings, but instead represent strings encoded in the
ISO-8859-1/Latin-1 or ISO-8859-2/Latin-2 encodings.

This ticket was split out from the following PR to allow more discussion of how to handle these strings: https://github.com/MariaDB/server/pull/2224

In the PR, we are using '\x' escapes to replace the characters that are not valid ASCII. By doing this, we do not change the underlying encoding of these strings (ISO-8859-1). This ticket aims to foster discussion on the feasibility of changing MariaDB to output these strings in UTF-8 instead.
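To illustrate the '\x' approach, here is a minimal sketch in C. The messages and the names MSG_LATIN1, MSG_SPLIT, and is_ascii are hypothetical and are not taken from the MariaDB sources or from the PR. The point is that the source file becomes pure ASCII while the compiled string still holds the same ISO-8859-1 bytes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical message, not from the MariaDB sources.
   "Mémoire insuffisante" in ISO-8859-1: the accented e is the single byte 0xE9. */
static const char MSG_LATIN1[] = "M\xe9moire insuffisante";

/* A hex escape consumes every hex digit that follows it, so when the next
   letter is a-f the literal has to be split into two adjacent literals. */
static const char MSG_SPLIT[] = "non d\xe9" "fini";

/* Returns 1 if every byte of s is 7-bit ASCII, 0 otherwise. */
static int is_ascii(const char *s)
{
    for (; *s; s++)
        if ((unsigned char)*s > 0x7F)
            return 0;
    return 1;
}
```

The escaped literal is 20 bytes long, exactly as it would be if the raw 0xE9 byte were typed into the source file, so the output of the program does not change.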

Activity

Vladislav Vaintroub added a comment - 2022-08-26 08:30

I do not think you have to worry about those French strings in those header files. The header files are not used in compilation; you'd need the preprocessor constant -DFRENCH for them to be used. Thus, you can change them to whatever you want, or remove them entirely, together with the #if defined(FRENCH) blocks inside the storage/connect directory.
Or you can convert them to UTF8; whatever you do to those headers has no effect.
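For readers unfamiliar with this kind of guard, a rough sketch of the pattern follows. MSG_FILE_NOT_FOUND and file_not_found_msg are hypothetical names used only for illustration; the actual macros and layout in storage/connect may differ.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical message macro; the real CONNECT headers use their own names. */
#if defined(FRENCH)
#define MSG_FILE_NOT_FOUND "Fichier non trouv\xe9"  /* ISO-8859-1 bytes */
#else
#define MSG_FILE_NOT_FOUND "File not found"
#endif

/* Returns whichever variant the preprocessor selected at build time. */
static const char *file_not_found_msg(void)
{
    return MSG_FILE_NOT_FOUND;
}
```

Unless the build passes -DFRENCH, the French branch is never compiled, which is why edits to it cannot change the behavior of a default build.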


Daniel Lenski (Inactive) added a comment - 2022-08-31 17:04

wlad wrote:

I do not think you have to worry about those french strings in those header files

I don't understand this idea that we "shouldn't worry about these strings". PR #2224 was opened specifically because automated tools (including both Flawfinder and the GitHub web UI itself) were getting confused by the encoding of the string literals in these source files.

Or you can convert them to UTF8, whatever you do to those headers has no effect.

Converting these string literals to UTF8 certainly will have an effect. Outputting a UTF8-encoded error string to a console with locale fr_FR.ISO-8859-1 will cause an unreadable/mangled message, just like outputting an ISO-8859-1-encoded error string to a console with locale fr_FR.UTF-8.
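The byte-level difference behind this can be sketched in C. The helper latin1_to_utf8 below is a hypothetical illustration, not code from MariaDB; it relies only on the fact that each ISO-8859-1 byte has the same value as its Unicode code point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Converts an ISO-8859-1 string to UTF-8. Bytes below 0x80 are copied
   unchanged; bytes from 0x80 up become a two-byte UTF-8 sequence.
   Returns the number of bytes written, excluding the terminator, or 0
   if dst is too small (an empty input also yields 0). */
static size_t latin1_to_utf8(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    if (dst_size == 0)
        return 0;
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        size_t need = (c < 0x80) ? 1 : 2;

        if (n + need + 1 > dst_size)   /* keep room for the terminator */
            return 0;
        if (c < 0x80) {
            dst[n++] = (char)c;
        } else {
            dst[n++] = (char)(0xC0 | (c >> 6));
            dst[n++] = (char)(0x80 | (c & 0x3F));
        }
    }
    dst[n] = '\0';
    return n;
}
```

The single Latin-1 byte 0xE9 becomes the two bytes 0xC3 0xA9. A Latin-1 console shows those two bytes as two separate characters, while a UTF-8 console receiving the lone 0xE9 byte sees an invalid sequence, so either mismatch garbles the message.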


Daniel Lenski (Inactive) added a comment - 2022-08-31 17:09

Recently, linuxjedi closed https://github.com/MariaDB/server/pull/2224. However, for the reasons I describe in this comment, I still believe it's a useful first step and should be merged:

Giant lists of case statements in C source files simply aren't a great way to represent translated strings … gettext would be a far better way to do it.

But I believe that this PR is still useful and should be merged, because it achieves the short-term goal of making the encoding of these string literals unambiguous to both human and machine readers, without changing the exact bytes that they contain.


People

Assignee: Unassigned
Reporter: Anson Chung
Votes: 0
Watchers: 6

Dates

Created: 2022-08-25 18:05
Updated: 2022-08-31 17:09
