Details

    Description

      Modern software (including text editors, static analysis software,
      and web-based code review interfaces) often requires source code files
      to be interpretable via a consistent character encoding, with UTF-8 or
      ASCII (a strict subset of UTF-8) as the default. Several of the MariaDB
      source files contain bytes that are not valid in either the UTF-8 or
      ASCII encodings, but instead represent strings encoded in the
      ISO-8859-1/Latin-1 or ISO-8859-2/Latin-2 encodings.

      This JIRA issue stems from the following PR, and was opened to allow more discussion of how to handle these strings: https://github.com/MariaDB/server/pull/2224

      In the PR, we use '\x' escapes to replace the bytes that are not valid ASCII. This leaves the encoding of these strings (ISO-8859-1) unchanged. This ticket aims to foster discussion of whether it is feasible to change MariaDB to output these strings in UTF-8 instead.
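      To illustrate the approach described above, here is a minimal C sketch. The message text and all identifiers are hypothetical and are not taken from the MariaDB sources or from the PR. It shows that a literal written with a '\x' escape is pure ASCII in the source file while still holding the same ISO-8859-1 bytes at run time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical message (French "debut" with an acute accent on the e).
   0xE9 is that accented letter in ISO-8859-1. The literal is split in
   two because a \x escape consumes every hex digit that follows it,
   and the 'b' of "but" is a valid hex digit. */
static const char msg_escaped[] = "d\xe9" "but";

/* The same text spelled out byte by byte in ISO-8859-1. */
static const unsigned char msg_latin1[] = { 'd', 0xE9, 'b', 'u', 't', 0 };

/* Returns 1 if the escaped literal holds exactly the ISO-8859-1 bytes. */
int escaped_matches_latin1(void)
{
  return sizeof msg_escaped == sizeof msg_latin1 &&
         memcmp(msg_escaped, msg_latin1, sizeof msg_latin1) == 0;
}

/* Returns the number of bytes >= 0x80 in a NUL-terminated string. */
size_t count_non_ascii(const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; s++)
    if ((unsigned char)*s >= 0x80)
      n++;
  return n;
}
```

      The source text of msg_escaped contains only ASCII characters, so tools that expect ASCII or UTF-8 input can read the file, yet count_non_ascii(msg_escaped) still reports one non-ASCII byte in the compiled string.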

        Activity

          I do not think you have to worry about those French strings in those header files. The header files are not used in compilation; they are only used if the preprocessor constant FRENCH is defined (-DFRENCH). Thus, you can change them to whatever you want, or remove them entirely, together with the #if defined(FRENCH) blocks inside the storage/connect directory.
          Or you can convert them to UTF-8; whatever you do to those headers has no effect.
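          The conditional compilation described in this comment can be sketched as follows. This is only an illustration under assumed names: MSG_BAD_KEY, bad_key_message and the message texts are hypothetical, and the actual layout of the headers in storage/connect may differ.

```c
#include <string.h>

/* The French text is only compiled in when the build passes -DFRENCH.
   0xE9 is the ISO-8859-1 byte for e with an acute accent. */
#if defined(FRENCH)
#define MSG_BAD_KEY "Cl\xe9 invalide"
#else
#define MSG_BAD_KEY "Invalid key"
#endif

/* Returns whichever message was selected at compile time. */
const char *bad_key_message(void)
{
  return MSG_BAD_KEY;
}
```

          In a build without -DFRENCH, the preprocessor discards the French branch before the compiler sees it, which is why edits to that branch do not change the resulting binary.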

          wlad Vladislav Vaintroub added a comment

          wlad wrote:

          I do not think you have to worry about those French strings in those header files

          I don't understand this idea that we "shouldn't worry about these strings". PR #2224 was opened specifically because automated tools (including both Flawfinder and the GitHub web UI itself) were getting confused by the encoding of the string literals in these source files.

          Or you can convert them to UTF-8; whatever you do to those headers has no effect.

          Converting these string literals to UTF-8 certainly will have an effect. Outputting a UTF-8-encoded error string to a console with locale fr_FR.ISO-8859-1 will produce an unreadable, mangled message, just as outputting an ISO-8859-1-encoded error string to a console with locale fr_FR.UTF-8 will.
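          The mismatch described above comes down to different byte sequences for the same character. As a rough sketch (the function name latin1_to_utf8 and the sample text are hypothetical, not part of MariaDB), the conversion below shows that the single ISO-8859-1 byte 0xE9 becomes the two UTF-8 bytes 0xC3 0xA9, which a console expecting ISO-8859-1 would display as two separate characters.

```c
#include <stddef.h>

/* Converts a NUL-terminated ISO-8859-1 string to UTF-8.
   Returns the number of bytes written (excluding the NUL), or 0 if
   the output buffer is too small (also 0 for an empty input). */
size_t latin1_to_utf8(const char *in, char *out, size_t out_size)
{
  size_t n = 0;
  if (out_size == 0)
    return 0;
  for (; *in != '\0'; in++) {
    unsigned char c = (unsigned char)*in;
    if (c < 0x80) {
      /* ASCII bytes are identical in both encodings. */
      if (n + 1 >= out_size)
        return 0;
      out[n++] = (char)c;
    } else {
      /* Code points 0x80..0xFF need two bytes in UTF-8. */
      if (n + 2 >= out_size)
        return 0;
      out[n++] = (char)(0xC0 | (c >> 6));
      out[n++] = (char)(0x80 | (c & 0x3F));
    }
  }
  out[n] = '\0';
  return n;
}
```

          For example, converting the 5-byte ISO-8859-1 string "d\xe9" "but" yields a 6-byte UTF-8 string, so the two forms cannot both render correctly on the same console.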

          dlenski Daniel Lenski (Inactive) added a comment

          Recently, linuxjedi closed https://github.com/MariaDB/server/pull/2224. However, for the reasons I describe in this comment I still believe it's a useful first step, and should be merged:

          Giant lists of case statements in C source files simply aren't a great way to represent translated strings … gettext would be a far better way to do it.

          But I believe that this PR is still useful and should be merged, because it achieves the short-term goal of making the encoding of these string literals unambiguous to both human and machine readers, without changing the exact bytes that they contain.

          dlenski Daniel Lenski (Inactive) added a comment

          People

            Unassigned Unassigned
            ansondchu Anson Chung