MDEV-26713

Windows - improve i18n support




      Linux systems now use UTF-8 for terminal/console I/O (having converted from latin1 sometime in the last 15 years), but Windows Unicode support trails behind. Using the OEMCP encoding in the command line client, as we do now, was acceptable perhaps 20 years ago, but it is no longer adequate.

      Many people try to use Unicode on the command line on Windows with MariaDB, and fail.

      It is definitely possible to input characters outside of the OEMCP range and to write Unicode to the console. cmd.exe poses some challenges, e.g. one must choose a monospaced font that supports CJK in order to display CJK, though doing that is relatively easy.
      Windows Terminal, however, has none of these problems: font selection is automatic, and the font is monospaced and covers both Eastern and Western alphabets and ideographs.

      The plan

      We will go all-UTF-8, but only starting with reasonably modern Windows (Windows 10 1903 and later).
      That version allows defining UTF-8 as the active process codepage (https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) via a special setting in the application manifest. We will use it, which has the following implications:

      • main() command line arguments are UTF-8
      • Win32 ANSI APIs (e.g. CreateFileA) take UTF-8
      • Console input and output codepages remain OEMCP by default, but we will change them to UTF-8 via SetConsoleCP() and SetConsoleOutputCP()
      • ReadConsoleA is still buggy with UTF-8, so we will have to read input with ReadConsoleW and convert it to the codepage returned by GetConsoleCP()
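The conversion step in the last bullet can be sketched portably. This is only an illustration under stated assumptions: it does not call the Win32 API, the UTF-16LE buffer stands in for what ReadConsoleW would return, and `codec_for_codepage` and `utf16_to_console_bytes` are hypothetical helper names, not part of the client.

```python
def codec_for_codepage(codepage: int) -> str:
    """Map a Windows codepage number to a Python codec name (illustrative)."""
    # 65001 is CP_UTF8; other codepages use Python's "cpNNN" codec names.
    return "utf-8" if codepage == 65001 else f"cp{codepage}"


def utf16_to_console_bytes(wide: bytes, codepage: int) -> bytes:
    """Convert a UTF-16LE buffer to bytes in the given codepage.

    Characters that the target codepage cannot represent become '?'.
    """
    text = wide.decode("utf-16-le")
    return text.encode(codec_for_codepage(codepage), errors="replace")


if __name__ == "__main__":
    typed = "hällo".encode("utf-16-le")        # stand-in for ReadConsoleW output
    print(utf16_to_console_bytes(typed, 65001))  # UTF-8 console codepage
    print(utf16_to_console_bytes(typed, 850))    # legacy OEM codepage
```

With the console codepage set to UTF-8 the conversion is lossless; with a legacy OEM codepage any character outside that codepage is replaced.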

      Changes on Windows 10 1903 and later (Windows Server 2022+, Windows 11+)

      • The mysql client will connect with utf8mb4 and display all characters correctly.
      • The mojibake that comes from the mismatch between ANSI command line parameters and the console OEMCP codepage will be fixed (see below).
      • The charset defined by chcp will be ignored, but one can pass --default-character-set if, for whatever reason, UTF-8 is not desired (e.g. to control the output charset in scripts that use stdio redirection).
      • my.ini will allow i18n paths, passwords etc., provided the file is correctly stored as UTF-8.
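As a rough illustration of what "correctly stored as UTF-8" means for my.ini, the sketch below re-encodes a config file only if it is not already valid UTF-8. The function name `convert_ini_to_utf8` is hypothetical, and the default legacy codec `cp1252` is an assumption (the ANSI codepage of Western European Windows); other systems would need a different one.

```python
from pathlib import Path


def convert_ini_to_utf8(path, legacy_codec="cp1252"):
    """Re-encode a config file to UTF-8 in place.

    Returns True if the file was rewritten, False if it already was
    valid UTF-8 (pure ASCII files also fall into this case).
    legacy_codec is an assumption about how the file was saved.
    """
    raw = Path(path).read_bytes()
    try:
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        pass
    # Work on bytes so that line endings are preserved as they are.
    Path(path).write_bytes(raw.decode(legacy_codec).encode("utf-8"))
    return True
```

Calling it twice on the same file is harmless: the second call finds valid UTF-8 and leaves the file untouched.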

      Implications of the upgrade for the end user

      • If a user does not use mariadb-upgrade-service for the upgrade and has non-ASCII paths (or user names, database names, etc.) in my.ini, they will need to convert the config file to UTF-8 themselves.
        https://stackoverflow.com/a/76808/547065 shows how to do that with PowerShell, but Notepad nowadays will also do.
      • If a user has non-ASCII characters in the password, they will not be able to connect out of the box. For authentication to succeed, the password must consist of the same bytes that were used in SET PASSWORD or CREATE USER, and those bytes differ when the encoding changes from OEM or ANSI to UTF-8.
        We will provide a workaround:

            mariadb --user=<name> --password=<passwd> --default-character-set=<charset>

        will convert <passwd>, as well as <name>, to <charset> bytes on UTF-8 Windows, so that a user can log in in most cases by using either the platform's OEM or ANSI charset. Note: we do not do anything like that elsewhere.
        Note: outside of Windows the problem is not well researched or known, or maybe forgotten by now, because Unixes conveniently standardized on UTF-8 in the last decades. See both the answer and the comments here.
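To make the password problem concrete, the following sketch shows that the same password text produces different bytes under an OEM, an ANSI and the UTF-8 encoding, and what re-encoding a UTF-8 argument to the chosen charset amounts to. The codepages 850 and 1252 are assumed examples, and `password_bytes` and `workaround_bytes` are hypothetical names; this is not the client's actual code.

```python
def password_bytes(password: str, charset: str) -> bytes:
    """Bytes that would be sent for a password typed under a given charset."""
    return password.encode(charset)


def workaround_bytes(utf8_arg: bytes, charset: str) -> bytes:
    """Re-encode a UTF-8 command line argument to the requested charset,
    mimicking what the described workaround would do."""
    return utf8_arg.decode("utf-8").encode(charset)


if __name__ == "__main__":
    for label, codec in [("OEM", "cp850"), ("ANSI", "cp1252"), ("UTF-8", "utf-8")]:
        print(label, codec, password_bytes("pässword", codec).hex(" "))
    # 'ä' is 0x84 in cp850, 0xE4 in cp1252 and 0xC3 0xA4 in UTF-8,
    # so the server sees three different passwords.
```

A user whose password was originally set from a cp850 console would pass that charset, so that the re-encoded bytes match what the server stored.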

      What happens on older Windows

      --default-character-set will (for the first time!) show characters correctly on the console, even without chcp.

      Note: there is only one older Windows variant we have to support, Windows Server 2019, but it is currently the most used server edition and will stick around for a long time. It is capable of using Unicode on the console, but CP_UTF8 cannot be made the active ANSI codepage of the process, so neither main() command line arguments nor CreateFileA (or other ANSI APIs) can be made UTF-8 without major effort.


      • --default-character-set in the client, when used, will be handled by setting the console input and output codepages according to the charset.
      • Illustration of command line glitches as of today: passing non-ASCII characters via the command line

        mysql.exe -uroot -e "select 'hällo'"
        | hΣllo |
        | hΣllo |

        Explanation of the mojibake: the command line argument is passed in the ANSI charset (codepage 1252), while the console output uses the OEM charset (codepage 437). This is how ä becomes misinterpreted as Σ. However, characters that are typed, rather than passed via parameters, are interpreted correctly. This does not affect only the command line client; even something as simple as echo.exe is affected.

        C:\work\10.6\xxx> C:\work\10.6\xxx\client\Debug\echo.exe hällö
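The mojibake in the mysql.exe illustration can be reproduced without Windows. The sketch below encodes the text in the ANSI codepage and decodes the resulting bytes in an OEM codepage. It assumes the OEM codepage is 437, which is the one where byte 0xE4 maps to Σ; `mojibake` is a hypothetical helper name.

```python
def mojibake(text: str, written_as: str = "cp1252", shown_as: str = "cp437") -> str:
    """Encode text in one codepage and misread the bytes in another."""
    return text.encode(written_as).decode(shown_as)


if __name__ == "__main__":
    print(mojibake("hällo"))                    # hΣllo
    print(mojibake("hällo", shown_as="cp850"))  # hõllo
```

The second call shows that the garbled character depends on the OEM codepage in use: under codepage 850 the same byte is displayed as õ.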

      What others do

      • The MySQL client can do Unicode, but in a rather awkward manner, based on my original patch of 12 years ago.
        MySQL "fixed" the patch so that UTF-8 is not enabled by default, and one needs to pass --default-character-set=utf8 on the command line for the magic to work. The patch itself is large and complex, partly because the C runtime did not handle UTF-8 well, UTF-8 was not a valid codepage to pass to SetConsoleCP(), and so on. So console I/O was detected, and the fputc/fputs family of functions was replaced with the win_console_fputs, win_console__fputc etc. family of functions, which all end up calling WriteConsoleW. That was necessary back then, but the C runtime and Windows have come a long way since in supporting UTF-8.
      • Some other projects, such as Ninja and R, are moving or have recently moved in the same direction as described here.


              wlad Vladislav Vaintroub