Linux systems now use UTF-8 for terminal/console IO (having converted from latin1 sometime in the last 15 years), but Windows Unicode support trails behind here. Using the OEMCP encoding in the command line client, as we do now, was OK maybe 20 years ago, but it is no longer up to date.
It is definitely possible to input characters outside of the OEMCP range and to write Unicode to the console. cmd.exe poses some challenges, e.g. a monospace font that supports CJK has to be chosen if we want CJK, though doing that is relatively easy.
Windows Terminal, however, has none of these problems: font selection is automatic, and the font is monospace and covers both Eastern and Western alphabets/ideographs.
We will go all-UTF-8, but only starting with reasonably modern Windows (Windows 10 1903 and later).
These versions allow the active process code page to be set to UTF-8 (https://docs.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page) via a special setting in the application manifest. We will use it, which has the following implications:
- main() command line arguments are UTF-8
- Win32 ANSI APIs (e.g. CreateFileA) use UTF-8
- Console input and output code pages remain OEMCP by default, but we will change them to UTF-8 via SetConsoleCP() and SetConsoleOutputCP()
- ReadConsoleA is still buggy with UTF-8, so we'll have to read input with ReadConsoleW and convert it to the code page returned by GetConsoleCP()
- The mysql client will connect with utf8mb4 and display all characters correctly.
- The mojibake that comes from the mismatch between ANSI command line parameters and the console OEMCP code page will be fixed (see below)
- The charset defined by chcp will be ignored, but one can pass --default-charset if for whatever reason UTF-8 is not desired (e.g. to control the output charset in scripts that use stdio redirection)
- my.ini will allow i18n paths, passwords etc., provided it is correctly stored as a UTF-8 file.
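The ReadConsoleW step in the list above amounts to a UTF-16 to code page conversion. A minimal Python sketch of just that conversion, assuming the wide-character buffer has already been read; `wide_to_codepage` is a hypothetical helper name, and Python codec names stand in for Windows code page numbers:

```python
def wide_to_codepage(wide_buf: bytes, codepage: str = "utf-8") -> bytes:
    """Convert a UTF-16LE buffer (what ReadConsoleW fills) to `codepage` bytes."""
    text = wide_buf.decode("utf-16-le")
    return text.encode(codepage)


# "Ä" is one UTF-16 code unit and two UTF-8 bytes.
print(wide_to_codepage("Ä".encode("utf-16-le")))

# An emoji is a surrogate pair in UTF-16 and four bytes in UTF-8,
# which utf8mb4 can store and the 3-byte utf8mb3 cannot.
print(len(wide_to_codepage("😀".encode("utf-16-le"))))
```

In the client itself the same step would presumably be done with a Win32 call such as WideCharToMultiByte; that is an assumption, not something stated above.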
Implications of the upgrade for the end user:
- If the user does not use mariadb-upgrade-service for the upgrade and has non-ASCII paths (or usernames, databases, etc.) in my.ini, he'll need to convert the config file to UTF-8 himself.
https://stackoverflow.com/a/76808/547065 shows how to do that with PowerShell, but Notepad nowadays will also do.
- If a user had non-UTF-8 bytes in the password, he won't be able to connect out of the box. Successful authentication requires the password to be the same bytes that were used in SET PASSWORD or CREATE USER, and those bytes will differ when changing from OEM or ANSI to UTF-8.
We'll provide a workaround:
will convert <passwd>, as well as <name>, to <charset> bytes on UTF-8 Windows, so that a user can log in in most cases by using either his platform's OEM or ANSI charset. Note: we do not do anything like that elsewhere.
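Both upgrade items above come down to the same text being different bytes in different charsets. A small Python sketch, assuming a Western European setup (ANSI code page 1252, OEM code page 850); the function names and the sample strings are made up for illustration:

```python
def config_to_utf8(raw: bytes, old_codepage: str = "cp1252") -> bytes:
    """Re-encode the contents of a config file from the old code page to UTF-8."""
    return raw.decode(old_codepage).encode("utf-8")


def password_bytes(password: str) -> dict:
    """Return the bytes the same password has in each client-side charset."""
    return {cs: password.encode(cs) for cs in ("cp850", "cp1252", "utf-8")}


# A my.ini line as saved by an ANSI editor, then converted to UTF-8.
ansi_line = "datadir=C:/Daten/Größe\n".encode("cp1252")
print(config_to_utf8(ansi_line))

# One password, three different byte strings depending on the charset
# in effect when it was entered (OEM, ANSI or UTF-8).
for charset, data in password_bytes("geheim-ä").items():
    print(charset, data.hex())
```

Converting a real file would wrap the same call in a binary read and write of my.ini. The old code page has to be known, since a wrong guess still produces valid UTF-8, just with the wrong characters.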
Note: outside of Windows the problem is not well researched or known, or maybe forgotten by now, because Unixes conveniently standardized on UTF-8 in the last decades. See both the answer and the comments here.
--default-charset will (for the first time!) show characters correctly on the console, even without chcp.
Note: there is only one older Windows version we have to support, Windows Server 2019, but it is currently the most used server edition and will stick around for a long time. It is capable of using Unicode on the console, but CP_UTF8 can't be made the process's active ANSI code page, so neither main() command line arguments nor CreateFileA (or other ANSI APIs) can be made UTF-8 without major effort.
- --default-charset in the client, when used, will be handled by setting the console input and output code pages according to the charset.
- Illustration of command line glitches as of today: passing non-ASCII via the command line
Explanation of the mojibake: the command line argument is passed in the ANSI charset (codepage 1252), while the console output uses the OEM charset (codepage 850). This is how Ä becomes misinterpreted as Σ. However, characters that are typed, rather than passed via parameters, are interpreted correctly. This does not affect only the command line client; even something as simple as echo.exe is affected.
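The mismatch can be reproduced without a Windows console by encoding in one code page and decoding in another. A minimal Python sketch; `misread` is a hypothetical helper, and the lowercase ä and the two OEM code pages tried here are illustrative choices (the glyph that appears depends on which OEM code page the console uses):

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text in one code page and decode the same bytes in another."""
    return text.encode(written_as).decode(read_as)


# Byte 0xE4 is "ä" in code page 1252 but a different glyph in OEM code pages.
print(misread("ä", "cp1252", "cp437"))  # Σ
print(misread("ä", "cp1252", "cp850"))  # õ

# ASCII is identical in all three, so only non-ASCII characters break.
print(misread("abc", "cp1252", "cp850"))  # abc
```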
- The MySQL client can do Unicode, but in a rather awkward manner, based on my original patch from 12 years ago.
MySQL "fixed" the patch so that UTF-8 is not enabled by default, and one also needs to pass --default-character-set=utf8 on the command line for the magic to work. The patch itself is large and complex, partly because the C runtime did not handle UTF-8 well, UTF-8 was not a valid code page to pass to SetConsoleCP(), and so on. So console IO was detected, and the fputc/fputs family of functions was replaced with the win_console_fputs, win_console__fputc etc. family of functions, which all end up using WriteConsoleW. That was necessary back then, yet the C runtime and Windows have come a long way since in supporting UTF-8.
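The dispatch described above can be pictured roughly as follows. This is a Python sketch of the idea only, not the MySQL code: `encode_for_output` is a hypothetical name, code page 850 is an assumed client charset, and UTF-16LE bytes stand in for the wide-character buffer that WriteConsoleW takes.

```python
import sys


def encode_for_output(text: str, is_console: bool, charset: str = "cp850") -> bytes:
    """Pick the byte representation depending on where the output goes."""
    if is_console:
        # Console path: wide characters, independent of any code page.
        return text.encode("utf-16-le")
    # Redirected path (file or pipe): bytes in the client charset,
    # with unrepresentable characters replaced by "?".
    return text.encode(charset, errors="replace")


# isatty() plays the role of the console detection here.
is_console = sys.stdout.isatty()
data = encode_for_output("Ä", is_console)
print(len(data), "bytes on the", "console" if is_console else "redirected", "path")
```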