Setup environment

call "C:\Program Files (x86)\Microsoft Visual Studio\2019\Enterprise\VC\Auxiliary\Build\vcvars64.bat"

source-charset and execution-charset

The source-charset is the encoding the compiler uses to interpret source files and convert them into its internal representation. Specifically, for narrow string literals in the source files, the compiler uses UTF-8 (why not UTF-16?) encoded strings as the internal representation; these strings are then converted to the execution-charset and stored in the compiled object files.

To sum up, the compiler converts narrow string literals in source files from source-charset to Unicode and then to execution-charset, and finally stores the results in the compiled binaries. source-charset must match the encoding in which the source files are stored on disk. execution-charset is the encoding of const char[] data in memory when the program runs. The two settings are independent of each other. If a character in the source file cannot be represented in the execution character set, the Unicode conversion substitutes a question mark '?' character; see the /validate-charset option.
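One way to check which execution-charset a build actually used is to dump the bytes stored for a narrow literal. The following is a minimal sketch; hex_bytes and literal_bytes are hypothetical helpers introduced for illustration and are not part of this repository.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format the first n bytes of s as upper-case hex,
// separated by single spaces.
std::string hex_bytes(const char* s, std::size_t n)
{
    std::string out;
    char buf[4];
    for (std::size_t i = 0; i < n; ++i)
    {
        std::snprintf(buf, sizeof buf, "%02X",
                      static_cast<unsigned>(static_cast<unsigned char>(s[i])));
        if (i != 0) out += ' ';
        out += buf;
    }
    return out;
}

// The bytes stored for this literal depend on the execution-charset that was
// in effect when this translation unit was compiled.
static const char kEAcute[] = "\u00E9";

// Hypothetical helper: hex dump of the literal above, without the trailing NUL.
std::string literal_bytes()
{
    return hex_bytes(kEAcute, sizeof kEAcute - 1);
}
```

With a UTF-8 execution-charset the literal should be stored as the two bytes C3 A9, and with a Windows-1252 execution-charset as the single byte E9. If the execution-charset cannot represent the character, the compiler stores the '?' substitute instead.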

By default, execution-charset is the Windows code page, a.k.a. the ANSI code page (ACP), unless you specify a character set name or code page with the /execution-charset option. For source-charset, if no /source-charset option is specified, the compiler looks for a BOM to determine whether a source file is in an encoded Unicode format, for example UTF-16 or UTF-8. If no BOM is found, it assumes the source file is encoded in the ACP.

The test source file test\execution_charset.c is encoded as Windows-1252, which cannot be auto-detected, and it contains bytes that are invalid in the ACP (936 in the output below). Without /source-charset, the compiler treats the Windows-1252 bytes as ACP text when converting to Unicode and reports C4819 for the bytes that are invalid in the ACP.

cl /c test\execution_charset.c

warning C4819: The file contains a character that cannot be represented in the current code page (936). Save the file in
Unicode format to prevent data loss.

Once the compiler is told the real encoding of the source file, the Unicode-to-ACP conversion is performed, and the compiler reports C4566 for the Unicode characters that the ACP cannot represent, substituting a question mark '?' for each of them.

cl /c /source-charset:.1252 test\execution_charset.c

warning C4566: character represented by universal-character-name '\u00FF' cannot be represented in the current code page (936).

Encoding of Windows Console

Windows Console (conhost.exe) is a Win32 GUI app that consists of:

  • InputBuffer: Stores keyboard and mouse event records generated by user input.
  • OutputBuffer: Stores the text rendered on the Console's window client area.

OutputBuffer was essentially a 2D array of CHAR_INFO structs, each holding one cell's character data and attributes. That meant only UCS-2 text was supported. Since the Windows 10 October 2018 Update (Version 1809, Build Number 10.0.17763), a new OutputBuffer has been in place that fully supports all Unicode characters.
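The UCS-2 limit follows from the cell layout: a CHAR_INFO cell holds a single 16-bit WCHAR, while any code point above U+FFFF needs two UTF-16 code units (a surrogate pair). The sketch below illustrates this; encode_utf16 and fits_in_one_cell are hypothetical helpers, and the input is assumed to be a valid Unicode scalar value.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16.
// Assumes cp is a valid scalar value (<= U+10FFFF and not a surrogate).
std::u16string encode_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: one 16-bit code unit.
        out += static_cast<char16_t>(cp);
    }
    else
    {
        // Supplementary planes: split the 20 remaining bits into a
        // high surrogate and a low surrogate.
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 + (cp >> 10));
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
    }
    return out;
}

// Hypothetical helper: true if the code point fits in one 16-bit cell,
// i.e. it is representable in UCS-2.
bool fits_in_one_cell(char32_t cp)
{
    return encode_utf16(cp).size() == 1;
}
```

For example, U+4E2D fits in one cell, whereas U+1F600 encodes as the pair D83D DE00 and therefore cannot be stored in a single legacy cell.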

Another issue is that the Console uses GDI for text rendering, which doesn't support font fallback, so some complex glyphs can't be displayed even if the OutputBuffer can store them. ConPTY was introduced together with the new OutputBuffer. With it, the Console becomes a true "Console Host": it is windowless, is not responsible for user input or rendering, and serves command-line apps and/or GUI apps that communicate with command-line apps through Console Virtual Terminal Sequences. A terminal (TTY) is a typical example of such a GUI app, responsible for user input and rendering. Built on the ConPTY infrastructure, Windows Terminal uses a new rendering engine that supports font fallback and displays all the test characters correctly.

Command-line apps use WriteConsoleW to write Unicode text to the OutputBuffer and ReadConsoleW to read Unicode text from the InputBuffer. WriteConsoleA/WriteFile can also be used for output, but that involves an encoding conversion from ConsoleOutputCP (defaults to OEMCP) to Unicode before the text is stored in the OutputBuffer. Likewise, using ReadConsoleA/ReadFile for input performs a conversion from Unicode to ConsoleInputCP (also defaults to OEMCP). Note that ConsoleInputCP only supports DBCS; see ms-terminal/src/host/dbcs.cpp#TranslateUnicodeToOem.

The builtin command type of the "Command Prompt" shell (cmd.exe) checks the start of a file for a UTF-16LE BOM. If it finds one, it displays the file content using WriteConsoleW; otherwise it uses WriteConsoleA/WriteFile. So type displays files correctly only if they are UTF-16LE with a BOM or encoded in the current ConsoleOutputCP. In PowerShell, type detects BOMs for both UTF-16 and UTF-8. To verify this, just run type words\word-*.txt in Cmd and PowerShell.
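The BOM check described above can be sketched as follows. This is an illustrative model rather than the actual cmd.exe or PowerShell code; the Bom enum and both functions are hypothetical names, and a UTF-32LE BOM (which begins with the same two bytes as UTF-16LE) is deliberately not distinguished.

```cpp
#include <cstddef>

enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Hypothetical helper: classify the byte order mark at the start of a buffer.
Bom detect_bom(const unsigned char* data, std::size_t size)
{
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Bom::Utf8;      // EF BB BF
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Bom::Utf16LE;   // FF FE
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Bom::Utf16BE;   // FE FF
    return Bom::None;
}

// Model of the behaviour described for cmd.exe's type: only a UTF-16LE BOM
// selects the wide (WriteConsoleW) output path.
bool cmd_type_uses_wide_output(const unsigned char* data, std::size_t size)
{
    return detect_bom(data, size) == Bom::Utf16LE;
}
```

Under this model a UTF-8 file with a BOM still takes the narrow path in cmd.exe, which is why it only displays correctly there when ConsoleOutputCP is UTF-8.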

UCRT and UTF-8

UCRT is the Windows counterpart of the GNU C Library (glibc). It provides C99 and POSIX functionality plus some extensions, and has shipped since Visual Studio 2015. Some POSIX functions have historically used the ACP for narrow->wide conversions. To support UTF-8, a utf8 locale has been implemented in ucrt/locale/get_qualified_locale.cpp since UCRT 10.0.17134.0, and those functions have been modified so that they use CP_UTF8 when the current locale is utf8, and the ACP otherwise in order to preserve backwards compatibility. These POSIX functions call ucrt/inc/corecrt_internal_win32_buffer.h#__acrt_get_utf8_acp_compatibility_codepage to obtain the code page they should use for their conversions. One example is fopen: it converts the narrow path to a wide path using that code page and then delegates to the wide version of ucrt/lowio/open.cpp#_sopen_nolock. In addition, the narrow string representation of std::filesystem::path is encoded in that same code page.

The I/O flow path in the UCRT is

C++ I/O -> C I/O -> POSIX I/O  -> Win32 File/Console I/O
filebuf -> FILE* -> read/write -> ReadFile/WriteFile/ReadConsoleW/WriteConsoleW

[w]cin/f[w]scanf/fget[w]s -> fget[w]c
[w]cout/f[w]printf/fput[w]s -> fput[w]c

fgetwc -> fgetc (*2, compose for _O_U16TEXT and _O_BINARY, mbtowc(DBCS) for _O_TEXT) -> fread -> read
fputwc -> (wctomb -> fputc, for _O_TEXT) -> fwrite -> write

The details of read with different mode:

  • _O_BINARY or _O_TEXT: ReadFile
  • _O_U8TEXT: ReadFile -> UTF-8 -> UTF-16
  • File _O_U16TEXT: ReadFile
  • Console _O_U16TEXT: ReadConsoleW

The details of write with different mode:

  • _O_BINARY: WriteFile
  • File _O_U8TEXT: UTF-16 -> UTF-8 -> WriteFile
  • File otherwise: WriteFile
  • Console Unicode: WriteConsoleW for each wchar, so only supports UCS-2
  • Console _O_TEXT with LC_CTYPE:
    • C: WriteFile
    • utf8: UTF-8 -> UTF-16 -> ConsoleOutputCP (ConsoleInputCP before UCRT 10.0.19041.0) -> WriteFile
    • otherwise: DBCS (mbtowc) -> UTF-16 -> ConsoleOutputCP (ConsoleInputCP before UCRT 10.0.19041.0) -> WriteFile

Win32 direct console I/O and C wide I/O are always available for Unicode console I/O. Since UCRT 10.0.17763.0, the print functions treat text as UTF-8 encoded if the locale is set to utf8; the changes are in ucrt/lowio/write.cpp#write_double_translated_ansi_nolock. The translation to ConsoleInputCP is strange: I think it should be ConsoleOutputCP (this bug is fixed in UCRT 10.0.19041.0), and the double translation is unnecessary. UCRT should be reworked to call WriteConsoleW after translating to UTF-16, so that no code page is involved: ANSI (including UTF-8) -> UTF-16 -> WriteConsoleW.

ReadConsoleA/ReadFile return ANSI characters encoded in ConsoleInputCP, but SetConsoleCP(CP_UTF8) doesn't work because ConsoleInputCP only supports DBCS. There are two workarounds for UTF-8 console input: delegating to wide input, or doing a ConsoleInputCP -> UTF-16 -> UTF-8 conversion. The example test\utf8_io.cpp illustrates both. UCRT should implement input as the reverse of the reworked output, i.e. ReadConsoleW -> UTF-16 -> ANSI (including UTF-8).

UCRT and MinGW

Since May 2021, UCRT64 for gcc toolchain and CLANG64 for clang toolchain are available as MSYS2 environments. They link against UCRT instead of MSVCRT.

Encoding of argv and envp

Windows uses UTF-16 internally, so command-line arguments and environment variables are all UTF-16. The Visual C++ compiler provides a Unicode version of the C/C++ program entry point, named wmain. For the ANSI version, main, both argv (an array of null-terminated strings representing the command-line arguments entered by the user of the program) and envp (an array of key=value formatted null-terminated strings representing a "frozen" copy of the variables set in the user's environment at program startup) are encoded in the ACP (converted from Unicode). So even a simple C program that uses printf to echo its command-line arguments can print the wrong characters, since usually ACP != OEMCP. For example, on English-language Windows the ACP is 1252 while the OEMCP is 437, and the byte 0xF7 is "÷" in 1252 but "≈" in 437.

To get UTF-8 encoded argv, simply link wmain into the final executable; see the example test\echo.c.
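test\echo.c is not reproduced here. The core of a wmain-based approach is presumably converting each UTF-16 argument to UTF-8 before the rest of the program sees it. The sketch below shows that conversion step in portable C++ as a stand-in for WideCharToMultiByte with CP_UTF8; utf16_to_utf8 is a hypothetical helper, and replacing unpaired surrogates with U+FFFD is a choice made for this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-16 to UTF-8.
// Unpaired surrogates are replaced with U+FFFD (an assumption of this sketch).
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF)
        {
            // Combine a surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        else if (cp >= 0xD800 && cp <= 0xDFFF)
        {
            cp = 0xFFFD;   // lone surrogate
        }

        if (cp < 0x80)
        {
            out += static_cast<char>(cp);
        }
        else if (cp < 0x800)
        {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else if (cp < 0x10000)
        {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        else
        {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

In a Windows build, a wmain could apply such a conversion (or WideCharToMultiByte) to every element of its wide argv and then hand the resulting UTF-8 strings to the program's narrow entry logic.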

Process code page

As of the Windows 10 May 2019 Update (Version 1903, Build Number 10.0.18362), the active code page can be set per process in the application manifest. With a UTF-8 process code page, the command-line arguments and the ANSI variants of Win32 APIs are all UTF-8 encoded, as test\win32_gui.cpp demonstrates. This model has the benefit of supporting existing code built with -A APIs without any code changes, but the program must still handle legacy code page detection and conversion as usual if it targets or runs on earlier Windows builds.

Conclusion for ALL-UTF8 on Windows

  • Use Visual Studio 2015 or later with UCRT 10.0.17763.0 or later.
  • Add /utf-8 to compile options to make all narrow string literals UTF-8.
  • Link to wmain to get UTF-8 encoded argv.
  • Call setlocale(LC_CTYPE, ".utf8") to support UTF-8 output and filenames (e.g. printf, fopen and std::filesystem::path).
  • Call SetConsoleCP(CP_UTF8) to work around the double-translation bug in console output before UCRT 10.0.19041.0. This is not needed in the C locale.
  • Skip the three items above if using a UTF-8 process code page.
  • SetConsoleOutputCP(CP_UTF8) to display characters correctly due to the encoding conversion from ConsoleOutputCP to Unicode in Windows Console.
  • Use wide console input. The typical structure of a Command-Line app is: input somewhere, output everywhere.
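    // Requires <windows.h>, <io.h>, <fcntl.h>, <cstdio>, <iostream>, <string>
    HANDLE hConsoleInput = GetStdHandle(STD_INPUT_HANDLE);
    DWORD mode;
    if (GetConsoleMode(hConsoleInput, &mode))
    {
        // stdin is a console: read UTF-16 lines and convert each one to UTF-8
        _setmode(_fileno(stdin), _O_U16TEXT);
        std::wstring ws;
        std::string s;    // a UTF-8 string
        while (std::getline(std::wcin, ws))
        {
            int wlen = static_cast<int>(ws.size());
            // the first call only computes the required size (0 for an empty line)
            int len = WideCharToMultiByte(CP_UTF8, 0, ws.data(), wlen, NULL, 0, NULL, NULL);
            s.resize(len);
            if (len > 0)
                WideCharToMultiByte(CP_UTF8, 0, ws.data(), wlen, &s[0], len, NULL, NULL);
            // process(s);
        }
    }
    else
    {
        // stdin is redirected (file or pipe): the bytes are taken as UTF-8 as-is
        std::string s;    // a UTF-8 string
        while (std::getline(std::cin, s))
        {
            // process(s);
        }
    }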