This project is about testing UTF-8 decoders and validators. It provides test cases you can use to test your application, along with useful information about UTF-8 and pointers to more resources.
There is a reference implementation of a UTF-8 decoder.
There is a table at the bottom with test results for some common applications.
You can find the source code for the test cases in the utf8tests.txt file. This file gets compiled to utf8tests.bin and utf8tests.html, which you use for your testing.
- utf8tests.txt -- test cases; the source code for the next two files
- utf8tests.bin -- a binary file containing valid and invalid UTF-8 byte sequences
- utf8tests.html -- a file for visually testing HTML readers.
You test an application's UTF-8 reading and writing by sending the utf8tests.bin file through it, then checking the resulting artifact file:
              read          write                  read
utf8tests.bin ---> (app) ----> artifact file ----> (utf8tests app)
You can visually test the reading part with tests numbered 36.x.
               read
utf8tests.html ---> (app)
The test file groups the tests into these categories:
- Valid Characters
- Too Big Characters
- Overlong Characters
- Surrogate Characters
- Valid Noncharacters
- Miscellaneous Byte Sequences
- Visual Tests
- Null Characters
You can run the utf8tests.bin file through your decoder, then check its output against the expected output shown in the utf8tests.txt file. You can do this automatically by passing your artifact file to the utf8tests command line app.
We show how to do this using iconv as an example. The iconv command line application converts between different encodings. For this example you convert from UTF-8 to UTF-8.
Generate an Artifact:
You tell iconv to read the utf8tests.bin file and write it to an artifact file called utf8.skip.iconv.1.11.txt. The -c option tells iconv to skip invalid byte sequences.
iconv -c -f UTF-8 -t UTF-8 utf8tests.bin >artifacts/utf8.skip.iconv.1.11.txt
Note: you can find many artifact files from different applications in the artifacts folder.
Evaluate the Artifact:
You then pass the artifact file you created to the utf8tests application and it runs all the test cases. The results go to the screen. In this example we pipe the output to less:
bin/utf8tests -e=utf8tests.txt --skip=artifacts/utf8.skip.iconv.1.11.txt | less
My version of iconv allows characters bigger than the maximum and allows surrogates. Here is a sample of the output:
6.0: invalid test case: f7 bf bf bf
6.0: expected: nothing
6.0: got: f7 bf bf bf
6.0.1: invalid test case: f4 90 80 80
6.0.1: expected: nothing
6.0.1: got: f4 90 80 80
6.1: invalid test case: f8 88 80 80 80
6.1: expected: nothing
6.1: got: f8 88 80 80 80
6.2: invalid test case: f7 bf bf bf bf
6.2: expected: nothing
6.2: got: f7 bf bf bf
...
The utf8tests.txt file has comments describing what each test does. Below is the 6.0 test and its comment. The 6.0 test expects the invalid byte sequence F7 BF BF BF to produce nothing in the artifact file when you are skipping. The utf8tests.txt file describes the test line format in detail.
# too big U+001FFFFF, <F7 BF BF BF>
6.0:invalid hex:F7 BF BF BF:nothing:EFBFBD EFBFBD EFBFBD EFBFBD
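If it helps to see the fields spelled out, here is a rough Nim sketch that splits the 6.0 line above on colons. It reflects only this single example line; utf8tests.txt is the authority on the format.

```nim
import std/strutils

# Rough sketch: split the 6.0 example line into its fields.
let line = "6.0:invalid hex:F7 BF BF BF:nothing:EFBFBD EFBFBD EFBFBD EFBFBD"
let fields = line.split(':')
echo fields[0]   # test number: 6.0
echo fields[1]   # test type: invalid hex
echo fields[2]   # input byte sequence: F7 BF BF BF
echo fields[3]   # expected artifact bytes when skipping: nothing
echo fields[4]   # expected artifact bytes when replacing
```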
My version of iconv is very old, but it is the current version on an up-to-date Mac.
For applications that read UTF-8 but don't write it, you can test them with the visual tests numbered 36.x.
The visual tests check:
- 36.1: The replacement character.
- 36.2: Invalid ff byte.
- 36.3: Invalid two byte sequence <e0 80>.
- 36.4: Invalid three byte sequence <f0 80 80>.
- 36.5: Invalid three byte sequence followed by an invalid byte <f8 80 80 80>.
- 36.6: Invalid two byte sequence <e0 80> twice.
- 36.7: Too big U+001FFFFF.
- 36.8: Surrogate U+D800 <ed a0 80>.
- 36.10: Overlong solidus <e0 80 af>.
- 36.9: Valid noncharacter U+FFFF.
- 36.9.1: Valid noncharacter U+FFFE.
This is what is expected when the reader replaces invalid byte sequences with the replacement character:
36.1:valid:replacement character=�=�.
36.2:invalid:�=�.
36.3:invalid:��=��.
36.4:invalid:���=���.
36.5:invalid:����=����.
36.6:invalid:����=����.
36.7:invalid:����=����.
36.8:invalid:���=���.
36.10:invalid:���=���.
36.9:valid:�=�.
36.9.1:valid:�=�.
I ported Bjoern Hoehrmann's decoder to Nim for a reference implementation. You can find it in the utf8decoder.nim file in the src folder. See Bjoern's web page for the original decoder.
I created a state diagram for UTF-8 with an error state, inspired by the diagram created by Bjoern Hoehrmann. You can find the same information in Table 3-7 of the Unicode specification.
I found an editor for creating state diagrams that allows you to save the diagram as an SVG file. Once you have an SVG file you can tweak it in a text editor or in Inkscape.
- Finite State Machine Editor - simple on-line finite state machine editor
- Unicode 14.0 -- See Table 3-7. Well-Formed UTF-8 Byte Sequences
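As an illustration, here is a small Nim sketch of a validator written straight from the Table 3-7 ranges rather than from the state machine. The validUtf8Sequence and between procs are hypothetical helpers for this readme, not part of the project's source; the real reference implementation is in utf8decoder.nim.

```nim
proc between(b, lo, hi: uint8): bool =
  b >= lo and b <= hi

proc validUtf8Sequence(bytes: openArray[uint8]): bool =
  ## Return true when bytes holds exactly one well-formed UTF-8 byte
  ## sequence according to the Table 3-7 ranges.
  if bytes.len == 1:
    result = bytes[0] <= 0x7F
  elif bytes.len == 2:
    result = between(bytes[0], 0xC2, 0xDF) and between(bytes[1], 0x80, 0xBF)
  elif bytes.len == 3:
    let secondOk =
      if bytes[0] == 0xE0: between(bytes[1], 0xA0, 0xBF)
      elif between(bytes[0], 0xE1, 0xEC) or between(bytes[0], 0xEE, 0xEF):
        between(bytes[1], 0x80, 0xBF)
      elif bytes[0] == 0xED: between(bytes[1], 0x80, 0x9F)  # no surrogates
      else: false
    result = secondOk and between(bytes[2], 0x80, 0xBF)
  elif bytes.len == 4:
    let secondOk =
      if bytes[0] == 0xF0: between(bytes[1], 0x90, 0xBF)
      elif between(bytes[0], 0xF1, 0xF3): between(bytes[1], 0x80, 0xBF)
      elif bytes[0] == 0xF4: between(bytes[1], 0x80, 0x8F)  # cap at U+10FFFF
      else: false
    result = secondOk and between(bytes[2], 0x80, 0xBF) and
        between(bytes[3], 0x80, 0xBF)
  else:
    result = false

assert validUtf8Sequence([0xEBu8, 0x8F, 0xB1])           # U+B3F1
assert not validUtf8Sequence([0xEDu8, 0xA0, 0x80])       # surrogate U+D800
assert not validUtf8Sequence([0xF7u8, 0xBF, 0xBF, 0xBF]) # too big
```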
Important UTF-8 facts for testing:
- Code points must be in the range U+0000 to U+10FFFF.
- A UTF-8 code point is encoded with 1 to 4 bytes.
- Some byte sequences are invalid.
- One byte UTF-8 characters are ASCII characters, 0 - 7f.
- The surrogate characters are not valid in UTF-8.
The high surrogates range from U+D800 to U+DBFF and the low surrogates range from U+DC00 to U+DFFF:
- U+D800 to U+DBFF -- high surrogates
- U+DC00 to U+DFFF -- low surrogates
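Here is a tiny Nim sketch of these range checks. The isSurrogate and isValidCodePoint procs are hypothetical helpers used only for illustration, not part of the project's source.

```nim
proc isSurrogate(cp: int): bool =
  ## UTF-16 surrogate code points, never valid in UTF-8.
  cp in 0xD800 .. 0xDFFF

proc isValidCodePoint(cp: int): bool =
  ## Code points that may appear in well-formed UTF-8.
  (cp in 0 .. 0x10FFFF) and not isSurrogate(cp)

assert isValidCodePoint(0xB3F1)
assert not isValidCodePoint(0xD800)     # high surrogate
assert not isValidCodePoint(0x1FFFFF)   # too big
```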
Below we show the bit patterns for 1 to 4 byte UTF-8 code point sequences. The one byte characters all have a high bit of 0. The Unicode values from U+0080 to U+07FF are encoded with two bytes, and so on.
0 - 7F: 0xxxxxxx
80 - 7FF: 110xxxxx 10xxxxxx
800 - FFFF: 1110xxxx 10xxxxxx 10xxxxxx
10000 - 10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
For example, to encode U+B3F1 by hand you use three bytes because it is in the range 800 - FFFF. The three byte pattern is:
1110xxxx 10xxxxxx 10xxxxxx
You convert U+B3F1 to binary:
B 3 F 1
1011 0011 1111 0001
Then you move the bits into the pattern replacing the x's. U+B3F1 is encoded as EB 8F B1.
1011 001111 110001
1110xxxx 10xxxxxx 10xxxxxx
--------------------------
11101011 10001111 10110001
EB 8F B1
Binary to hex table:
0000 0 1000 8
0001 1 1001 9
0010 2 1010 a
0011 3 1011 b
0100 4 1100 c
0101 5 1101 d
0110 6 1110 e
0111 7 1111 f
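Here is a short Nim sketch of the same hand-encoding steps. The cpToUtf8 proc is a hypothetical helper for illustration; it is not part of the project's source.

```nim
import std/strutils

proc cpToUtf8(cp: int): seq[uint8] =
  ## Encode a code point (0 .. 0x10FFFF, not a surrogate) using the
  ## bit patterns shown above.
  if cp <= 0x7F:
    result = @[uint8(cp)]
  elif cp <= 0x7FF:
    result = @[uint8(0xC0 or (cp shr 6)),
               uint8(0x80 or (cp and 0x3F))]
  elif cp <= 0xFFFF:
    result = @[uint8(0xE0 or (cp shr 12)),
               uint8(0x80 or ((cp shr 6) and 0x3F)),
               uint8(0x80 or (cp and 0x3F))]
  else:
    result = @[uint8(0xF0 or (cp shr 18)),
               uint8(0x80 or ((cp shr 12) and 0x3F)),
               uint8(0x80 or ((cp shr 6) and 0x3F)),
               uint8(0x80 or (cp and 0x3F))]

# U+B3F1 encodes as EB 8F B1.
var hexBytes: seq[string]
for b in cpToUtf8(0xB3F1):
  hexBytes.add(toHex(b))
echo hexBytes.join(" ")   # EB 8F B1
```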
The "noncharacters" in the Unicode specification are easy to misunderstand. The following chart shows that noncharacters are really characters. They are characters with special properties.
- B -- the full set of bit patterns for 32 bits, or four bytes. Visually it is every path through the finite state machine. C + not C = B.
- C -- defined by Table 3-7 and visually by the finite state machine paths that terminate at state 0. These are called code points or Unicode characters. Each code point has a hex value. The range is U+0000 to U+10FFFF.
- n -- the "noncharacters", a subset of C.
- not C -- the paths that terminate at state 1. These are the invalid byte sequences.
- Corrigendum9 -- Unicode: Corrigendum #9: Clarification About Noncharacters
When you are reading a UTF-8 encoded file you need to decide how to handle invalid byte sequences. You can replace them with the replacement character U+FFFD, <EF BF BD>, which exists for this purpose.
You have flexibility in how you do the replacement and still conform to the spec. However, it is recommended that you do it in a particular way. The test cases in this project test for the standard way, which the spec calls "Maximal Subpart".
The Unicode specification says:
An increasing number of implementations are adopting the handling of ill-formed subsequences as specified in the W3C standard for encoding to achieve consistent U+FFFD replacements.
See:
- Unicode 14.0 -- Unicode 14.0 Specification -- Conformance page 126, section 3.9.
- w3.org Encoding -- w3.org encoding
Wikipedia has a good explanation of UTF-8:
- Wikipedia UTF-8 -- UTF-8 is a variable-width character encoding used for electronic communication...
You can look up Unicode characters by code point:
- Unicode Charts -- Unicode 14.0 Character Code Charts
You can convert from a code point to its UTF-8 hex byte sequence and vice versa using an online app:
- Code Point <-> UTF-8 byte sequence -- Online UTF-8 converter
Many test cases in utf8tests.txt were inspired by Markus Kuhn's tests. See:
- Markus Kuhn UTF-8 Tests -- UTF-8 decoder capability and stress test
Daniel Lemire wrote a fast UTF-8 decoder and I enjoyed his paper.
I tested a few languages, editors, and browsers on my Mac. Each application was tested in one or two ways: skipping invalid byte sequences or replacing them with the replacement character. NA means the mode is not supported by the application.
Application | Skip | Replace |
---|---|---|
Chrome 97.0.4692.71 | NA | ✅ pass v |
Emacs 25.3.1 | 🛑 fail | NA |
Firefox 95.0.2 | NA | ✅ pass v |
Iconv 1.11 | 🛑 fail | NA |
Less 487 | NA | ✅ pass v |
Nano 2.0.6 | NA | 🛑 fail |
Nim 1.4.8 | 🛑 fail | NA |
Node js 17.2.0 | NA | ✅ pass |
Python 3.7.5 | ✅ pass | ✅ pass |
Perl 5.30.2 | NA | 🛑 fail |
Reference | ✅ pass | ✅ pass |
Safari 14.1.2 | NA | ✅ pass v |
Vim 8.2.2029 | NA | 🛑 fail |
- v - visually tested using 36.x tests
See the procedure page for the steps to reproduce the results shown in the table.
- Procedures — steps to reproduce the table results.
You can contribute to this project by writing a command line app in your favorite language and adding it to the writers directory.
The app reads a binary file, encodes the bytes as UTF-8, then writes them to a new file. It takes three command line parameters: the input file name, the output file name, and the word "skip" or "replace".
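Here is a minimal Nim sketch of such a writer app's command line skeleton, assuming the three-parameter interface described above. The sanitizeUtf8 proc is a placeholder for your language's decoder and is not part of this project.

```nim
import std/os

proc sanitizeUtf8(input: string, skipInvalid: bool): string =
  ## Placeholder: decode input, dropping invalid byte sequences when
  ## skipInvalid is true, otherwise replacing them with U+FFFD.
  discard  # implement with your language's UTF-8 decoder

proc main() =
  if paramCount() != 3 or paramStr(3) notin ["skip", "replace"]:
    quit("usage: writer inputFile outputFile skip|replace")
  let text = readFile(paramStr(1))              # read the binary test file
  let output = sanitizeUtf8(text, paramStr(3) == "skip")
  writeFile(paramStr(2), output)                # write the artifact file

main()
```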
Or you can run the utf8tests.bin file through your decoder and post the resulting file to the artifacts directory.
If you find a bug or something is wrong, please file an issue.
Requirements:
- git
- docker
Download the project using git clone.
mkdir ~/code/utf8tests
cd ~/code/utf8tests
git clone git@github.com:flenniken/utf8tests.git
Build a docker development environment. The first run builds the image and the second one creates the container and runs it.
./devenv run debian
./devenv run debian
Build the Nim compiler.
cd ~/Nim
./build_all.sh
nim -v
Build the utf8tests project:
cd ~/utf8tests
n b
n test
You can delete the development container and image with the delete command:
./devenv delete debian