Rewrite the low-level UTF-8 parser from scratch.
author schwarze <schwarze@openbsd.org>
Fri, 19 Dec 2014 04:57:11 +0000 (04:57 +0000)
committer schwarze <schwarze@openbsd.org>
Fri, 19 Dec 2014 04:57:11 +0000 (04:57 +0000)
commit 52a7f4662432db837ecf4c838b4be59349e5f106
tree 24e8b6acc4769d14ce05a77e8bfead8f04f14efd
parent 762cb5c93628166ca5d600c5ac208692566aceff
The old parser accepted invalid byte sequences such as 0xc080-c1bf,
0xe08080-e09fbf, 0xeda080-edbfbf, and 0xf0808080-f08fbfbf and produced
valid roff Unicode escape sequences from them, and the algorithm
contained strong defenses against any attempt to fix it.
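
The byte ranges listed above are overlong two-, three-, and four-byte
encodings plus the UTF-16 surrogate range. The following is a minimal
sketch of a decoder that rejects them. It is not the code committed to
preconv.c; the function name utf8_decode_strict and its interface are
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, NOT the function from preconv.c.
 * Decode one UTF-8 sequence from s (at most len bytes) into *cp.
 * Returns the number of bytes consumed, or 0 if the sequence is
 * invalid or truncated.
 */
size_t
utf8_decode_strict(const unsigned char *s, size_t len, unsigned int *cp)
{
	unsigned int	 accum, min;
	size_t		 need, i;

	assert(s != NULL && cp != NULL);
	if (len == 0)
		return 0;

	if (s[0] < 0x80) {			/* plain ASCII */
		*cp = s[0];
		return 1;
	} else if ((s[0] & 0xe0) == 0xc0) {	/* 110xxxxx */
		need = 2;
		min = 0x80;
		accum = s[0] & 0x1f;
	} else if ((s[0] & 0xf0) == 0xe0) {	/* 1110xxxx */
		need = 3;
		min = 0x800;
		accum = s[0] & 0x0f;
	} else if ((s[0] & 0xf8) == 0xf0) {	/* 11110xxx */
		need = 4;
		min = 0x10000;
		accum = s[0] & 0x07;
	} else
		return 0;	/* stray continuation byte or 0xf8-0xff */

	if (len < need)
		return 0;	/* truncated sequence */

	for (i = 1; i < need; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;	/* not a continuation byte */
		accum = (accum << 6) | (s[i] & 0x3f);
	}

	if (accum < min)
		return 0;	/* overlong, e.g. 0xc080-c1bf */
	if (accum >= 0xd800 && accum <= 0xdfff)
		return 0;	/* surrogate, 0xeda080-edbfbf */
	if (accum > 0x10ffff)
		return 0;	/* beyond the Unicode range */

	*cp = accum;
	return need;
}
```

Comparing the decoded value against the smallest code point that needs
a sequence of that length is one common way to catch every overlong
form with a single test per length.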

This cures an assertion failure in the terminal formatter, triggered
by sneaking in ASCII 0x08 (backspace) "encoded" as an (invalid)
multibyte UTF-8 sequence, found by jsg@ with afl.

As a bonus, the new algorithm also reduces the code in the function
by about 20%.
regress/usr.bin/mandoc/char/unicode/Makefile
regress/usr.bin/mandoc/char/unicode/input.in [new file with mode: 0644]
regress/usr.bin/mandoc/char/unicode/input.out_ascii [new file with mode: 0644]
regress/usr.bin/mandoc/char/unicode/input.out_lint [new file with mode: 0644]
regress/usr.bin/mandoc/char/unicode/input.out_utf8 [new file with mode: 0644]
usr.bin/mandoc/preconv.c