artulab.com Git - openbsd/commit

author	schwarze <schwarze@openbsd.org>
	Tue, 14 May 2024 18:38:13 +0000 (18:38 +0000)
committer	schwarze <schwarze@openbsd.org>
	Tue, 14 May 2024 18:38:13 +0000 (18:38 +0000)
commit	702123639534b7d2ffe3eb692a2717af7205ae5c
tree	450b42066a3c2fbdfe1923ab1b2ad66555b813de
parent	806da1f162b9b0bfd5ac586d8ef838bbf9460798

The makewhatis(8) program already provided a "-T utf8" option
to put UTF-8 strings into the database, but that only worked
for input files containing the manually written, mnemonic roff(7)
character escape sequences documented in mandoc_char(7).
Even though mandoc(1), man(1), and man.cgi(8) have been able to
properly handle UTF-8 and ISO-Latin-1 encoded input files for many
years, makewhatis(8) unconditionally replaced all non-ASCII bytes
in all input files with ASCII question marks ("?").

Improve this by changing two aspects of non-ASCII character handling
in makewhatis(8) at the same time.

1. In the makewhatis(8) main program, when configuring the roff(7) parser,
enable UTF-8 and ISO-Latin-1 autorecognition and translation
to \[uXXXX] roff(7) Unicode character escape sequences.
The man(1) and man.cgi(8) programs show that this option has
worked reliably for many years, so the risk is minimal.

2. In the makewhatis(8) string rendering code, if "-T utf8" was
requested, translate these escape sequences to UTF-8 strings,
just as makewhatis(8) already did for ESCAPE_SPECIAL sequences.
Otherwise, i.e. if an ASCII-only database is desired, replace
all character escape sequences with ASCII transliterations, again
as was already done for ESCAPE_SPECIAL sequences.
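
To illustrate the rendering step in item 2, here is a minimal,
self-contained sketch. It is not the code from this commit: the
function names (hexval, parse_uescape, utf8_encode, ascii_translit,
render_escapes) and the two-entry transliteration table are
hypothetical stand-ins, and the real tables in mandoc are far larger.
It only shows the idea of turning \[uXXXX] escapes into either UTF-8
bytes or an ASCII replacement.

```c
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int
hexval(char c)
{
	if (c >= '0' && c <= '9')
		return c - '0';
	if (c >= 'A' && c <= 'F')
		return c - 'A' + 10;
	if (c >= 'a' && c <= 'f')
		return c - 'a' + 10;
	return -1;
}

/*
 * If s starts with \[u<1-6 hex digits>], store the code point in *cp
 * and return a pointer just past the closing bracket; else NULL.
 */
static const char *
parse_uescape(const char *s, unsigned long *cp)
{
	int n = 0, v;

	if (strncmp(s, "\\[u", 3) != 0)
		return NULL;
	s += 3;
	*cp = 0;
	while (n < 6 && (v = hexval(*s)) >= 0) {
		*cp = *cp * 16 + (unsigned long)v;
		s++;
		n++;
	}
	return (n == 0 || *s != ']') ? NULL : s + 1;
}

/* Encode cp as UTF-8; return bytes written, 0 if cp is not encodable. */
static size_t
utf8_encode(unsigned long cp, char *out)
{
	if (cp == 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
		return 0;
	if (cp < 0x80) {
		out[0] = (char)cp;
		return 1;
	}
	if (cp < 0x800) {
		out[0] = (char)(0xC0 | (cp >> 6));
		out[1] = (char)(0x80 | (cp & 0x3F));
		return 2;
	}
	if (cp < 0x10000) {
		out[0] = (char)(0xE0 | (cp >> 12));
		out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
		out[2] = (char)(0x80 | (cp & 0x3F));
		return 3;
	}
	out[0] = (char)(0xF0 | (cp >> 18));
	out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
	out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
	out[3] = (char)(0x80 | (cp & 0x3F));
	return 4;
}

/* Hypothetical ASCII fallback; the table is illustrative only. */
static size_t
ascii_translit(unsigned long cp, char *out)
{
	if (cp >= 0x20 && cp < 0x7F)
		out[0] = (char)cp;	/* printable ASCII: keep as is */
	else if (cp == 0x00E9)
		out[0] = 'e';		/* e with acute accent */
	else if (cp == 0x00FC)
		out[0] = 'u';		/* u with diaeresis */
	else
		out[0] = '?';		/* no transliteration known */
	return 1;
}

/*
 * Copy src to dst, rendering each \[uXXXX] escape as UTF-8 when utf8
 * is nonzero and as an ASCII replacement otherwise.  The output is
 * never longer than the input, so dst needs strlen(src) + 1 bytes.
 */
static void
render_escapes(const char *src, char *dst, int utf8)
{
	const char *next;
	unsigned long cp;
	size_t n;

	while (*src != '\0') {
		if ((next = parse_uescape(src, &cp)) == NULL) {
			*dst++ = *src++;
			continue;
		}
		n = utf8 ? utf8_encode(cp, dst) : 0;
		if (n == 0)
			n = ascii_translit(cp, dst);
		dst += n;
		src = next;
	}
	*dst = '\0';
}
```

In this sketch, render_escapes("caf\\[u00E9]", buf, 1) stores the
UTF-8 bytes for "café" in buf, while passing 0 as the last argument
stores the ASCII string "cafe". Text without escapes is copied
unchanged in both modes.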

With this change, giving UTF-8 command line arguments to apropos(1)
allows searching in UTF-8 and ISO-Latin-1 encoded manual pages if the
respective mandoc.db(5) has been built with makewhatis(8) -T utf8.

Issue found while investigating a question from
Valid-Amirali-Averiva at rambler dot ru, who is using mandoc
on FreeBSD to process documents containing Cyrillic letters.