USING CJK CODED CHARACTER SETS IN RAW ENCODING: A COMMON SOURCE OF CONFUSION

Common Confusion With CJK Character Set Naming

Anton Tagunov
Moscow State University Scientific Computer Research Center
http://tagunov.tripod.com
tagunov@motor.ru

Abstract

This is a technical note on using "raw" encodings for ISO 2022 compliant _coded character sets_ (especially CJK). It points out that these _coded character sets_ may be used on their own, outside the ISO 2022 or EUC frameworks. Perl is used as an example.

-----------------------------------

The author would highly value every kind of feedback on this article. Errors and inaccuracies are of particular interest, even the smallest (a pedantic reply would be more than welcome :)

A special area of interest is the usage of the terms _character set_, _coded character set_ and encoding. The author feels that his usage of these terms can probably be improved, and any remarks in this area are welcome.

The author of this small note, being a postgraduate student, would very much value publication, in any kind of on-line (or not on-line :-) periodical, of this note and/or the forthcoming article "PROCESSING CHARACTER INFORMATION CODED IN MULTIPLE CODED CHARACTER SETS. (CONSISTENT SIMPLE STRING COLLATION, MATCHING TO CHARACTER CLASSES, MERGING DATA. PERL AS A PROTOTYPE)" (see the P.S. of this message for an abstract). Help with this is requested, and every kind of advice on possible publication is eagerly accepted. Any kind of rework of this and the next article is possible.

-----------------------------------

There is a certain confusion around the JIS X 0201/0208/0212, GB 1988/2312, GB/T 18030 (and similar) standards. It is well known that all of them provide 94 or 94x94 _coded character sets_ usable within the ISO 2022 and EUC frameworks.
ISO-2022-JP, ISO-2022-CN, ISO-2022-KR, EUC-JP, EUC-CN, EUC-TW, Shift_JIS (and other) coded character sets make use of them.

But this is not the only use of these standards. In practice they also form "raw" encodings. RFC 1345 describes a number of (mostly ISO registered) coded character sets, including

  JIS_C6220-1969-jp = iso-ir-13  = katakana = x0201-7       (7-bit)
  JIS_C6220-1969-ro = iso-ir-14  = jp = ISO646-JP           (7-bit)
  GB_1988-80        = iso-ir-57  = cn = ISO646-CN           (7-bit)
  JIS_C6226-1983    = iso-ir-87  = JIS_X0208-1983 = x0208   (2 x 7-bit)
  JIS_X0212-1990    = iso-ir-159 = x0212                    (2 x 7-bit)
  GB_2312-80        = iso-ir-58  = chinese                  (2 x 7-bit)
  JIS_X0201         = X0201                                 (8-bit)

All of these are JIS and GB standards used "raw": they correspond to 94 or 94x94 _character sets_ plugged into the GL (0x20-0x7F) region. JIS_X0201 is special: it plugs two 94-character sets, one into GL and one into GR (0xA0-0xFF).

GB/T 18030 has no "raw" coded character set in this RFC, but Perl 5.8.0, for instance, provides translation to/from a coded character set named "GB 18030". The latter seems to be a 7-bit, two bytes per character encoding that plugs GB/T 18030 into GL.

My assumption is that whenever we meet the name of a 94 or 94x94 _character set_ (pluggable into ISO 2022/EUC) used as the name of an encoding, we should understand that the _character set_ is in its "raw" encoding, plugged into GL (0x20-0x7F).

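
As a minimal sketch of what this "raw" reading means in practice, the following Python fragment decodes and encodes raw GB_2312-80 data (two 7-bit bytes per character, plugged into GL). Python is used here only because its standard "gb2312" codec is readily available; that codec is really EUC-CN, i.e. GB 2312 plugged into GR, so the raw form is obtained by clearing the high bit of each byte. The helper names raw_gb2312_decode and raw_gb2312_encode are hypothetical, and the sketch assumes that the data holds only two-byte characters, with no ASCII or control characters mixed in.

```python
def raw_gb2312_decode(data: bytes) -> str:
    """Decode raw GB_2312-80 (the 94x94 set plugged into GL) to text.

    Assumption: the input is a pure sequence of two-byte characters,
    each byte in 0x21-0x7E; no ASCII or control characters are mixed in.
    """
    if len(data) % 2 != 0:
        raise ValueError("raw GB 2312 data must have an even number of bytes")
    if any(not 0x21 <= b <= 0x7E for b in data):
        raise ValueError("byte outside the GL range 0x21-0x7E")
    # Setting the high bit moves every byte from GL to GR, which yields EUC-CN.
    euc_cn = bytes(b | 0x80 for b in data)
    return euc_cn.decode("gb2312")


def raw_gb2312_encode(text: str) -> bytes:
    """Encode text as raw GB_2312-80; rejects characters outside the set."""
    euc_cn = text.encode("gb2312")
    if any(b < 0xA1 for b in euc_cn):
        raise ValueError("text contains characters outside GB 2312 (e.g. ASCII)")
    # Clearing the high bit moves every byte from GR back to GL.
    return bytes(b & 0x7F for b in euc_cn)


if __name__ == "__main__":
    raw = raw_gb2312_encode("中文")
    print(raw)                     # the raw bytes look like plain ASCII text
    print(raw_gb2312_decode(raw))  # and decode back to the original string
```

Note that raw data of this kind is indistinguishable from ASCII text by inspection of the bytes alone, which is one reason the encoding name has to be known in advance.
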
I do not know how widely such "raw" encodings for _character sets_ are used, but at least there exist fonts for the X Window System that seem to be designed to display data in such encodings:

  -cc-song-medium-r-normal-jiantizi-40-400-75-75-c-400-gb2312.1980-0
  -cc-song-medium-r-normal-jiantizi-48-480-75-75-c-480-gb2312.1980-0
  -isas-fangsong ti-medium-r-normal--16-160-72-72-c-160-gb2312.1980-0
  -isas-song ti-medium-r-normal--16-160-72-72-c-160-gb2312.1980-0
  -isas-song ti-medium-r-normal--24-240-72-72-c-240-gb2312.1980-0

And I see no reason to call these "raw" encodings for _character sets_ a wrong way to store data: after all, this is efficient.

= Terminology

The term _(coded) character set_ in this note follows [ECMA 35], a freely accessible analog of ISO 2022:

  "A set of unambiguous rules that establishes a character set and one-to-one relationship between characters of the set and their bit representation."

This is exactly a rewording of the [RFC 1345] definition of the same term.

It should be noted, however, that while all ISO 2022 compliant coded character sets are defined as having 0x20-0x7F code points only, they may be shifted, as needed, to the 0xA0-0xFF range. This is what possibly happens when these coded character sets are used within the ISO 2022 and EUC frameworks. In the raw encodings discussed here, however, all these character sets occupy their natural 0x20-0x7F range (or, almost always, only 0x21-0x7E, being 94-character character sets).

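
The shift between GL and GR described above can be observed directly. The following Python sketch (Python is chosen only for its readily available standard codecs; the function name three_forms is hypothetical) takes one JIS X 0208 character and returns its bytes in EUC-JP (plugged into GR), in the raw form (plugged into GL, derived here by clearing the high bit of the EUC-JP bytes), and in ISO-2022-JP (GL bytes wrapped in escape sequences). It assumes the character belongs to JIS X 0208 and rejects anything else.

```python
def three_forms(ch: str) -> dict:
    """Return the EUC-JP, raw and ISO-2022-JP bytes of one JIS X 0208 character."""
    euc = ch.encode("euc_jp")
    # In EUC-JP a JIS X 0208 character is exactly two bytes, both in 0xA1-0xFE.
    # ASCII (one byte), half-width katakana (0x8E prefix) and JIS X 0212
    # (0x8F prefix, three bytes) all fail this check.
    if len(euc) != 2 or any(not 0xA1 <= b <= 0xFE for b in euc):
        raise ValueError("not a single JIS X 0208 character")
    # Clearing the high bit moves the two bytes from GR to GL: the raw form.
    raw = bytes(b & 0x7F for b in euc)
    # ISO-2022-JP carries the same GL bytes between escape sequences.
    iso = ch.encode("iso2022_jp")
    return {"euc_jp": euc, "raw": raw, "iso2022_jp": iso}


if __name__ == "__main__":
    for name, data in three_forms("あ").items():
        print(name, data.hex())
```

The raw bytes appear unchanged inside the ISO-2022-JP output; only the surrounding escape sequences tell a reader of the latter which character set is in effect.
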
The term 'encoding' is used for the same thing as _character set_, but it is understood that an encoding is somehow prepared from 'raw' coded character sets, possibly by one of the following recipes. If we have an ISO 2022 conforming 94-character character set, we may cook from it

- a raw encoding, by plugging the character set into the GL (0x21-0x7E) region
- an ISO 2022 family encoding, by possibly mixing this _character set_ with other ISO 2022 conforming _character sets_, possibly shifting them to the 0xA1-0xFE region (plugging into GR), and adding ESC codes for designation and invocation as is done in ISO 2022
- an EUC family encoding, by possibly mixing it with another _character set_ and plugging one of them into GL and the other into GR

The main point of this note is that raw encodings exist and may be used, for example, for input/output from/to Perl 5.8 programs.

I use the term "plug" a _character set_ into GL (GR) to denote that octets in the 0x20-0x7F (0xA0-0xFF) range are mapped to the characters of the corresponding _character set_. This is similar to what would happen if, in the ISO 2022 framework, we designated the _character set_ as G0 (G1) and then invoked G0 (G1) into GL (GR).

= Perl naming troubles

Although the forthcoming Perl 5.8 ships with an Encode module that supports conversion between the internal Unicode representation and all the "raw" encodings mentioned in this note, the names expected for these encodings in the current development releases of Perl are open to criticism.
Please consult the following table:

  IANA:     JIS_X0201  X0201
  RFC 1345: JIS_X0201  X0201
  Perl:     JIS 0201

  IANA:     JIS_C6226-1983  JIS_X0208-1983  x0208  iso-ir-87
  RFC 1345: JIS_C6226-1983  JIS_X0208-1983  x0208  iso-ir-87
  Perl:     JIS 0208

  IANA:     JIS_X0212-1990  x0212  iso-ir-159
  RFC 1345: JIS_X0212-1990  x0212  iso-ir-159
  Perl:     JIS 0212

  IANA:     GB_1988-80  cn  iso-ir-57  (7-bit)
  RFC 1345: GB_1988-80  cn  iso-ir-57  ISO646-CN  (7-bit)
  Perl:     GB 1988  (8-bit; also includes JIS_C6220-1969-jp == katakana at 0xA1-0xFE; is this okay???)

  IANA:     GB_2312-80  chinese  iso-ir-58
  RFC 1345: GB_2312-80  chinese  iso-ir-58
  Perl:     GB 2312

= Questions still vague to the author

RFC 1345, while specifying _coded character sets_ like JIS_X0208-1983 and GB_2312-80, makes a vague comment:

  If the coded character set is a 96-character set, it is tabled with the relevant GL set (normally ISO-IR-6) and with ISO 6429 as C0 and C1 (12). If it is a 94-character set, it is tabled with the C0 set of ISO 6429. If it is a double-octet coded character set, it is tabled without control character sets and accompanying one-octet coded character sets, and the two-octet code is tabled as a G0 set.

What does this say about the possibility of, and the rules for, encoding control characters (CR and LF, for instance) with two-byte 7-bit 94x94 _coded character sets_ when they are used in their raw encoding? Is it possible to encode the control characters at all? As single or double bytes? Readers' help in updating this subsection is more than welcome.

Thanks to

  Autrijus Tang
  Brian McGuirk
  Alex Potter

for helping to improve this document.

References

[ECMA 35] http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM
[RFC 1345] Character Mnemonics and Character Sets. K. Simonsen. June 1992. http://www.ietf.org/rfc/rfc1345.txt

P.S. Another article,

PROCESSING CHARACTER INFORMATION CODED IN MULTIPLE CODED CHARACTER SETS. (CONSISTENT SIMPLE STRING COLLATION, MATCHING TO CHARACTER CLASSES, MERGING DATA. PERL AS A PROTOTYPE)

Abstract

Software components sometimes have to cooperatively use and combine textual data coded with different character sets. Comparison for equality, simple lexicographical ordering, automatic and explicit conversions, and matching characters against character classes ([[:space:]], [[:alpha:]], ...) for textual data coded in different coded character sets are the questions discussed in this article.

is being prepared by the same author. He also hopes that possible criticism of this article will help to polish the wording of the next one.