CJK CHARACTER SET STANDARDS CLASSIFICATION. VERSION 0.1

Anton Tagunov
http://tagunov.tripod.com
tagunov@motor.ru

Moscow State University
Scientific Computer Research Center
http://srcc.msu.su

TABLE OF CONTENTS

1. CLASSIFYING CJK STANDARDS
   1.1 pure-CES
       1.1.1 MULTIBYTE
       1.1.2 SINGLE BYTE
       1.1.3 CP932, Windows-31J
   1.2 CCS-CES
       1.2.1 Big5
       1.2.2 CP950
       1.2.4 CP936, GBK
       1.2.5 CP949
       1.2.6 GB 18030-2000
       1.2.7 ISO646-* SINGLE-BYTES
   1.3 pure CCS?
   1.4 pure-CCS: CNS 11643-1992, JIS X 0213
   1.5 NAMING REFERENCE
Appendix A. REFERENCES

1. CLASSIFYING CJK STANDARDS

This is an extension to [SURVEY]. This companion classifies a number of CJK character set standards according to the classification scheme used in [SURVEY]
: CCS-CES, pure-CES, pure-CCS. The reasoning behind each classification is given.

1.1 pure-CES

1.1.1 MULTIBYTE

The following MIME charsets (3.2)

    ISO-2022-KR (reportedly unused)
    EUC-KR
    ISO-2022-JP
    ISO-2022-JP-1
    ISO-2022-JP-2
    Shift_JIS
    EUC-JP
    GB2312 (aka EUC-CN, CN-GB)
    EUC-TW
    ISO-2022-CN-EXT

are readily classified as CES's L<(Survey 2.1)|survey2.html#A2.1>, and moreover as pure-CES's L<(Survey Appendix D)|survey2.html#BD>.

1.1.2 SINGLE BYTE

JIS_X0201 [RFC 1345] may be viewed as an 8-bit CES that encodes

    JIS_C6220-1969-ro (ISO646-JP)
    JIS_C6220-1969-jp (katakana)

(both also from [RFC 1345]).

1.1.3 CP932, Windows-31J

Microsoft's interpretation of Shift_JIS. Referred to as Shift_JIS by Microsoft products. Refer to [IANA REG] for a detailed description. The [IANA REG] registration separately names the CCS's this CES is based upon.

1.2 CCS-CES

The following standards, like ISO-8859-* L<(Survey 4.1)|survey2.html#A4.1>, each define both a CCS and a CES.

1.2.1 Big5

1.2.2 CP950

Microsoft extension to Big5. Erroneously referred to as Big5 by Microsoft products.

1.2.4 CP936, GBK

Microsoft extension to GB2312. Erroneously referred to as GB2312 by Microsoft products. Backward compatible with GB2312. Extends GB2312 with unified Han characters from ISO 10646-1:1993 not already present in GB_2312-80. New characters are inserted into unused positions in the natural EUC-CN range 0xA1A1 - 0xFEFE and into the newly allocated range 0x8140 - 0xFE7E.

1.2.5 CP949

Microsoft extension to EUC-KR. Also known as Unified Hangul Code, UHC. Erroneously referred to as KS_C_5601-1987 by Microsoft products. Adds 8822 pre-combined Hangul syllables to EUC-KR. Uses the same extension technique as GBK.

1.2.6 GB 18030-2000

Chinese national standard. Extends GBK. Introduces 4-byte codes for characters. Provides space for all assigned and unassigned Unicode 3.2 (BMP plus 16 extension planes) code points.
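
The compatibility and extension claims made for GBK and GB 18030 above can be spot-checked with a small sketch. It assumes Python 3 and its bundled `gb2312`, `gbk` and `gb18030` codecs, which are CPython's own implementations and may differ in detail from the Microsoft code page or the printed standards. The sample characters and the helper name `in_euc_cn_range` are illustrative choices, not taken from any of the standards.

```python
# Sketch: compare Python's gb2312, gbk and gb18030 codecs.
# Assumption: CPython's bundled codecs stand in for the standards here.

def in_euc_cn_range(two_bytes: bytes) -> bool:
    """True if both bytes fall in the natural EUC-CN range 0xA1..0xFE."""
    return len(two_bytes) == 2 and all(0xA1 <= b <= 0xFE for b in two_bytes)

simplified = "\u4e2d"          # a Han character present in GB_2312-80
traditional = "\u9f8d"         # a Han character absent from GB_2312-80
supplementary = "\U00020000"   # a code point outside the BMP

# Backward compatibility: GB2312 text keeps the same bytes under GBK.
assert simplified.encode("gb2312") == simplified.encode("gbk")
assert in_euc_cn_range(simplified.encode("gbk"))

# Extension: the second character cannot be encoded in GB2312 ...
try:
    traditional.encode("gb2312")
    gb2312_has_it = True
except UnicodeEncodeError:
    gb2312_has_it = False

# ... but GBK encodes it, with a two-byte code outside the EUC-CN range.
gbk_bytes = traditional.encode("gbk")
outside = not in_euc_cn_range(gbk_bytes)

# GB 18030 keeps this GBK two-byte code and adds four-byte codes.
same_in_gb18030 = traditional.encode("gb18030") == gbk_bytes
four_byte = supplementary.encode("gb18030")

print(gb2312_has_it, outside, same_in_gb18030, len(four_byte))
```

Only three sample characters are tested, so this illustrates the layering GB2312, GBK, GB 18030 rather than proving it for the whole repertoire.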

1.2.7 ISO646-* SINGLE-BYTES

The single-byte standards

    JIS_C6220-1969-ro alias ISO646-JP
    GB_1988-80 alias ISO646-CN
    KSC5636 alias ISO646-KR

all serve as "raw material" for the pure CES's listed in (1.1) by defining 94-character CCS's. Like any standard following the ISO 'coded character set' definition L<(Survey 3.1)|survey2.html#A3.1> they also define CES's: ASCII-like 7-bit ones. They differ from ASCII in that Yen, Yuan and Won symbols replace "$" (0x24) or "\" (0x5C), along with other minor changes, as is natural for the ISO646-* family.

1.3 pure CCS?

The following standards cited by [RFC 1345] (and registered by [IANA REG])

    JIS_C6220-1969-jp alias JIS_C6220-1969 alias iso-ir-13
    JIS_C6226-1983 alias JIS_X0208-1983 alias iso-ir-87
    JIS_X0212-1990 alias iso-ir-159
    GB_2312-80 alias iso-ir-58
    KS_C_5601-1987 alias KS_C_5601-1989 alias iso-ir-149 alias KSC_5601

are in a peculiar position. Because the ISO "coded character set" definition (all of these are ISO standards, as the iso-ir-xyz aliases show) does not allow a pure CCS standard to be created L<(Survey 2.1)|survey2.html#A2.1>, a CES inevitably gets defined as well. This effect has been discussed in L<(Survey Appendix D)|survey2.html#BD>.

So each of these standards additionally defines a 7-bit CES. We shall call these CES's "raw". The natural role of these standards is to help build the CES's listed in (1.1). Their "raw" usage is so rare that we may even suspect it was not their creators' intention.

For example, JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that encodes

  - regular (ISO 6429) control characters at 0x00 - 0x1F
  - SPACE and DELETE at 0x20, 0x7F
  - Katakana at 0x21 - 0x7E.

It is not clear whether this CES is used at all, but it probably is not (see also section (9)).

JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987 have even less usable "raw" 7-bit CES's. These are double-byte 7-bit CES's that have neither control characters nor SPACE and DELETE. Naturally, CES's that do not encode CR, LF and SPACE are of only limited use.
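
To make the "raw" double-byte 7-bit CES concrete, here is a minimal sketch in Python. It assumes the usual ISO 2022 convention that position (row, cell) of a 94x94 set maps to the byte pair (0x20 + row, 0x20 + cell), and that the EUC encodings listed in (1.1.1) carry the same pair with the high bit set on both bytes. The function names and sample characters are illustrative only, and Python's `euc_jp`, `gb2312` and `euc_kr` codecs serve merely as a cross-check.

```python
# Sketch: the "raw" 7-bit double-byte CES of a 94x94 set versus its EUC form.
# Assumption: rows and cells are numbered 1..94, as in the standards' tables.

def raw_7bit(row: int, cell: int) -> bytes:
    """Map a (row, cell) position to the two-byte 7-bit "raw" code."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])

def to_euc(raw: bytes) -> bytes:
    """Set the high bit of every byte, moving 0x21..0x7E to 0xA1..0xFE."""
    return bytes(b | 0x80 for b in raw)

# (codec, character, row, cell) for one sample character per standard.
samples = [
    ("euc_jp", "\u3042", 4, 2),    # HIRAGANA LETTER A in JIS X 0208
    ("gb2312", "\u554a", 16, 1),   # first Han character of GB_2312-80
    ("euc_kr", "\uac00", 16, 1),   # first Hangul syllable of KS_C_5601-1987
]

for codec, char, row, cell in samples:
    raw = raw_7bit(row, cell)
    # The raw code uses only 0x21..0x7E: no SPACE, CR, LF or other controls.
    assert all(0x21 <= b <= 0x7E for b in raw)
    print(codec, raw.hex(), to_euc(raw).hex(), char.encode(codec).hex())
```

The raw pair on its own cannot represent a line break or a space, which is why it is normally met only inside a wrapper such as EUC (high bit set) or ISO-2022-JP (raw pair preceded by an escape sequence that designates the set).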

It is likely, however, that these "raw" encodings are used internally by the X Window System.

One more confusing point about the mentioned standards is that, according to Jungshik Shin,

  - GB 2312-80 really defines a "raw" encoding by enumerating characters by their hexadecimal code points
  - JIS and KS standards, on the other hand, enumerate characters in terms of decimal positions in character tables (see L for a related discussion), thus being pure-CCS standards.

It appears, however, that [RFC 1345] in particular and the ISO 2022 ([ECMA 35]) approach in general have neglected this subtle difference, so we now have to classify _all_ these basic standards as CCS-CES and speak of _"raw"_ encodings for all of them.

Other 94x94 standards should be in the same position of implicitly defining mostly useless double-byte 7-bit CES's without SPACE, BACKSPACE, CR, LF and other control characters. These probably include

    ISO-IR-165
    GB/T 12345-90

1.4 pure-CCS: CNS 11643-1992, JIS X 0213

CNS 11643-1992 defines 16 94x94 planes. The first two planes have almost the same set of characters as Big5, but at different code points. It is used only inside EUC-TW. It does not define an implicit "raw" CES like those described in (1.3), because only each of its planes individually might be vulnerable to this "implicit" creation, not the standard as a whole.

The same holds for JIS X 0213, which defines two character planes.

1.5 NAMING REFERENCE

For completeness, here is a short reference that maps [IANA REG] names back to the original standard names. (It is not clear why this information is missing from [IANA REG].) Multiple standard names are given if a standard has several names or if several versions of the standard are identical in their definition of the given CES. [IANA REG] aliases are marked with '='.

94 DUALs

    [IANA REG]                                   original name(s)

    JIS_C6220-1969-ro = ISO646-JP,               JIS C 6220-1969
    JIS_C6220-1969-jp = katakana
      = JIS_C6220-1969
    GB_1988-80 = ISO646-CN                       GB 1988-80
    KSC5636 = ISO646-KR                          KS C 5636-1993
                                                 KS C 5636-1989

94x94 DUALs

    JIS_C6226-1983 = JIS_X0208-1983              JIS C 6226-1983
                                                 JIS X 0208:1983
    JIS_X0212-1990                               JIS X 0212:1990
    GB_2312-80                                   GB 2312-80
    KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601   KS C 5601-1987
                                                 KS C 5601-1992
                                                 KS X 1001:1997

Extensions/revisions of the mentioned 94x94 DUALs (2.1) not registered at [IANA REG]:

    JIS X 0208:1990
    JIS X 0208:1997
    KS X 1001:1998 (euro and one other char added)

Appendix A. REFERENCES

[CJKV]        CJKV Information Processing
              Ken Lunde. 1999
              O'Reilly & Associates, ISBN: 1-56592-224-7
              http://www.oreilly.com/catalog/cjkvinfo/

[CJK.INF]     CJK.INF Version 2.1
              Online Companion to "Understanding Japanese Information Processing", predecessor of [CJKV].
              Ken Lunde. July 12, 1996
              http://www.oreilly.com/people/authors/lunde/cjk_inf.html

[GB 18030]    O'REILLY'S GB 18030 SUMMARY
              San Jose, February 2001
              ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf

[HANGUL FAQ]  Jungshik Shin's Hangul FAQ
              http://jshin.net/faq
              http://jshin.net/faq/qa8.html

[IANA REG]    The Character Sets Registry
              (IANA registers charset values according to RFC 2278)
              http://www.iana.org/assignments/character-sets

[SURVEY]      "CHARACTER SET" TERMINOLOGY SURVEY.
              April 2002