CJK CHARACTER SET STANDARDS CLASSIFICATION. VERSION 0.1

Anton Tagunov
http://tagunov.tripod.com
tagunov@motor.ru
Moscow State University
Scientific Computer Research Center
http://srcc.msu.su
Apr 6 2002 text version of this document

TABLE OF CONTENTS

1. CLASSIFYING CJK STANDARDS

1.1 pure-CES

1.1.1 MULTIBYTE

1.1.2 SINGLE BYTE

1.1.3 CP932, Windows-31J

1.2 CCS-CES

1.2.1 Big5

1.2.2 CP950

1.2.4 CP936, GBK

1.2.5 CP949

1.2.6 GB 18030-2000

1.2.7 ISO646-* SINGLE-BYTES

1.3 pure CCS?

1.4 pure-CCS: CNS 11643-1992, JIS X 0213

1.5 NAMING REFERENCE

Appendix A. REFERENCES

1. CLASSIFYING CJK STANDARDS

This is an extension to [SURVEY].

This companion classifies a number of CJK character set standards according to the scheme used in [SURVEY] section 4: CCS-CES, pure-CES, pure-CCS.

The reasoning behind each classification is given.

1.1 pure-CES

1.1.1 MULTIBYTE

The following MIME charsets (3.2)

ISO-2022-KR (reportedly unused)
EUC-KR
ISO-2022-JP  ISO-2022-JP-1 ISO-2022-JP-2
Shift_JIS
EUC-JP
GB2312      (aka EUC-CN, CN-GB)
EUC-TW

ISO-2022-CN-EXT

are easily classified as CES's (Survey 2.1), and moreover as pure-CES's (Survey Appendix D).
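
As a rough illustration of what "pure-CES" means for the charsets above, the following Python sketch encodes one JIS X 0208 character with three of them. It relies on the codecs bundled with CPython (`iso2022_jp`, `euc_jp`, `shift_jis`); those codec names are Python's own and the variable names are only illustrative. The character's position in the underlying CCS stays the same, and only the byte-level encoding changes.

```python
# One character from the JIS X 0208 CCS, encoded by three different CES's.
ch = "\u3042"  # HIRAGANA LETTER A

iso2022 = ch.encode("iso2022_jp")  # escape sequence, two 7-bit bytes, escape back to ASCII
euc = ch.encode("euc_jp")          # the same two bytes with the high bit set
sjis = ch.encode("shift_jis")      # the same position, arithmetically rearranged

# Strip the 3-byte escape sequences that switch to JIS X 0208 and back to ASCII.
raw = iso2022[3:-3]

# EUC-JP is the 7-bit JIS X 0208 code with 0x80 added to each byte.
euc_from_raw = bytes(b | 0x80 for b in raw)

print(raw.hex(), euc.hex(), sjis.hex(), euc == euc_from_raw)
```

All three byte strings decode back to the same character, which is the sense in which these charsets only add an encoding on top of a CCS defined elsewhere.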

1.1.2 SINGLE BYTE

JIS_X0201 [RFC 1345] may be viewed as an 8-bit CES that encodes

JIS_C6220-1969-ro (ISO646-JP)
JIS_C6220-1969-jp (katakana)

(also [RFC 1345]).
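
A minimal sketch of this view, assuming the usual 8-bit layout in which bytes 0x21-0x7E select the roman set and bytes 0xA1-0xDF select the katakana set with the high bit added. The function name `split_jis_x0201` is hypothetical and not part of any standard or library.

```python
def split_jis_x0201(byte: int):
    """Return (set name, 7-bit code) for one JIS_X0201 byte, or None if unassigned."""
    if byte <= 0x20 or byte == 0x7F:
        return ("control", byte)                   # C0 controls, SPACE, DELETE
    if 0x21 <= byte <= 0x7E:
        return ("JIS_C6220-1969-ro", byte)         # roman half, code unchanged
    if 0xA1 <= byte <= 0xDF:
        return ("JIS_C6220-1969-jp", byte & 0x7F)  # katakana half, high bit removed
    return None                                    # not assigned in this layout


for b in (0x5C, 0xB1, 0xE0):
    print(hex(b), split_jis_x0201(b))
```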

1.1.3 CP932, Windows-31J

Microsoft's interpretation of Shift_JIS.

Referred to as Shift_JIS by Microsoft products.

Refer to [IANA REG] for a detailed description.

The [IANA REG] registration separately names the CCS's this CES is based upon.
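
The difference from Shift_JIS can be observed with CPython's bundled `cp932` and `shift_jis` codecs. The sketch below is only an illustration under the assumption that Python's `shift_jis` codec is restricted to JIS X 0208 and JIS X 0201; the sample character, CIRCLED DIGIT ONE, is one of the vendor extension characters.

```python
plain = "\u3042"      # HIRAGANA LETTER A, part of JIS X 0208
extension = "\u2460"  # CIRCLED DIGIT ONE, a vendor extension character

# Characters of the base CCS get identical bytes in both encodings.
same_bytes = plain.encode("cp932") == plain.encode("shift_jis")

cp932_bytes = extension.encode("cp932")

try:
    extension.encode("shift_jis")
    in_shift_jis = True
except UnicodeEncodeError:
    in_shift_jis = False

print(same_bytes, cp932_bytes.hex(), in_shift_jis)
```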

1.2 CCS-CES

Like ISO-8859-* (Survey 4.1), each of the following standards defines both a CCS and a CES.

1.2.1 Big5

1.2.2 CP950

Microsoft extension to Big5. Erroneously referred to as Big5 by Microsoft products.

1.2.4 CP936, GBK

Microsoft extension to GB2312.

Erroneously referred to as GB2312 by Microsoft products.

Backward compatible with GB2312. Extends GB2312 with the unified Han characters from ISO 10646-1:1993 that are not already present in GB_2312-80.

New characters are placed in unused positions of the 0x8181 - 0xFEFE range natural to EUC-CN and in the newly allocated 0x8140-0xFE7E range.
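
The backward compatibility and the new second-byte range can be checked with CPython's bundled `gb2312` and `gbk` codecs. This is a small sketch, not a conformance test; the two sample characters are chosen for illustration only.

```python
common = "\u4e2d"    # a Han character present in GB_2312-80
gbk_only = "\u4e02"  # a unified Han character absent from GB_2312-80

# Backward compatibility: GB2312 characters keep their bytes in GBK.
same_bytes = common.encode("gb2312") == common.encode("gbk")

try:
    gbk_only.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

# The added character lands in the newly allocated range,
# where the second byte lies below 0x80.
lead, trail = gbk_only.encode("gbk")

print(same_bytes, in_gb2312, hex(lead), hex(trail))
```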

1.2.5 CP949

Microsoft extension to EUC-KR. Also known as Unified Hangul Code, UHC. Erroneously referred to as KS_C_5601-1987 by Microsoft products. Adds 8822 pre-combined Hangul syllables to EUC-KR.

Uses the same extension technique as GBK.
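
The figure of 8822 added syllables can be cross-checked against CPython's bundled `cp949` codec. The sketch below assumes that the original EUC-KR area is exactly the set of two-byte codes with both bytes at or above 0xA1; the helper name `in_euc_kr_area` is hypothetical.

```python
# All precomposed Hangul syllables in Unicode: U+AC00 .. U+D7A3.
syllables = [chr(cp) for cp in range(0xAC00, 0xD7A4)]


def in_euc_kr_area(encoded: bytes) -> bool:
    """True if a CP949 code lies in the area already used by EUC-KR."""
    return len(encoded) == 2 and encoded[0] >= 0xA1 and encoded[1] >= 0xA1


encoded = [s.encode("cp949") for s in syllables]
original = sum(1 for b in encoded if in_euc_kr_area(b))
added = len(encoded) - original

print(len(encoded), original, added)
```

Syllables counted in `added` occupy codes whose first or second byte falls below 0xA1, which is the same kind of extension area that GBK uses.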

1.2.6 GB 18030-2000

Chinese national standard. Extends GBK. Introduces four-byte codes for characters. Provides space for all assigned and unassigned Unicode 3.2 code points (the BMP plus 16 extension planes).
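
A short sketch using CPython's bundled `gb18030` and `gbk` codecs shows the two-byte compatibility with GBK and the four-byte codes that cover the planes beyond the BMP. The sample code points are chosen for illustration.

```python
han = "\u4e2d"

two_byte = han.encode("gb18030")             # unchanged from GBK
first_supp = "\U00010000".encode("gb18030")  # first code point beyond the BMP
last_supp = "\U0010ffff".encode("gb18030")   # last Unicode code point

print(two_byte == han.encode("gbk"))
print(first_supp.hex(), last_supp.hex())
```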

1.2.7 ISO646-* SINGLE-BYTES

The single-byte standards

JIS_C6220-1969-ro    alias ISO646-JP
GB_1988-80           alias ISO646-CN
KSC5636              alias ISO646-KR

all serve as "raw material" for the pure CES's listed in (1.1), by defining 94-character CCS's.

Like any standard following the ISO 'coded character set' definition (Survey 3.1), they also define CES's: ASCII-like 7-bit ones.

They differ from ASCII in that Yen, Yuan and Won symbols replace "$" (0x24) or "\" (0x5C), along with other minor changes, as is natural for the ISO646-* family.
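
The following sketch shows how such a variant can be decoded by overriding a few ASCII positions. The override table is an assumption based on the commonly published ISO 646 variant tables (it is not taken from the text above), and the names `VARIANTS` and `decode_iso646` are hypothetical.

```python
# Positions where each national variant is assumed to differ from ASCII.
VARIANTS = {
    "ISO646-JP": {0x5C: "\u00a5", 0x7E: "\u203e"},  # yen sign, overline
    "ISO646-CN": {0x24: "\u00a5", 0x7E: "\u203e"},  # yuan sign, overline
    "ISO646-KR": {0x5C: "\u20a9"},                  # won sign
}


def decode_iso646(data: bytes, variant: str) -> str:
    """Decode 7-bit data, replacing the positions that differ from ASCII."""
    overrides = VARIANTS[variant]
    chars = []
    for b in data:
        if b > 0x7F:
            raise ValueError("not a 7-bit byte: %#x" % b)
        chars.append(overrides.get(b, chr(b)))
    return "".join(chars)


sample = b"\\100 $5"
for name in VARIANTS:
    print(name, decode_iso646(sample, name))
```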

1.3 pure CCS?

The following standards cited by [RFC 1345] (and registered by [IANA REG])

JIS_C6220-1969-jp    alias JIS_C6220-1969   alias iso-ir-13

JIS_C6226-1983       alias JIS_X0208-1983   alias iso-ir-87
JIS_X0212-1990                              alias iso-ir-159
GB_2312-80                                  alias iso-ir-58
KS_C_5601-1987       alias KS_C_5601-1989   alias iso-ir-149
                     alias KSC_5601

are in a peculiar position. Because the ISO "coded character set" definition does not allow for a pure CCS standard (Survey 2.1), a CES inevitably gets defined as well (all of these are registered with ISO, as the iso-ir-xyz aliases show). This effect is discussed in (Survey Appendix D). So each of these standards additionally defines a 7-bit CES. We shall call these CES's "raw".

The natural role of these standards is to help build the CES's listed in (1.1).

Their "raw" usage is so rare that we may even suspect it was not their creators' intention.

For example, JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that encodes katakana (see 1.1.2).

It is not clear if this CES is used at all, but it is probably not (see also section (9)).

JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987 have even less usable "raw" 7-bit CES's. These are double-byte 7-bit CES's that have neither control characters nor SPACE nor DELETE. Naturally, CES's that do not encode CR, LF and SPACE are of only limited use.
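
A minimal sketch of such a "raw" double-byte 7-bit CES, assuming the usual 94x94 convention that row and cell numbers run from 1 to 94 and are offset by 0x20. The function name `raw_94x94` is hypothetical.

```python
def raw_94x94(row: int, cell: int) -> bytes:
    """Encode a (row, cell) position of a 94x94 set as two 7-bit bytes."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


# Collect every byte value the encoding can ever produce.
all_bytes = {b for r in range(1, 95) for c in range(1, 95) for b in raw_94x94(r, c)}

print(hex(min(all_bytes)), hex(max(all_bytes)))
print(any(b in all_bytes for b in (0x0A, 0x0D, 0x20, 0x7F)))
```

Every byte falls in 0x21-0x7E, so LF, CR, SPACE and DELETE can never appear in text encoded this way.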

It is likely however that these "raw" encodings are used internally by the X Window system.

One more confusing moment about the mentioned standards is that according to Jungshik Shin,

It looks, however, as if [RFC 1345] in particular, and the ISO 2022 ([ECMA 35]) approach in general, have neglected this subtle difference, so we now have to classify all these basic standards as CCS-CES and speak of "raw" encodings for all of them.

Other 94x94 standards should be in the same position of implicitly defining mostly useless double-byte 7-bit CES's without SPACE, BACKSPACE, CR, LF and other control characters. These probably include

ISO-IR-165
GB/T 12345-90

1.4 pure-CCS: CNS 11643-1992, JIS X 0213

CNS 11643-1992 defines 16 94x94 planes. The first two planes have almost the same set of characters as Big5, but at different code points. It is used only inside EUC-TW. It does not define an implicit "raw" CES like those described in (1.3), because only its individual planes, not the standard as a whole, might be vulnerable to this "implicit" creation.

The same holds for JIS X 0213, which defines two character planes.

1.5 NAMING REFERENCE

For completeness, here is a short reference that maps [IANA REG] names back to the original standard names. (It is not clear why this information is missing from [IANA REG].)

Multiple standard names are given if a standard has several names, or if several versions of a standard are identical in their definition of the given CES. [IANA REG] aliases are marked with '='.

94 DUALs

[IANA REG]                                   original name(s)

JIS_C6220-1969-ro = ISO646-JP,               JIS C 6220-1969
JIS_C6220-1969-jp = katakana =
                    JIS_C6220-1969

GB_1988-80        = ISO646-CN                GB 1988-80

KSC5636           = ISO646-KR                KS C 5636-1993
                                             KS C 5636-1989

94x94 DUALs

JIS_C6226-1983 = JIS_X0208-1983              JIS C 6226-1983
                                             JIS X 0208:1983
JIS_X0212-1990                               JIS X 0212:1990
GB_2312-80                                   GB 2312-80

KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601   KS C 5601-1987
                                             KS C 5601-1992
                                             KS X 1001:1997

Extensions/revisions of the mentioned 94x94 DUAL's (2.1) not registered at [IANA REG]:

JIS X 0208:1990
JIS X 0208:1997
KS X 1001:1998  (euro and one other char added)

Appendix A. REFERENCES

[CJKV]
CJKV Information Processing
Ken Lunde. 1999
O'Reilly & Associates, ISBN : 1-56592-224-7
http://www.oreilly.com/catalog/cjkvinfo/
[CJK.INF]
CJK.INF Version 2.1
Online Companion to "Understanding Japanese Information Processing",
predecessor of [CJKV]. Ken Lunde. July 12, 1996
http://www.oreilly.com/people/authors/lunde/cjk_inf.html
[GB 18030]
OREILLY'S GB 18030 SUMMARY
San Jose, February 2001
ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf
[HANGUL FAQ]
Jungshik Shin's Hangul FAQ
http://jshin.net/faq
http://jshin.net/faq/qa8.html
[IANA REG]
The Character Sets Registry
(IANA registers charset values according to RFC 2278)
http://www.iana.org/assignments/character-sets
[SURVEY]
"CHARACTER SET" TERMINOLOGY SURVEY
Anton Tagunov. April 2002
survey2.html