CJK CHARACTER SET STANDARDS CLASSIFICATION. VERSION 0.1

Anton Tagunov
http://tagunov.tripod.com
tagunov@motor.ru


Moscow State University
Scientific Computer Research Center
http://srcc.msu.su

TABLE OF CONTENTS
1.  CLASSIFYING CJK STANDARDS
    1.1 pure-CES
        1.1.1 MULTIBYTE
        1.1.2 SINGLE BYTE
        1.1.3 CP932, Windows-31J
    1.2 CCS-CES
        1.2.1 Big5
        1.2.2 CP950
        1.2.4 CP936, GBK
        1.2.5 CP949
        1.2.6 GB 18030-2000
        1.2.7 ISO646-* SINGLE-BYTES
    1.3 pure CCS?
    1.4 pure-CCS: CNS 11643-1992, JIS X 0213
    1.5 NAMING REFERENCE
Appendix A. REFERENCES

1. CLASSIFYING CJK STANDARDS

This is an extension to [SURVEY].

This companion classifies a number of CJK character
set standards according to the classification scheme
used in [SURVEY] L<section 4|survey2.html#A4>:
CCS-CES, pure-CES, pure-CCS.

The reasoning behind each classification is given.


1.1 pure-CES

1.1.1 MULTIBYTE

The following MIME charsets (3.2)

  ISO-2022-KR (reportedly unused)
  EUC-KR
  ISO-2022-JP  ISO-2022-JP-1 ISO-2022-JP-2
  Shift_JIS
  EUC-JP
  GB2312      (aka EUC-CN, CN-GB)
  EUC-TW

  ISO-2022-CN-EXT

are easily classified as CES's L<(Survey 2.1)|survey2.html#A2.1>,
and more specifically as pure-CES's L<(Survey Appendix D)|survey2.html#BD>.
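
As a minimal sketch of what "pure-CES" means for these charsets, the
Python snippet below serializes one character of the JIS X 0208 CCS
with three of the CES's listed above. The codec names (iso2022_jp,
euc_jp, shift_jis) are those of Python's standard library; the sample
character and the helper name are our own choice.

```python
# One character of the JIS X 0208 CCS, serialized by three different CES's.
SAMPLE = "\u65e5"  # a kanji found in JIS X 0208

def encodings(ch):
    """Return the byte sequences three Japanese CES's produce for ch."""
    names = ("iso2022_jp", "euc_jp", "shift_jis")
    return {name: ch.encode(name) for name in names}

result = encodings(SAMPLE)
for name, data in result.items():
    # Same abstract character, three different byte sequences.
    print(f"{name:11} {data.hex(' ')}")
```

None of the three charsets introduces characters of its own here: they
only differ in how the same coded character is turned into bytes.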

1.1.2 SINGLE BYTE

JIS_X0201 [RFC 1345] may be viewed as an 8-bit CES
that encodes

  JIS_C6220-1969-ro (ISO646-JP)
  JIS_C6220-1969-jp (katakana)

(also [RFC 1345]).
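
The 8-bit view can be sketched as follows. This is an illustration
under our own assumptions, not a reference decoder: the function name
is hypothetical, the Unicode targets (halfwidth katakana at
U+FF61..U+FF9F, Yen at U+00A5, overline at U+203E) follow the usual
mapping tables, and only the 63 katakana positions found in those
tables are accepted in the upper half.

```python
def decode_jis_x0201(data: bytes) -> str:
    """Decode bytes as an 8-bit code: ISO646-JP below 0x80,
    the katakana set shifted into the upper half (7-bit code + 0x80)."""
    out = []
    for b in data:
        if 0xA1 <= b <= 0xDF:
            seven_bit = b - 0x80            # position in the katakana set
            out.append(chr(0xFF40 + seven_bit))
        elif b == 0x5C:
            out.append("\u00a5")            # Yen sign replaces backslash
        elif b == 0x7E:
            out.append("\u203e")            # overline replaces tilde
        elif b < 0x80:
            out.append(chr(b))              # identical to ASCII
        else:
            raise ValueError(f"byte 0x{b:02X} is not assigned")
    return "".join(out)

print(decode_jis_x0201(b"\xb1\x5c100"))
```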

1.1.3 CP932, Windows-31J

Microsoft's interpretation of Shift_JIS.

Referred to as Shift_JIS by Microsoft products.

Refer to [IANA REG] for a detailed description.

The [IANA REG] registration separately names the CCS's this CES is
based upon.
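
The practical difference from Shift_JIS can be observed with Python's
standard codecs, which keep shift_jis and cp932 apart. The sketch
below is only an illustration; the two probes (one shared byte pair,
one vendor-extension character) and the helper name are our own
choice.

```python
def can_encode(text: str, codec: str) -> bool:
    """True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The same two bytes are mapped to different Unicode characters.
WAVE = b"\x81\x60"
as_shift_jis = WAVE.decode("shift_jis")
as_cp932 = WAVE.decode("cp932")
print(f"shift_jis: U+{ord(as_shift_jis):04X}  cp932: U+{ord(as_cp932):04X}")

# A vendor-extension character exists only on the CP932 side.
CIRCLED_ONE = "\u2460"
print(can_encode(CIRCLED_ONE, "cp932"), can_encode(CIRCLED_ONE, "shift_jis"))
```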

1.2 CCS-CES

Like ISO-8859-* L<(Survey 4.1)|survey2.html#A4.1>, each of the
following standards defines both a CCS and a CES.

1.2.1 Big5

1.2.2 CP950

Microsoft extension to Big5.
Erroneously referred to as Big5 by Microsoft products.

1.2.4 CP936, GBK

Microsoft extension to GB2312.

Erroneously referred to as GB2312 by Microsoft products.

Backward compatible with GB2312.
Extends GB2312 with unified Han characters from ISO 10646-1:1993
not already present in GB_2312-80.

New characters are inserted into unused positions in the natural EUC-CN
0xA1A1 - 0xFEFE range and into the newly allocated 0x8140-0xFE7E range.
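
A small sketch of this layout, using Python's gb2312 and gbk codecs.
It assumes that EUC-CN double-byte codes use only bytes 0xA1..0xFE in
both positions; the helper names and the two sample characters are
our own choice.

```python
def in_euc_cn_range(code: bytes) -> bool:
    """True if code is a double-byte code with both bytes in 0xA1..0xFE."""
    return len(code) == 2 and all(0xA1 <= b <= 0xFE for b in code)

def describe(ch: str):
    """Return (GB2312 bytes or None, GBK bytes, GBK code inside EUC-CN range?)."""
    gbk = ch.encode("gbk")
    try:
        gb2312 = ch.encode("gb2312")
    except UnicodeEncodeError:
        gb2312 = None                      # character is a GBK addition
    return gb2312, gbk, in_euc_cn_range(gbk)

for ch in ("\u4e2d", "\u9555"):            # second one is absent from GB2312
    gb2312, gbk, inside = describe(ch)
    print(f"U+{ord(ch):04X} gb2312={gb2312} gbk={gbk.hex()} in EUC-CN range={inside}")
```

A character already present in GB2312 keeps its bytes under GBK, which
is what backward compatibility amounts to in practice.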

1.2.5 CP949

Microsoft extension to EUC-KR.
Also known as Unified Hangul Code, UHC.
Erroneously referred to as KS_C_5601-1987 by Microsoft products.
Adds 8822 pre-combined Hangul syllables to EUC-KR.

Uses the same extension technique as GBK.
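
The count of added syllables can be cross-checked against Python's
cp949 codec. The sketch below assumes that the Unicode block
U+AC00..U+D7A3 holds all 11172 pre-combined syllables and that a
syllable inherited from EUC-KR has both bytes in 0xA1..0xFE; the names
are our own.

```python
FIRST, LAST = 0xAC00, 0xD7A3               # Unicode Hangul Syllables block

def in_euc_kr_range(code: bytes) -> bool:
    """True if code is a double-byte code with both bytes in 0xA1..0xFE."""
    return len(code) == 2 and all(0xA1 <= b <= 0xFE for b in code)

total = LAST - FIRST + 1
# Syllables whose CP949 code falls outside the EUC-KR range are additions.
added = sum(
    1
    for cp in range(FIRST, LAST + 1)
    if not in_euc_kr_range(chr(cp).encode("cp949"))
)
print(f"syllables: {total}, outside the EUC-KR range: {added}")
```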

1.2.6 GB 18030-2000

Chinese national standard.
Extends GBK.
Introduces four-byte codes for characters.
Provides space for all assigned and unassigned Unicode 3.2
code points (the BMP plus 16 supplementary planes).
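
The capacity claim can be checked with a little arithmetic. The sketch
below assumes the usual description of the four-byte form (bytes 1 and
3 in 0x81..0xFE, bytes 2 and 4 in 0x30..0x39) and uses Python's
gb18030 codec for a few samples; the names are our own.

```python
def four_byte_capacity() -> int:
    """Number of distinct four-byte codes under the assumed byte ranges."""
    lead = 0xFE - 0x81 + 1                  # 126 values for bytes 1 and 3
    digit = 0x39 - 0x30 + 1                 # 10 values for bytes 2 and 4
    return lead * digit * lead * digit

UNICODE_CODE_POINTS = 17 * 0x10000          # BMP plus 16 further planes

samples = {
    ch: ch.encode("gb18030")
    for ch in ("\u4e2d", "\U00010000", "\U0010ffff")
}

print(four_byte_capacity(), ">=", UNICODE_CODE_POINTS)
for ch, data in samples.items():
    print(f"U+{ord(ch):06X} -> {data.hex(' ')}")
```

The character that GBK already encodes keeps its two bytes, while code
points beyond the BMP take the four-byte form.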

1.2.7 ISO646-* SINGLE-BYTES

The single-byte standards

  JIS_C6220-1969-ro    alias ISO646-JP
  GB_1988-80           alias ISO646-CN
  KSC5636              alias ISO646-KR

all serve as "raw material" for the pure CES's listed in (1.1), by
defining 94-character CCS's.

Like any standard following the ISO 'coded character set'
definition L<(Survey 3.1)|survey2.html#A3.1>
they also define CES's: ASCII-like 7-bit ones.

They differ from ASCII in that Yen, Yuan and Won symbols
replace "$" (0x24) or "\" (0x5C), along with other minor changes,
as is usual for the ISO646-* family.
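
These differences are small enough to be written as an override table
on top of ASCII. The sketch below is an illustration only: the table
lists the replacements as we understand them from commonly published
mapping tables (Unicode has a single sign, U+00A5, for both Yen and
Yuan), and the names are hypothetical.

```python
# Positions where each national variant departs from ASCII (assumed values).
ISO646_OVERRIDES = {
    "ISO646-JP": {0x5C: "\u00a5", 0x7E: "\u203e"},   # Yen sign, overline
    "ISO646-CN": {0x24: "\u00a5", 0x7E: "\u203e"},   # Yuan sign, overline
    "ISO646-KR": {0x5C: "\u20a9"},                   # Won sign
}

def decode_iso646(data: bytes, variant: str) -> str:
    """Decode 7-bit data as ASCII, then apply the variant's overrides."""
    overrides = ISO646_OVERRIDES[variant]
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError(f"byte 0x{b:02X} is not 7-bit")
        out.append(overrides.get(b, chr(b)))
    return "".join(out)

for variant in ISO646_OVERRIDES:
    print(variant, decode_iso646(b"$100 \x5c100", variant))
```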

1.3 pure CCS?

The following standards cited by [RFC 1345] (and registered by
[IANA REG])

  JIS_C6220-1969-jp    alias JIS_C6220-1969   alias iso-ir-13

  JIS_C6226-1983       alias JIS_X0208-1983   alias iso-ir-87
  JIS_X0212-1990                              alias iso-ir-159
  GB_2312-80                                  alias iso-ir-58
  KS_C_5601-1987       alias KS_C_5601-1989   alias iso-ir-149
                       alias KSC_5601


are in a peculiar position. The ISO "coded character set"
definition (all of these are ISO-registered, as the iso-ir-xyz
aliases show) does not allow a pure CCS standard to be created
L<(Survey 2.1)|survey2.html#A2.1>, so a CES inevitably gets
defined as well. This effect has been discussed in
L<(Survey Appendix D)|survey2.html#BD>.
So each of these standards additionally defines
a 7-bit CES. We shall call these CES's "raw".

The natural role of these standards is to help build the CES's listed
in (1.1).

Their "raw" usage is so rare that we may even suspect that it
was not their creators' intention.

For example JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that
encodes
- regular (ISO 6429) control characters at 0x00 - 0x1F
- SPACE and DELETE                      at 0x20,  0x7F
- Katakana                              at 0x21 - 0x7E.
It is not clear whether this CES is used at all; it probably is not
(see also section (9)).

JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987
have even less usable "raw" 7-bit CES's. These are double-byte
7-bit CES's that have neither control characters nor SPACE and DELETE.
Naturally, CES's that do not encode CR, LF and SPACE are of only
limited use.

It is likely, however, that these "raw" encodings are used internally
by the X Window System.

One more confusing point about the mentioned standards is that,
according to Jungshik Shin,
- GB 2312-80 really defines a "raw" encoding by enumerating
  characters by their hexadecimal code points
- JIS and KS standards on the other hand enumerate characters
  in terms of decimal positions in character tables
  (see L<Survey Appendix B|survey2.html#BB> for a related
  discussion), thus being pure-CCS standards.
It appears, however, that [RFC 1345] in particular and the ISO 2022
([ECMA 35]) approach in general have neglected this subtle difference,
so we now have to classify _all_ these basic standards as CCS-CES
and speak of _"raw"_ encodings for all of them.
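
The relation between a decimal table position and the "raw" code is
simple arithmetic, which may explain why the difference was neglected.
The sketch below assumes the usual 94x94 convention (rows and cells
numbered 1..94, "raw" byte = 0x20 + position, EUC byte = "raw" byte
with the high bit set); the function names and sample positions are
our own.

```python
def raw_from_position(row: int, cell: int) -> bytes:
    """Turn a decimal (row, cell) table position into a 'raw' 7-bit code."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])

def euc_from_raw(raw: bytes) -> bytes:
    """Set the high bit of each byte, as the EUC CES's of (1.1) do."""
    return bytes(b | 0x80 for b in raw)

# JIS X 0208 row 38, cell 92 and KS C 5601 row 16, cell 1.
for codec, row, cell in (("euc_jp", 38, 92), ("euc_kr", 16, 1)):
    raw = raw_from_position(row, cell)
    euc = euc_from_raw(raw)
    ch = euc.decode(codec)
    print(f"{row}-{cell}: raw {raw.hex()} euc {euc.hex()} U+{ord(ch):04X}")
```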

Other 94x94 standards should be in the same position
of implicitly defining mostly useless double-byte 7-bit CES's
without SPACE, BACKSPACE, CR, LF and other control characters.
These probably include

  ISO-IR-165
  GB/T 12345-90

1.4 pure-CCS: CNS 11643-1992, JIS X 0213

CNS 11643-1992 defines 16 94x94 planes.
The first two planes have almost the same set of characters as Big5,
but at different code points. It is used only inside EUC-TW. It does
not define an implicit "raw" CES like those described in (1.3),
because only its individual planes, not the standard as a whole,
might be vulnerable to this "implicit" creation.

The same holds for JIS X 0213, which defines two character planes.
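
Why a multi-plane standard needs a separate CES can be sketched with
the way EUC-TW is commonly described: plane 1 as a plain two-byte
code, and any plane as a four-byte code introduced by the single shift
byte 0x8E followed by a plane selector. The layout below is our
assumption from those descriptions, and the function name is
hypothetical; Python ships no EUC-TW codec to check it against.

```python
SS2 = 0x8E                                  # single shift 2 (assumed)

def euc_tw(plane: int, row: int, cell: int) -> bytes:
    """Encode a CNS 11643 (plane, row, cell) position as EUC-TW bytes."""
    if not 1 <= plane <= 16:
        raise ValueError("plane must be in 1..16")
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    pair = bytes([0xA0 + row, 0xA0 + cell])
    if plane == 1:
        return pair                         # short form for plane 1
    # Other planes: the plane number must travel with the code.
    return bytes([SS2, 0xA0 + plane]) + pair

print(euc_tw(1, 1, 1).hex(" "))
print(euc_tw(2, 1, 1).hex(" "))
```

The same (row, cell) pair appears in every plane, so the pair alone
cannot serve as a code for the standard as a whole.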

1.5 NAMING REFERENCE

For completeness, here is a short reference that maps [IANA REG]
names back to the original standard names. (It is not
clear why this information is missing from [IANA REG].)

Multiple standard names are given if a standard has
several names or if several versions of a standard are identical in
their definition of the given CES. [IANA REG] aliases are marked with '='.

94 DUALs

  [IANA REG]                                   original name(s)

  JIS_C6220-1969-ro = ISO646-JP,               JIS C 6220-1969
  JIS_C6220-1969-jp = katakana =
  JIS_C6220-1969

  GB_1988-80        = ISO646-CN                GB 1988-80

  KSC5636           = ISO646-KR                KS C 5636-1993
                                               KS C 5636-1989
94x94 DUALs

  JIS_C6226-1983 = JIS_X0208-1983              JIS C 6226-1983
                                               JIS X 0208:1983
  JIS_X0212-1990                               JIS X 0212:1990
  GB_2312-80                                   GB 2312-80

  KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601   KS C 5601-1987
                                               KS C 5601-1992
                                               KS X 1001:1997

Extensions/revisions of the mentioned 94x94 DUAL's (2.1) not
registered at [IANA REG]:

  JIS X 0208:1990
  JIS X 0208:1997
  KS X 1001:1998  (euro and one other char added)

Appendix A. REFERENCES

[CJKV]
  CJKV Information Processing
  Ken Lunde. 1999 
  O'Reilly & Associates, ISBN : 1-56592-224-7
  http://www.oreilly.com/catalog/cjkvinfo/

[CJK.INF]
  CJK.INF Version 2.1
  Online Companion to
  "Understanding Japanese Information Processing",
  predecessor of [CJKV].
  Ken Lunde. July 12, 1996
  http://www.oreilly.com/people/authors/lunde/cjk_inf.html

[GB 18030]
  OREILLY'S GB 18030 SUMMARY
  San Jose, February 2001
  ftp://ftp.oreilly.com/pub/examples/nutshell/cjkv/pdf/GB18030_Summary.pdf

[HANGUL FAQ]
  Jungshik Shin's Hangul FAQ
  http://jshin.net/faq
  http://jshin.net/faq/qa8.html

[IANA REG]
  The Character Sets Registry
  (IANA registers charset values according to RFC 2278)
  http://www.iana.org/assignments/character-sets

[SURVEY]
  "CHARACTER SET" TERMINOLOGY SURVEY
  L<Anton Tagunov|mailto:tagunov@motor.ru>.
  April 2002
  L<survey2.html>