Anton Tagunov http://tagunov.tripod.com tagunov@motor.ru
Moscow State University Scientific Computer Research Center http://srcc.msu.su
Apr 6 2002
This is an extension to [SURVEY].
This companion classifies a number of CJK character set standards according to the classification scheme practiced in [SURVEY] section 4: CCS-CES, pure-CES, pure-CCS.
Argumentation is given for classifying each of them.
The following MIME charsets (3.2) ISO-2022-KR (reportedly unused), EUC-KR, ISO-2022-JP, ISO-2022-JP-1, ISO-2022-JP-2, Shift_JIS, EUC-JP, GB2312 (aka EUC-CN, CN-GB), EUC-TW and ISO-2022-CN-EXT easily classify as CES's (Survey 2.1), and moreover as pure-CES's (Survey Appendix D).
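As a rough illustration of what "pure-CES" means in practice, the sketch below pushes the same two characters through three of the charsets listed above using Python's bundled codecs. The codec names (`iso2022_jp`, `euc_jp`, `shift_jis`) are Python's spellings and the sample string is an arbitrary choice; neither comes from [SURVEY].

```python
# Same characters, three different encoding schemes (CES's).
SAMPLE = "日本"  # arbitrary sample; both characters are in JIS X 0208

CODECS = ["iso2022_jp", "euc_jp", "shift_jis"]  # Python codec names


def encode_all(text):
    """Return {codec name: encoded bytes} for every codec in CODECS."""
    return {name: text.encode(name) for name in CODECS}


encoded = encode_all(SAMPLE)
for name, data in encoded.items():
    print(f"{name:12} {data.hex(' ')}")
```

The three byte strings differ, yet each decodes back to the same characters: the charsets share a character repertoire and differ only in how code points are serialized.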
JIS_X0201 [RFC 1345] may be viewed as an 8-bit CES that encodes JIS_C6220-1969-ro (ISO646-JP) and JIS_C6220-1969-jp (katakana) (also [RFC 1345]).
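The 8-bit view of JIS_X0201 can be sketched as a decoder with two halves. The function name is hypothetical, and mapping the katakana half onto Unicode's halfwidth katakana block (U+FF61 to U+FF9F) is an assumption made for illustration, not text taken from [RFC 1345].

```python
def decode_jis_x0201(data: bytes) -> str:
    """Decode 8-bit JIS X 0201 (hypothetical helper, for illustration)."""
    out = []
    for b in data:
        if b == 0x5C:
            out.append("\u00a5")      # YEN SIGN sits where ASCII has backslash
        elif b == 0x7E:
            out.append("\u203e")      # OVERLINE sits where ASCII has tilde
        elif b < 0x80:
            out.append(chr(b))        # rest of the Roman half matches ASCII
        elif 0xA1 <= b <= 0xDF:
            # katakana half: the 7-bit code is b & 0x7F (0x21..0x5F)
            out.append(chr(0xFF61 + (b - 0xA1)))
        else:
            raise ValueError(f"byte 0x{b:02X} is unassigned in JIS X 0201")
    return "".join(out)


print(decode_jis_x0201(b"\\100 \xb1\xb2"))
```

The lower half behaves like the ISO646-JP set and the upper half like the katakana set shifted by 0x80, which is why a single 8-bit CES can carry both.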
Microsoft's understanding of Shift_JIS.
Referred to as Shift_JIS by Microsoft products.
Refer to [IANA REG] for a detailed description.
The [IANA REG] registration separately names the CCS's this CES is based upon.
The following standards, like ISO-8859-* (Survey 4.1), each define both a CCS and a CES.
Microsoft extension to Big5.
Erroneously referred to as Big5 by Microsoft products.
Microsoft extension to GB2312.
Erroneously referred to as GB2312 by Microsoft products.
Backward compatible with GB2312.
Extends GB2312 with unified Han characters from ISO 10646-1:1993 not already present in GB_2312-80.
New characters are inserted into unused positions in the natural EUC-CN range 0xA1A1 - 0xFEFE and into the newly allocated 0x8140-0xFE7E range.
Microsoft extension to EUC-KR.
Also known as Unified Hangul Code, UHC.
Erroneously referred to as KS_C_5601-1987 by Microsoft products.
Adds 8822 pre-combined Hangul syllables to EUC-KR.
Uses the same extension technique as GBK.
China national standard.
Extends GBK.
Introduces 4-byte codings for characters.
Provides space for all assigned and unassigned Unicode 3.2 (BMP plus 16 extension planes) code points.
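The extension technique described for the GB2312 and EUC-KR extensions, and the 4-byte forms of the Chinese national standard, can be probed with Python's bundled codecs. This is only a sketch: it assumes that Python's `gbk`, `cp949` and `gb18030` codecs correspond to the three extensions above, and the sample characters are arbitrary picks.

```python
def in_euc_range(two_bytes: bytes) -> bool:
    """True if both bytes lie in 0xA1..0xFE, the range EUC-CN and EUC-KR use."""
    return len(two_bytes) == 2 and all(0xA1 <= b <= 0xFE for b in two_bytes)


# GBK: GB2312 text keeps its bytes, while a Han character that GB2312
# lacks is placed outside the EUC-CN range.
same_bytes = "中文".encode("gbk") == "中文".encode("gb2312")
gbk_only = "龍".encode("gbk")
print("GB2312 text unchanged:", same_bytes)
print("GBK-only character:", gbk_only.hex(), "in EUC range:", in_euc_range(gbk_only))

# UHC: sort the modern Hangul syllables (U+AC00..U+D7A3) by where their
# two-byte cp949 code falls.
syllables = [chr(cp) for cp in range(0xAC00, 0xD7A4)]
inside = sum(in_euc_range(s.encode("cp949")) for s in syllables)
print("in EUC-KR range:", inside, "in extension area:", len(syllables) - inside)

# GB 18030: one, two and four byte forms.
samples = ("A", "中", "\u0080", "\U00010000")
for ch in samples:
    print(f"U+{ord(ch):04X} ->", ch.encode("gb18030").hex())
```

Counting by byte range rather than by encodability is a deliberate choice: it measures where the added characters were placed, which is the point of the extension technique.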
Single byte
JIS_C6220-1969-ro alias ISO646-JP, GB_1988-80 alias ISO646-CN and KSC5636 alias ISO646-KR all serve as "raw material" for the pure CES's listed in (1.1), by defining 94-character CCS's.
Like any standard following the ISO 'coded character set' definition (Survey 3.1), they also define CES's: ASCII-like 7-bit ones.
The difference from ASCII is the Yen, Yuan and Won symbols replacing "$" (0x24) or "\" (0x5C), and other minor changes, as is natural for the ISO646-* family.
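The substitutions can be written down as a small override table on top of ASCII. The sketch below is illustrative only: the table and function names are made up, and the override positions (including the overline at 0x7E) are given from general knowledge of these ISO 646 variants rather than from the text above.

```python
# Positions where each national variant departs from ASCII (assumed
# values, shown for illustration).
ISO646_OVERRIDES = {
    "ISO646-JP": {0x5C: "\u00a5", 0x7E: "\u203e"},  # yen sign, overline
    "ISO646-CN": {0x24: "\u00a5", 0x7E: "\u203e"},  # yuan sign, overline
    "ISO646-KR": {0x5C: "\u20a9"},                  # won sign
}


def decode_iso646(data: bytes, variant: str) -> str:
    """Decode 7-bit data as ASCII with the variant's substitutions applied."""
    overrides = ISO646_OVERRIDES[variant]
    chars = []
    for b in data:
        if b > 0x7F:
            raise ValueError("ISO 646 variants are 7-bit")
        chars.append(overrides.get(b, chr(b)))
    return "".join(chars)


for variant in ISO646_OVERRIDES:
    print(variant, decode_iso646(b"$100 \\100", variant))
```

The same seven-bit input prints with a different currency sign per variant, which is the whole difference a reader of such text would notice.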
The following standards cited by [RFC 1345] (and registered by [IANA REG])
JIS_C6220-1969-jp alias JIS_C6220-1969 alias iso-ir-13
JIS_C6226-1983 alias JIS_X0208-1983 alias iso-ir-87
JIS_X0212-1990 alias iso-ir-159
GB_2312-80 alias iso-ir-58
KS_C_5601-1987 alias KS_C_5601-1989 alias iso-ir-149 alias KSC_5601
are in a peculiar position. Because the ISO "coded character set" definition (all these are ISO standards, as the iso-ir-xyz alias shows) does not allow the creation of a pure CCS standard (Survey 2.1), a CES inevitably gets defined as well. This effect has been discussed in (Survey Appendix D).
So each of these standards additionally defines a 7-bit CES. We shall call these CES's "raw".
The natural role of these standards is to help build the CES's listed in (1.1).
Their "raw" usage is so rare that we may even suspect that it was not their creators' intention.
For example, JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that encodes
(ISO 6429) control characters at 0x00 - 0x1F
SPACE and DELETE at 0x20, 0x7F
katakana at 0x21 - 0x7E.
It is not clear if this CES is used at all, but it is probably not (see also section (9)).
JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987 have even less usable "raw" 7-bit CES's. These are double-byte 7-bit CES's that have neither control characters nor SPACE and DELETE.
Naturally, CES's that do not encode CR, LF and SPACE are only of limited use.
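To make the limitation concrete, here is a minimal sketch of such a "raw" double-byte 7-bit encoding for JIS X 0208. Python ships no codec for it, so the hypothetical helper below derives it from EUC-JP, which carries the same two bytes with the high bit set; the helper name and the sample string are assumptions for illustration.

```python
def to_raw_jis_x0208(text: str) -> bytes:
    """Encode text in the "raw" double-byte 7-bit form (hypothetical helper).

    Derived from EUC-JP by clearing the high bit of each byte. Anything
    EUC-JP encodes outside 0xA1..0xFE (ASCII, or its 0x8E/0x8F prefixed
    sets) has no place in this form and is rejected.
    """
    euc = text.encode("euc_jp")
    if any(b < 0xA1 or b > 0xFE for b in euc):
        raise ValueError("only JIS X 0208 characters are representable")
    return bytes(b & 0x7F for b in euc)


raw = to_raw_jis_x0208("日本")
print(raw.hex(" "))

# SPACE and LF cannot be expressed at all.
for ch in (" ", "\n"):
    try:
        to_raw_jis_x0208(ch)
    except ValueError as exc:
        print(repr(ch), "->", exc)
```

Every output byte falls in 0x21 to 0x7E, so a text with line breaks or ASCII spaces needs some outer scheme, which is the role the CES's in (1.1) play.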
It is likely however that these "raw" encodings are used internally by the X Window system.
One more confusing point about the mentioned standards is that, according to Jungshik Shin, GB 2312-80 really defines a "raw" encoding by enumerating characters by their hexadecimal code points.
The JIS and KS standards, on the other hand, enumerate characters in terms of decimal positions in character tables (see Survey Appendix B for a related discussion), thus being pure-CCS standards.
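The gap between the two styles of enumeration is only an offset, which may be why it is so easily overlooked. A minimal sketch, assuming the usual convention that row and cell numbers run from 1 to 94 and are placed at 0x21 to 0x7E; the helper names are hypothetical, and the sample position (row 38, cell 92 of JIS X 0208) is an example picked for this note.

```python
def kuten_to_raw(row: int, cell: int) -> bytes:
    """Map a decimal table position (row, cell) to the two-byte 7-bit code."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([0x20 + row, 0x20 + cell])


def raw_to_kuten(code: bytes) -> tuple:
    """Inverse mapping: two-byte 7-bit code back to (row, cell)."""
    return code[0] - 0x20, code[1] - 0x20


code = kuten_to_raw(38, 92)
print(code.hex())

# Cross-check against Python's EUC-JP codec, which sets the high bit.
print(bytes(b | 0x80 for b in code).decode("euc_jp"))
```

A standard that lists only (row, cell) pairs leaves the offset to someone else, whereas one that lists hexadecimal code points has already fixed it.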
It looks, however, as if [RFC 1345] in particular and the ISO 2022 ([ECMA 35]) approach in general have neglected this subtle difference, and now we have to classify all these basic standards as CCS-CES and have to speak of "raw" encodings for all of them.
Other 94x94 standards should be in the same position of implicitly defining mostly useless double-byte 7-bit CES's without SPACE, BACKSPACE, CR, LF and other control characters.
These probably include ISO-IR-165 and GB/T 12345-90.
CNS 11643-1992 defines 16 94x94 planes.
The first two planes have almost the same set of characters as Big5, but at different code points. It is used only inside EUC-TW. It does not define an implicit "raw" CES like those described in (1.3), because only each of its planes, not the standard as a whole, might be vulnerable to this "implicit" creation.
The same holds for JIS X 0213, which is also multi-plane, defining two character planes.
For completeness, here is a short reference that maps [IANA REG] names back to the original standard names. (It is not clear why this information is missing from [IANA REG].)
Multiple standard names are given if a standard has several names or if several versions of a standard are identical in their definition of the given CES. [IANA REG] aliases are marked with '='.
94 DUALs

[IANA REG]                                           original name(s)
JIS_C6220-1969-ro = ISO646-JP                        JIS C 6220-1969
JIS_C6220-1969-jp = katakana = JIS_C6220-1969        JIS C 6220-1969
GB_1988-80 = ISO646-CN                               GB 1988-80
KSC5636 = ISO646-KR                                  KS C 5636-1993, KS C 5636-1989

94x94 DUALs

JIS_C6226-1983 = JIS_X0208-1983                      JIS C 6226-1983, JIS X 0208:1983
JIS_X0212-1990                                       JIS X 0212:1990
GB_2312-80                                           GB 2312-80
KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601           KS C 5601-1987, KS C 5601-1992, KS X 1001:1997
Extensions/revisions of the mentioned 94x94 DUALs (2.1) not registered at [IANA REG]:
JIS X 0208:1990
JIS X 0208:1997
KS X 1001:1998 (euro and one other char added)
CJKV Information Processing, ISBN: 1-56592-224-7
CJK.INF Version 2.1
O'Reilly's GB 18030 Summary
FAQ
IANA registers charset values according to RFC 2278
"Character Set" Terminology Survey