Anton Tagunov http://tagunov.tripod.com tagunov@motor.ru
Moscow State University Scientific Computer Research Center http://srcc.msu.su
Apr 6 2002
"The Report of the IAB Character Set Workshop ... 1996" [RFC 2130] has introduced the terms.
Planted in these definitions (2) the current survey
ISO 2022 framework, (Appendix D) and (Companion 1.3)
The basis, the system of coordinates adopted by this survey, consists of two [RFC 2130] definitions: CCS and CES.
[RFC 2130]: Coded Character Set (CCS) is a mapping from a
set of abstract characters to a set of integers.
See also Appendix B.
[RFC 2130]: Character Encoding Scheme (CES) is a mapping
from a Coded Character Set or several coded character
sets to a set of octets.
[RFC 2130]: A definition of a character encoding scheme consists of:
<CCS, code value> or to the error state "illegal octet sequence"
CCS's registered by IANA or in text, of each CCS upon which this CES is based.
'Encoding' is used synonymously with 'CES'.
See also Appendix C.
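The CCS / CES split above can be illustrated with a minimal Python sketch. It assumes Unicode in the role of the CCS and Python's built-in codecs in the role of CES's; the variable names are chosen for this illustration only.

```python
# Illustration only: Unicode acts as the CCS, Python codecs act as CES's.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# CCS: abstract character -> integer (code value)
code_value = ord(ch)

# CES: code value -> octets; several CES's can be based on the same CCS
encodings = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-16-le")}

print(hex(code_value))
for name, octets in encodings.items():
    print(name, octets.hex())
    # Decoding with the same CES recovers the same code value
    assert ord(octets.decode(name)) == code_value
```

The integer stays the same in all three cases; only the octets differ, which is the distinction (2.1) and (2.2) draw.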
The 'CCS' acronym is used differently from its expanded form:
'CCS' seems to be used solely to mean (2.1)
'Coded Character Set' may also be used in the ISO meaning (3.1)
[ISO 8859-14], [ECMA 35] define:
'coded character set; code': A set of unambiguous rules that
establishes a character set and the one-to-one relationship
between the characters of the set and their bit combinations.
Equivalent to CCS (2.1) + CES (2.2).
See also Appendix D.
This section discusses MIME usage of the 'charset' term.
This document uses the term "charset" to mean a set of rules for
mapping from a sequence of octets to a sequence of characters, such
as the combination of a coded character set and a character encoding
scheme; this is also what is used as an identifier in MIME "charset="
parameters, and registered in the IANA charset registry ... (Note
that this is NOT a term used by other standards bodies, such as ISO).
Equivalent to CES (2.2), encoding
'charset' has been divorced from 'character set' to allow (2.1).
This [ISO 8859-14] excerpt shows that ASN.1 defines analogs of CCS and CES:
7.2 Identification according to ISO/IEC 8824-1 (ASN.1)
In the terminology of ISO/IEC 8824-1 the character
set of this part of ISO/IEC 8859 and the corresponding
coded representation are distinct, and are known as the
"character abstract syntax" and the "character transfer
syntax" respectively
'Character Set': A collection of elements used to represent textual information.
'Character Repertoire': The collection of characters included in a character set.
'Coded Character Set': A character set in which each character is assigned a numeric code point.
It is not clear if 'Character Set' is equivalent here to
'Character Repertoire' or to 'Coded Character Set'.
Or, if it is distinct from both, what the difference is.
Appendix E summarizes ambiguities we have discussed so far.
This section classifies a number of character set standards
(Appendix E) into pure-CCS, pure-CES and CCS-CES categories.
Standards are referred to by their [IANA REG] preferred MIME name
if applicable.
[Companion] goes into deeper details on CJK standards and
explains the reasons behind the proposed classification.
ISO-8859-* standards are dual: they specify both a CCS and a CES for it:
ISO/IEC 8859 consists of several parts. Each
part specifies a set of up to 191 graphic characters and the
coded representation of these characters by means of a single
8-bit byte [ISO 8859-14]
This duality is both a result of the ISO Coded Character Set
definition (3.1) and a natural property of
such straightforward schemes as US-ASCII and ISO-8859-1
which consist of a small set of characters and a simple one-to-one
mapping from single octets to single characters.
- from [RFC 2278]. These include
US-ASCII ISO-8859-* KOI8-*
JIS_C6220-1969-ro [RFC 1345] GB_1988-80 [RFC 1345] KSC5636 [RFC 1345]
and others. See also (Companion 1.2).
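The "simple one-to-one mapping from single octets to single characters" can be checked with a short Python sketch. It relies on two assumptions that are not part of the survey: that Python's "iso-8859-1" and "ascii" codecs stand in for the standards, and that the first 256 Unicode code points coincide with the ISO-8859-1 code values. Note that the Python codec also maps the control ranges, which an ISO 8859 part does not list among its graphic characters.

```python
# Every octet value decodes to the character whose code value equals the octet,
# so the CCS (character -> integer) and the CES (integer -> octet) collapse
# into one table for this kind of standard.
latin1_is_identity = all(
    ord(bytes([octet]).decode("iso-8859-1")) == octet for octet in range(256)
)

# US-ASCII behaves the same way on its 128 code values.
ascii_is_identity = all(
    ord(bytes([octet]).decode("ascii")) == octet for octet in range(128)
)

print(latin1_is_identity, ascii_is_identity)
```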
The following [IANA REG] registered MIME charsets (3.2)
are classified by this survey as pure-CES's:
UTF-8 UTF-16 UTF-16BE UTF-16LE
ISO-10646-UCS-2 ISO-10646-UCS-4
ISO-2022-KR
EUC-KR ISO-2022-JP ISO-2022-JP-1 ISO-2022-JP-2 Shift_JIS Windows-31J EUC-JP GB2312 (aka EUC-CN, CN-GB) EUC-TW
ISO-2022-CN-EXT
See also Appendix D and (Companion 1.1).
The only non-CJK pure-CCS known to the author is
Unicode 3.2
[COMPANION], (Companion 1.3) considers the following standards CCS
by nature, but CCS-CES due to the "raw" effect described in Appendix D:
JIS_C6220-1969-jp alias JIS_C6220-1969
JIS_C6226-1983 alias JIS_X0208-1983 JIS_X0212-1990 GB_2312-80 KS_C_5601-1987
See [RFC 1345]. Refer to [COMPANION] for details.
The author of this survey would very much like to receive as much feedback on this article as possible.
Please send me all kinds of comments on this survey: your opinion on its topicality, any factual mistakes, any statements that you find controversial!
Sincerely yours, "Anton Tagunov" <tagunov@motor.ru> :-)
Thanks to
Autrijus Tang, Jungshik Shin and other posters of perl-unicode@perl.org
for detailed discussions of character set standards,
Ken Lunde for his super-informative [CJK.INF],
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
for supplying the [ISO 8859-14] link.
Special thanks to
Dan Kogai for developing and maintaining the Perl Encode module and for putting me on the track with the character encoding issues!
To be continued :-)
It may be worth understanding the CCS definition in a special way:
CCS is a mapping from a set of abstract characters to a set of
integers, a set of integer pairs or a set of integer triplets
Pairs naturally arise from the row-column codes of tables used to present character glyphs, and triplets from arrays of tables.
These multi-dimensional indexes easily map to integers.
But this is often done differently: for 94-character CCS's we regularly use
0x41
Taking Ken Lunde's [CJK.INF] as an example of a document
discussing 94x94 CCS's we'll see two different notations:
06-85
0x21
: 0x6161
Similar variations should be possible with 94x94x94 CCS's.
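How a row-cell pair of a 94x94 CCS maps to a single integer, and how the same pair looks in paired hex notation, can be sketched as follows. The function names are hypothetical, and the 0x20 offset is an assumption taken from the usual ISO 2022 convention of placing 94-sets at octets 0x21..0x7E.

```python
ROWS = CELLS = 94  # dimensions of a 94x94 CCS


def pair_to_integer(row, cell):
    """Map a 1-based (row, cell) pair to a single 0-based integer."""
    if not (1 <= row <= ROWS and 1 <= cell <= CELLS):
        raise ValueError("row and cell must both be in 1..94")
    return (row - 1) * CELLS + (cell - 1)


def integer_to_pair(index):
    """Inverse of pair_to_integer."""
    if not 0 <= index < ROWS * CELLS:
        raise ValueError("index out of range for a 94x94 table")
    row, cell = divmod(index, CELLS)
    return row + 1, cell + 1


def pair_to_hex(row, cell):
    """Paired hex notation: each coordinate is offset by 0x20."""
    pair_to_integer(row, cell)  # used here only for its range check
    return ((row + 0x20) << 8) | (cell + 0x20)


print(pair_to_integer(16, 1), hex(pair_to_hex(16, 1)))  # 1410 0x3021
print(hex(pair_to_hex(65, 65)))                         # 0x6161
```

All three forms name the same table position; none of them says anything yet about octets on the wire.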
If we upgrade the CCS definition this way it will also be necessary
to note that the fact that a certain CCS standard enumerates
characters in paired hex notation (0x6161) should not yet mean
that this standard implies a "raw" CES (see Appendix D).
To mark a certain standard as CCS-CES it should be necessary to
explicitly define both the CCS's and CES's in accordance with
(2.2).
C1. [RFC 2130], [RFC 2278]:
Character Encoding Scheme is a mapping from a Coded Character Set (or several) to a set of octets.
C2. [RFC 2130]:
A definition of a character encoding scheme consists of:
<CCS, code value> or to the error state "illegal octet sequence"
CCS's registered by IANA or in text, of each CCS upon which this CES is based.
C3. [RFC 2278]:
The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters... unconditional and unambiguous conversion in the other direction is not required
A notable difference is that C2 allows several octet sequences to
map to a single <CCS, code value> sequence while C1 does not.
We may of course say that C1 is dominating and outlaw multiple octet
sequences, but then a "charset" according to the C3 definition is not
automatically a CES, which breaks our neat classification. So for now the
author prefers to silently set aside the C1 definition (and effectively
make C2 dominating).
In practice this issue is not that important because CES's try to
avoid associating several octet sequences with the same
<CCS, code value> sequence.
UTF-8 prescribes using the shortest possible byte sequence
to represent every Unicode code point, and calls every other
representation "malformed".
ISO-2022-* family members do not use ISO 2022's awesome
power to its full extent and thus rule out most possible
multiplicities.
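The UTF-8 shortest-form rule mentioned above can be observed with a minimal Python sketch. It assumes Python's built-in "utf-8" codec in its default strict mode; the helper name is hypothetical.

```python
shortest = b"\x2f"      # U+002F SOLIDUS in its shortest, one-octet form
overlong = b"\xc0\xaf"  # the same code value padded out to two octets


def try_decode(octets):
    """Return the decoded text, or None if the octets are malformed UTF-8."""
    try:
        return octets.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(try_decode(shortest))  # /
print(try_decode(overlong))  # None
```

Because the longer form is rejected, only one octet sequence maps to this <CCS, code value> pair.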
Here's an example of a multiplicity that does occur, however.
All the following sequences of octets produce the same
sequence of <CCS, code value> in ISO-2022-JP:
ESC $ B 0x50 0x50
ESC $ B ESC $ B 0x50 0x50
ESC $ B ESC ( B ESC $ B 0x50 0x50
...
Here ESC $ B means that the following octets should be interpreted
as pairs coding characters in the JIS X 0208-1983 coded character
set. ESC ( B denotes that the following octets should be interpreted
as the ASCII charset. The point is that redundant escape sequences
may be added quite freely. Of course it is easy to establish a
normalization transformation that will remove redundant escape
sequences, but ISO-2022-JP does not forbid them.
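The multiplicity described above can be tried out with Python's "iso2022_jp" codec. This is a sketch under assumptions: it presumes that this particular decoder tolerates redundant designations, and it closes every variant with ESC ( B so that the text ends in ASCII. The constant names are made up for the illustration.

```python
ESC = b"\x1b"
TO_JISX0208 = ESC + b"$B"  # ESC $ B
TO_ASCII = ESC + b"(B"     # ESC ( B
PAYLOAD = b"\x50\x50"      # one two-octet JIS X 0208 character

variants = [
    TO_JISX0208 + PAYLOAD,
    TO_JISX0208 + TO_JISX0208 + PAYLOAD,
    TO_JISX0208 + TO_ASCII + TO_JISX0208 + PAYLOAD,
]

# Decode each variant; the trailing ESC ( B returns the stream to ASCII.
decoded = [(v + TO_ASCII).decode("iso2022_jp") for v in variants]
all_equal = len(set(decoded)) == 1

# Re-encoding plays the role of the normalization transformation:
# the encoder emits each escape sequence only where it is needed.
normalized = decoded[0].encode("iso2022_jp")

print(all_equal, normalized == variants[0] + TO_ASCII)
```

Decoding followed by re-encoding is one way to get the normalization mentioned above; a byte-level filter that drops repeated designations would do the same job.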
Hence the C1 definition should probably be silently dropped in
favour of C2.
The ISO character set framework has one fundamental concept:
ISO Coded Character Set (3.1).
This appendix discusses the benefits of a "Character Set
Calculus" based on two concepts - CCS (2.1) and CES (2.2).
As a result of the ISO "coded character set" definition (3.1)
CES's while their sole intent is to define CCS's. See (4.4), [COMPANION],
(Companion 1.3)
ISO 2022 [ECMA 35] are considered to create a new CCS-CES from a set of
pre-existing CCS-CES's. This includes creating a new CCS from
a set of pre-existing CCS's.
(2.1) - (2.2) on the other hand allow
CES's (UTF-8, UTF-16BE, ...) encode the same CCS (ISO 10646), not reincarnate it
CCS (JIS X 0208) encoded by several CES's (ISO-2022-JP, EUC-JP) be treated as the same CCS
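The claim that one CCS (JIS X 0208) underlies several CES's can be illustrated with a small Python sketch. It assumes Python's "euc_jp" and "iso2022_jp" codecs as stand-ins for the two CES's; the helper functions are hypothetical and handle only a single two-octet character.

```python
ch = "\u4e9c"  # a kanji located at row 16, cell 1 of JIS X 0208

euc = ch.encode("euc_jp")      # two octets with the high bit set
iso = ch.encode("iso2022_jp")  # ESC $ B, two 7-bit octets, ESC ( B


def code_value_from_euc(octets):
    """Strip the high bit from each octet of a two-octet EUC-JP character."""
    hi, lo = octets
    return ((hi & 0x7F) << 8) | (lo & 0x7F)


def code_value_from_iso2022(octets):
    """Read the two payload octets that follow the ESC $ B designation."""
    if not octets.startswith(b"\x1b$B"):
        raise ValueError("expected a JIS X 0208 designation")
    payload = octets[3:5]
    return (payload[0] << 8) | payload[1]


print(euc.hex(), iso)
print(hex(code_value_from_euc(euc)), hex(code_value_from_iso2022(iso)))
```

Both CES's carry the same code value; only the octets wrapped around it differ.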
Look how [RFC 2130] mandates clear CCS, CES separation for
future charset registrations:
Most 'charsets' have a well defined CCS and CES,
they should ... be teased apart for the registration.
Currently this has been done for two entries in the [IANA REG]:
Name: Shift_JIS Source: ... The CCS's are JIS X0201:1997 and JIS X0208:1997
Name: Windows-31J Source: ... NEC special characters ... NEC selection of IBM extensions ... and IBM extensions ... The CCS's are JIS X0201:1997, JIS X0208:1997, and these extensions.
It totally depends on the point of view whether to think, for example,
that EUC-JP defines a new CCS or reuses ASCII and JIS X 0208.
The ISO (3.1) point of view is that EUC-JP defines a new CCS
('character set' in its wording).
The [RFC 2130] (2.2) point of view is that 'EUC-JP' encodes
two pre-existing CCS's:
The author's opinion is that it is optimal to keep to the minimum
the number of entities involved in our "Character Set Calculus".
We cannot reduce the number of encodings used in the world, but we
can reduce the number of "secondary supplements" such as CCS's :)
That's why the CCS, CES framework (2) has been chosen as a basis
for, and is advocated by, this survey.
The term 'Character Set' means many things to many people.
Truly so: both 'Coded Character Set' and 'Character Set' may mean
ISO coded character set (3.1) = CCS-CES (2.1), (2.2)
MIME charset (3.2) = encoding = CES (2.2)
CCS (2.1)
(3.1) and (3.2) are very close. Rigorously (3.1) is a subset of (3.2):
Every ISO 'coded character set' is a CES.
'Coded Character Set' means 'ISO coded character set' in
ISO standards, for example [ISO 8859-14], [ECMA 35].
'Coded Character Set', 'Character Set' mean 'encoding' in
all RFC's prior to [RFC 2130], for example [RFC 1345] (the
second definition), [RFC 2045].
'Coded Character Set' means CCS in [RFC 2130], [RFC 2278].
'Character Set' means CCS in the ISO standards, [ISO 8859-14]
for example, and probably in the [UNICODE GLOSSARY].
The latter is so short on the subject that in it
'Character Set' may also mean
However it is the high degree of 'character set' term overload
that allows us to say 'Character set standards' meaning the
whole body of diverse standards :-)
IANA Charset Registration Procedures.
IETF Policy on Character Sets and Languages
IAB Character Set Workshop held
ISO standards are not easy to find online. This one is a
ECMA-35 6th Edition. December 1994.
ISO 2022)
IANA registers charset values according to RFC 2278)
RE: modification to registration of charset ks_c_5601-1987
CJK CHARACTER SET STANDARDS CLASSIFICATION