Anton Tagunov http://tagunov.tripod.com tagunov@motor.ru |
Moscow State University Scientific Computer Research Center http://srcc.msu.su |
Apr 4 2002 |
Binary representation of textual data is ruled by character set standards.
Globally unique names (like KS_C_5601-1987) refer to specifications.
The goal of this survey is to classify named specifications
(JIS_C6220-1969-jp, JIS_C6220-1969-ro, JIS_X0208-1983, etc.).
To achieve this the current character set related terminology is surveyed (3) and abridged to a three-category classification (2.2).
The categories of the (2.2) classification are however overlapping, and section (5) advocates adoption of a new classification (2.1). The (2.1) classification differs from (2.2) only in that it has disjoint categories.
Sections (6), (7), (8) practice this classification on a large body of standards.
Section (8) 'CLASSIFICATION EXAMPLE. CJK STANDARDS' has
unintentionally become a brief reference on CJK standards.
Its (8.3) subsection is notably the point from which the whole
survey has grown: it emphasises a certain confusion happening
around the most basic CJK standards. Please don't miss 8.3 :-)
Section (9) reflects the author's great desire to collect as much feedback as possible (tagunov@motor.ru).
Subsections (9.2), (9.3) also express a desire to fill any gaps and fix any discrepancies that the readers are likely to notice, and point out the areas where readers' help is desirable. Let us consider this version 0.1 of the survey! ;-)
Section (4) discusses how [RFC 2130] undermines "coded character set" industry terminology and what are the best ways to get around that.
Appendixes B, C, D contain miscellaneous discussions excluded from the main body of the survey.
Have a good reading! :-)
This is the classification proposed by the current survey. It is derived from the classification introduced in (2.2) by making the categories disjoint.
See section (5) for an explanation of why this survey finds this classification most fruitful and advocates it over the (2.2) one.
DUAL specifications that define a CCS and a CES (using this CCS and possibly other CCS's defined elsewhere to define the CES)
CES-ONLY specifications that define a CES using CCS's defined elsewhere (and not specifying any CCS's of their own)
CCS-ONLY specifications that define a CCS (and no CES's)
This classification is complete and disjoint: every character set specification fits into one of these categories, but none fits into two.
(See subsections (3.2), (3.3) for definitions of CES and CCS.)
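The three categories can be read as a decision on two yes/no questions: does the specification define a CCS of its own, and does it define a CES? The following is a minimal sketch of that reading; the function name and its boolean parameters are illustrative assumptions, not part of any standard.

```python
def classify(defines_own_ccs: bool, defines_ces: bool) -> str:
    """Return the (2.1) category of a character set specification.

    Both parameters are hypothetical inputs: whether the specification
    defines a CCS of its own, and whether it defines a CES.
    """
    if defines_own_ccs and defines_ces:
        return "DUAL"
    if defines_ces:
        return "CES-ONLY"
    if defines_own_ccs:
        return "CCS-ONLY"
    # Defines neither: not a character set specification in this sense.
    raise ValueError("specification defines neither a CCS nor a CES")


categories = [classify(True, True), classify(False, True), classify(True, False)]
print(categories)
```

Because the branches are mutually exclusive, no specification can land in two categories, which is the disjointness property claimed above.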
This is the classification derived in section (3) from
definitions found in ISO and RFC standards. These standards
operate with notions such as
"character set", "coded character set", "charset", "character encoding scheme".
The whole diversity of definitions given to these terms reduces to 3 categories:
CCS      Coded Character Set after [RFC 2130]
CES      Character Encoding Scheme after [RFC 2130]
ISO-CCS  Coded Character Set after ISO, cited by [RFC 1345]
This classification is complete but not disjoint. Here is how it relates to the (2.1) classification:
CES     = DUAL + CES-ONLY
CCS     = DUAL + CCS-ONLY
ISO-CCS = DUAL
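The relations between the two classifications can be checked mechanically by treating each (2.2) category as a set of (2.1) categories. This is only an illustrative sketch; the variable names are assumptions made for the example.

```python
# The three disjoint categories of the (2.1) classification.
DUAL, CES_ONLY, CCS_ONLY = "DUAL", "CES-ONLY", "CCS-ONLY"

# Each (2.2) category expressed as a union of (2.1) categories.
CES = {DUAL, CES_ONLY}
CCS = {DUAL, CCS_ONLY}
ISO_CCS = {DUAL}

# The (2.2) categories overlap: DUAL belongs to all three of them.
overlap = CES & CCS
# Together CES and CCS still cover every (2.1) category.
covered = CES | CCS

print(overlap == ISO_CCS)
print(covered == {DUAL, CES_ONLY, CCS_ONLY})
```

Seen this way, ISO-CCS is exactly the overlap of CES and CCS, which is why (2.2) is complete but not disjoint.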
The term "character" raises little disagreement. Both [RFC 2278] and [ECMA 35] define it as
a member of a set of elements used for the organization, control, or representation of data.
This is close to the definition in [UNICODE] and in other standards.
[RFC 2130], [RFC 2278] formulate:
Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers
To avoid ambiguity we shall refer to this definition as 'CCS',
because this abbreviation seems not to be used in any other meaning.
Examples (after these RFCs): ISO 10646, US-ASCII, ISO-8859-*.
Please refer to Appendix B for a remark on the CCS definition.
Specifications that match this definition are classified as CCS in (2.2).
[RFC 2130] and [RFC 2278] formulate:
Character Encoding Scheme (CES) is a mapping from a Coded
Character Set or several coded character sets to a set of
octets.
[RFC 2130] also contains another passage that we shall use as a CES
definition instead:
A definition of a character encoding scheme consists of:
... <CCS, code value> or to the error state "illegal octet sequence"
... CCS's registered by IANA or in text, of each CCS upon which this CES is based.
Please refer to Appendix C for a moderate discrepancy between these definitions.
Specifications that match these definitions are classified as CES in (2.2).
Following the latter definition we find two flavours of CES specifications:
Those that use CCS's defined only elsewhere, CES-ONLY in (2.1).
And those that contain "specifications .. in text of ... CCS
upon which this CES is based". It is reasonable to assume that
every specification defines no more than one CCS on its own.
Hence the definition of category DUAL in the (2.1) classification.
The ISO-8859-* specifications seem both to define their own
94-character CCS and to reference another CCS (ASCII). This is
what (2.1)'s DUAL description means by
"using this CCS and possibly other CCS's defined elsewhere."
[RFC 1345] cites the ISO definition of a "coded character set":
The ISO definition of the term "coded character set" is as
follows: "A set of unambiguous rules that establishes a
character set and the one-to-one relationship between the
characters of the set and their coded representation."
As you can see a specification defines an ISO 'coded character set'
if and only if it defines both a CCS and a CES for it. This is
exactly category DUAL in (2.1). Hence category ISO-CCS in (2.2)
is identical to DUAL in (2.1).
[RFC 1345] also has its own definition of a 'coded character set':
"A coded character set is a set of rules that unambiguously and completely determines which sequence of characters, if any, is represented by each possible sequence of ... bytes"
[RFC 2045] defines the term 'character set':
The term "character set" is used in MIME to refer to a method
of converting a sequence of octets into a sequence of
characters ...
from simple ... mappings such as US-ASCII to complex table
switching methods such as those that use ISO 2022's techniques.
Even the newest [RFC 2278] that gives a CES definition parallels it with:
The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters ...
All these definitions are essentially equivalent to the CES definition (3.3).
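The "method of converting a sequence of octets into a sequence of characters" reading can be illustrated with Python's built-in codecs, whose names largely follow the charset registry. This is a small sketch, assuming Python's "koi8-r", "iso-8859-1" and "us-ascii" codecs behave as the charsets of the same names; the helper name is made up for the example.

```python
def decode_or_error(octets: bytes, charset: str) -> str:
    """Apply a charset to octets; report the error state as a string."""
    try:
        return octets.decode(charset)
    except UnicodeDecodeError:
        return "illegal octet sequence"


octets = bytes([0xC1])

# The same octet yields different characters under different charsets.
as_koi8 = decode_or_error(octets, "koi8-r")        # CYRILLIC SMALL LETTER A
as_latin1 = decode_or_error(octets, "iso-8859-1")  # LATIN CAPITAL LETTER A WITH ACUTE
as_ascii = decode_or_error(octets, "us-ascii")     # 0xC1 is outside 7-bit ASCII

print(as_koi8, as_latin1, as_ascii)
```

The octet sequence alone carries no meaning; only the pair of octets and charset determines the characters.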
The introduction of CES/CCS terminology by [RFC 2130]
helps a lot in understanding the character set world.
But it has also caused a certain ambiguity:
'Coded Character Set' has been used for years in the
ISO meaning of DUAL (2.1), and the introduction of a CCS
with a different meaning forces a terminology revision.
This section evaluates old and new terms one by one.
Totally unambiguous and fine.
Certainly usage of the 'CCS' 3-letter abbreviation in a meaning
different from the full form 'coded character set'
is artificial. However this is probably the best we can do now.
The full 'coded character set' term is currently in a
conflict between the older ISO definition and the newer one,
equivalent to CCS. Probably every speaker who chooses
to use it should first clarify which meaning he/she adheres to.
The term 'character set' is quite ambiguous. It may be accepted as
'coded character set' (in ISO meaning), equivalent of CES
'coded character set' (in [RFC 2130] meaning), equivalent of CCS
'charset' parameter of Content-Type MIME header, also probably an equivalent of CES (4.5)
This survey's point of view is that it is best to use this
term only inside the 'character set standards' word-combination,
meaning the whole body of heterogeneous standards.
Many standards (including MIME and HTTP ones) treat 'charset'
as a shorthand for 'character set' or 'coded character set'.
Both of these have become ambiguous.
But over the years 'charset' has been used to mean "what you may pass
as the 'charset' parameter of the Content-Type MIME or HTTP header".
In this meaning 'charset' is equivalent to CES.
Recognising this common usage [RFC 2278] has proposed to detach
'charset' from any other terms and use it on its own, effectively
as a synonym for CES. [RFC 2278]:
The term "charset" (see historical note below) is used here to refer to a method of converting a sequence of octets into a sequence of characters.
A related discussion on the MIME charset parameter and [RFC 2130]
may also be found in Appendix D.
The author of this article will probably continue to use this term, in exactly the same meaning as before.
A very general term and probably fine as it is. The terminology revision has not touched it.
May mean different things depending on context. In certain
circumstances it may be equivalent to CES,
in others to a CES combined with 'Transport Encoding Scheme' [RFC 2130].
And it is not necessarily related to binary representation of textual data.
This section will show that the current [RFC 2130] CES
and CCS definitions may not be narrow enough for certain purposes.
If a specification is classified as a CES,
it is not clear if that specification also defines a CCS.
If a specification is classified as a CCS,
it is not clear if that specification also defines a CES.
A CES specification is explicitly allowed to include or not include
a "private" CCS specification (3.3.2).
It's perfectly okay per se, but this survey calls for establishing
different family names for these two categories of CES specifications.
This survey's opinion is that adoption of terminology based on the
DUAL, CES-ONLY, CCS-ONLY classification (2.1) would allow
establishing a better taxonomy for character set standards.
As an epigraph, [RFC 2130]:
... the MIME registry of character sets ... contains items
that may differ greatly in their applicability and semantics
in various Internet protocols.
[RFC 2278] says:
MIME ... and various other modern Internet protocols are
capable of using many different charsets. This registration
procedure exists ... to associate a specific name or names
with a given charset and to give an indication of whether
or not a given charset can be used in MIME text objects.
[IANA REG] is the registry established by this RFC.
[IANA REG] contains records both for DUAL (2.1) and CES-ONLY (2.1) standards.
[IANA REG] does not contain records for CCS-ONLY (2.1) standards.
The probable reason is that the ISO "coded character set"
definition (3.4) does not allow creating a CCS-ONLY (2.1) standard,
but only a DUAL (2.1) one.
This explains why many of the standards (including those registered
at [IANA REG]) are DUAL (2.1) only formally, and function as
CCS-ONLY (2.1).
Naturally, the entries labeled MIME applicable (that have a
"preferred MIME name") are full-weight CES's and are either
CES-ONLY's or DUALs both de facto and de jure. The above remark
is applicable to a subset of entries in the MIME registry that
do not have a "preferred MIME name".
The following sections (7) and (8) classify a number of standards, including those registered in [IANA REG].
Section (8.3) is specially devoted to standards
that are DUAL by their form but CCS-ONLY by their nature.
This section classifies some of the non-CJK and Unicode-related
standards against the DUAL, CCS-ONLY, CES-ONLY classification (2.1).
Standards are referred to by their [IANA REG] preferred MIME name if they
have one.
DUAL: US-ASCII, ISO-8859-*, KOI8-R
CES-ONLY: UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4
CCS-ONLY: Unicode 3.2
Classification of CJK standards used to be a great source of
confusion, at least for the author of this survey (and was
the original incentive to write this survey).
Note: this section references character specifications by [IANA REG] names if such names exist. For completeness (8.5) relates [IANA REG] names back to the names of official standards that contain the specifications.
Specifications like
ISO-2022-KR (reportedly unused)
EUC-KR
ISO-2022-JP
ISO-2022-JP-1
ISO-2022-JP-2
Shift_JIS (as defined by Appendix 1 of JIS X 0208:1997)
EUC-JP
GB2312 (8-bit EUC-style CES)
EUC-TW
ISO-2022-CN-EXT
easily classify as CES-ONLY (2.1).
JIS_X0201 [RFC 1345] is an 8-bit composition of two 7-bit DUALs (2.1):
JIS_C6220-1969-ro (ISO646-JP)
JIS_C6220-1969-jp (katakana)
This standard defines its own CCS and a CES, hence is classified as DUAL.
Microsoft extension to Big5.
Erroneously referred to as Big5 by Microsoft products.
Adds characters to Big5 thus defining.
Microsoft extension to Shift_JIS.
Erroneously(?) referred to as Shift_JIS by Microsoft products.
The official Shift_JIS includes only the JIS X 0201 and JIS X 0208
repertoire, while Microsoft has always meant Shift_JIS to
encode a wider repertoire.
As a historical predecessor Microsoft's variant
probably has more rights to the name, albeit it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.
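The difference in repertoire can be observed with Python's codecs. The sketch below assumes that Python's "shift_jis" codec follows the JIS X 0201 plus JIS X 0208 repertoire and that its "cp932" codec corresponds to the Microsoft variant discussed here; both codec names are Python's, not names taken from this survey.

```python
# CIRCLED DIGIT ONE: encodable in Microsoft's variant, absent from JIS X 0208.
ch = "\u2460"

cp932_octets = ch.encode("cp932")

try:
    ch.encode("shift_jis")
    in_official_repertoire = True
except UnicodeEncodeError:
    # The character lies outside the JIS X 0201 + JIS X 0208 repertoire.
    in_official_repertoire = False

print(cp932_octets.hex(), in_official_repertoire)
```

Text produced under the wider repertoire and labeled with the official name may therefore fail to decode with a strict implementation.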
Microsoft extension to GB2312.
Erroneously referred to as GB2312 by Microsoft products.
Backward compatible with GB2312.
Extends GB2312 with unified Han characters from ISO 10646-1:1993
not already present in GB_2312-80.
New characters are inserted into unused positions in the EUC-CN natural
0x8181 - 0xFEFE range and into the newly allocated 0x8140-0xFE7E range.
Microsoft extension to EUC-KR.
Also known as Unified Hangul Code, UHC.
Erroneously referred to as KS_C_5601-1987 by Microsoft products.
Adds 8822 pre-combined Hangul syllables to EUC-KR.
Uses the same extension technique as GBK.
China national standard.
Extends GBK.
Introduces 4 byte codings for characters.
Provides space for all assigned and unassigned Unicode 3.2
(BMP plus 16 extension planes) code points.
See also [DUERST].
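The mix of one-, two- and four-byte codings can be seen with Python's "gb18030" codec. This is an illustrative sketch, assuming that codec implements the standard described here; the sample characters are chosen for the example only.

```python
samples = {
    "ascii letter": "A",                    # single byte, as in ASCII
    "GB 2312 hanzi": "\u4e2d",             # two bytes, same as in GB2312/GBK
    "supplementary plane": "\U00010000",   # four bytes, beyond the BMP
}

# Encode each sample and record the octets produced.
encoded = {name: ch.encode("gb18030") for name, ch in samples.items()}

for name, octets in encoded.items():
    print(name, len(octets), octets.hex())
```

The two-byte result coincides with the GB2312 coding of the same character, which reflects the backward compatibility chain GB2312, GBK, and this standard.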
Single byte
JIS_C6220-1969-ro alias ISO646-JP
GB_1988-80 alias ISO646-CN
KSC5636 alias ISO646-KR
all serve as "raw material" to CES-ONLY's listed in (8.1), by
defining 94-character CCS's, and also define ASCII-like 7-bit
CES's. The difference from ASCII is Yen, Yuan, Won symbols
replacing "$" (0x24) or "\" (0x5C), and other minor changes,
as is natural for the ISO646-* family.
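Python has no built-in codecs for these national ISO 646 variants, so the currency replacements can be sketched as small override tables on top of ASCII. The table below is an assumption-laden illustration: it lists only the currency symbol positions mentioned above and deliberately omits the "other minor changes"; the function and table names are made up.

```python
# Currency-symbol overrides only; other differences from ASCII are omitted.
CURRENCY_OVERRIDES = {
    "ISO646-JP": {0x5C: "\u00a5"},  # YEN SIGN replaces backslash
    "ISO646-CN": {0x24: "\u00a5"},  # Yuan (same glyph as Yen) replaces dollar
    "ISO646-KR": {0x5C: "\u20a9"},  # WON SIGN replaces backslash
}


def decode_iso646(octets: bytes, variant: str) -> str:
    """Decode 7-bit octets as ASCII with the variant's overrides applied."""
    overrides = CURRENCY_OVERRIDES[variant]
    chars = []
    for octet in octets:
        if octet > 0x7F:
            raise ValueError("illegal octet sequence")
        chars.append(overrides.get(octet, chr(octet)))
    return "".join(chars)


price = bytes([0x5C, 0x31, 0x30, 0x30])  # backslash, '1', '0', '0' in ASCII
print(decode_iso646(price, "ISO646-JP"))
print(decode_iso646(price, "ISO646-KR"))
```

The same four octets read as a backslash path fragment in ASCII and as a price in the Japanese or Korean variant.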
The following standards cited by [RFC 1345] (and registered by [IANA REG])
JIS_C6220-1969-jp alias JIS_C6220-1969 alias iso-ir-13
JIS_C6226-1983 alias JIS_X0208-1983 alias iso-ir-87
JIS_X0212-1990 alias iso-ir-159
GB_2312-80 alias iso-ir-58
KS_C_5601-1987 alias KS_C_5601-1989 alias iso-ir-149 alias KSC_5601
are in a peculiar position, because the ISO "coded character set"
definition (all these are ISO standards as the iso-ir-xyz alias
shows) does not allow creating a CCS-ONLY standard (2.1), but
only a DUAL (2.1) one. So each of these standards additionally defines
a 7-bit CES. We shall call this CES "raw" or "implied".
The natural ("cooked") role of these standards is to help build the
CES's listed in (8.1). (Hence we will call this CES "raw".)
The "raw" CES usage is so rare that we may suspect that it
was not even the standards creators' intention. (Hence we will
also call this CES "implied".)
For example JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that encodes
(... ISO 6429) control characters at 0x00 - 0x1F
SPACE and DELETE at 0x20, 0x7F
the characters of its own CCS at 0x21 - 0x7E.
It is not clear if this CES is used at all, but it is probably not
(see also section (9)).
JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987
have even less usable "raw" 7-bit CES's. These CES's are double-byte
7-bit CES's that have neither control characters nor SPACE and DELETE.
Naturally CES's that do not encode CR, LF and SPACE are only of
limited use.
They are known however to be used with fonts in the X Window System.
Other 94x94 standards should be in the same position
of implicitly defining mostly useless multibyte 7-bit CES's
without SPACE, BACKSPACE, CR, LF and other control characters.
These probably include
ISO-IR-165
GB/T 12345-90
CNS 11643-1992 defines 16 94x94 planes.
The first two planes have almost the same set of characters as Big5,
but at different code points. It is used only inside EUC-TW. It does
not define an implicit "raw" CES like those described in 8.3,
because only each of its planes might be vulnerable to this
"implicit" creation, not the standard as a whole.
See also (9.3).
This material probably does not belong here, but since
section (8) has already become a CJK standard listing,
for completeness here's an [IANA REG] name to original standard
name reference. (It is not clear why this info is missing from
[IANA REG].) Multiple standard names are given if a standard has
several names or several standard versions are identical in their
definition of the given CES. [IANA REG] aliases are marked with '='.
94 DUALs
[IANA REG]                                      original name(s)
JIS_C6220-1969-ro = ISO646-JP,                  JIS C 6220-1969
JIS_C6220-1969-jp = katakana = JIS_C6220-1969
GB_1988-80 = ISO646-CN                          GB 1988-80
KSC5636 = ISO646-KR                             KS C 5636-1993, KS C 5636-1989
94x94 DUALs
JIS_C6226-1983 = JIS_X0208-1983                 JIS C 6226-1983, JIS X 0208:1983
JIS_X0212-1990                                  JIS X 0212:1990
GB_2312-80                                      GB 2312-80
KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601      KS C 5601-1987, KS C 5601-1992, KS X 1001:1997
Extensions/revisions of the mentioned 94x94 DUALs (2.1) not
registered at [IANA REG]:
JIS X 0208:1990
JIS X 0208:1997
KS X 1001:1998 (euro and one other char added)
The author of this survey would very much like to receive as much feedback on this article as possible.
Please send me all kinds of comments on this survey, your opinion on its topicality, all factual mistakes, any statements that you find controversial!
Your corrections will be incorporated into this document
as soon as possible, or, if the author considers them
arguable, will probably form 'Appendix F. SPECIAL OPINIONS'.
The author is limited to surveying documents freely available online.
Specifically he has no access to ISO standards, or their
analogs, except for [ECMA 35].
The main sources of information on CJK have been [RFC 1345],
[CJK.INF] and most helpful replies from Autrijus Tang and
Jungshik Shin on perl-unicode@perl.org.
Therefore contributions from readers who have access to documents that the author of this survey does not have access to may significantly improve it.
Issue #1.
Section (3.4) 'ISO CCS' would benefit from a direct reference
to some ISO standard that gives the ISO definition of "coded
character set".
Issue #2.
Section (3.5) 'OTHER DEFINITIONS REDUCING TO CES DEFINITION'
would benefit from listing more definitions of "character set"
and related identities.
Issue #3.
If for any standard (especially CJK) its most official
name (given by its original registration body) is not
listed or is misspelled (up to ':' vs '-' differences
and wrong number of spaces) it would be highly appropriate
to correct this. Please do tell me.
Issue #4.
The author of this document will highly welcome
inclusion of other CJK standards into the classification,
section (8).
Also, if you feel there are strong reasons for inclusion
of non-CJK standards into section (7), please do tell me.
Issue #5.
Is the "raw" 7-bit "implied" CES for JIS_C6220-1969-jp
used for any purpose? Has section (8.3) been wrong
in saying it isn't?
Issue #6.
How correct is section (8.4), which lists
CNS 11643-1992 as the only CCS-ONLY standard in the CJK world?
Issue #7.
What CES's are used with CCCII?
Does the CCCII standard specify a CES?
If yes, how is it called?
What is the full name of the standard?
Issue #8.
What CES's are used for ANSI Z39.64-1989?
Does the ANSI Z39.64-1989 standard specify a CES?
If yes, how is it called?
Help will be highly welcome!
Thanks to
Autrijus Tang
Jungshik Shin
and other posters of perl-unicode@perl.org for
detailed discussions of CJK standards!
Ken Lunde
for his super-informative [CJK.INF]
And special thanks to
Dan Kogai
for developing and maintaining the Perl Encode module that has brought me to the character encoding issues.
To be continued :-)
It may be worthwhile to understand the CCS definition in a special way:
CCS is a mapping from a set of abstract characters to a set of
integers, a set of integer pairs or a set of integer triplets
Pairs naturally arise from row-column codes of tables used to present character glyphs, and triplets from arrays of tables.
These multi-dimensional indexes easily map to integers.
But this is often done differently: for 94-character CCS's
we regularly use
0x41
Taking Ken Lunde's [CJK.INF] as an example of a document
discussing 94x94 CCS's we'll see two different notations:
06-85
0x21
: 0x6161
Similar variations should be possible with 94x94x94 CCS's.
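The relation between a row-cell pair, a single integer and a two-octet 7-bit code can be sketched as below. The sample character (row 16, cell 1 of JIS X 0208) and the helper names are assumptions chosen for the illustration; the final check relies on Python's "euc_jp" codec.

```python
def linear_index(row: int, cell: int) -> int:
    """Map a 1-based (row, cell) pair of a 94x94 table to one integer."""
    return (row - 1) * 94 + (cell - 1)


def row_cell_to_7bit(row: int, cell: int) -> bytes:
    """Map a (row, cell) pair to two octets in the 0x21-0x7E range."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def to_euc(code_7bit: bytes) -> bytes:
    """Set the high bit of each octet, as EUC-style CES's do."""
    return bytes(octet | 0x80 for octet in code_7bit)


jis = row_cell_to_7bit(16, 1)   # hexadecimal notation 0x3021
euc = to_euc(jis)               # the same code point inside EUC-JP
print(linear_index(16, 1), jis.hex(), euc.hex(), euc.decode("euc_jp"))
```

The pair, the linear index and the hexadecimal form all identify the same element of the CCS; only the notation differs.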
C1. [RFC 2130], [RFC 2278]:
Character Encoding Scheme is a mapping from a Coded Character Set (or several) to a set of octets.
C2. [RFC 2130]:
A definition of a character encoding scheme consists of:
... <CCS, code value> or to the error state "illegal octet sequence"
... CCS's registered by IANA or in text, of each CCS upon which this CES is based.
C3. [RFC 2278]:
The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters... unconditional and unambiguous conversion in the other direction is not required
A notable difference is that C2 allows several octet sequences to
map to a single <CCS, code value> sequence while C1 does not.
We may of course say that C1 is dominating and outlaw multiple octet
sequences, but then a "charset" according to the C3 definition is not
automatically a CES, which breaks our neat classification. So the
author prefers to silently drop the C1 definition (and effectively
make C2 dominating).
In practice this issue is not that important because CES's try to
avoid associating several octet sequences with the same
<CCS, code value> sequence.
UTF-8 prescribes using the shortest possible byte sequence
to represent every Unicode code point, and calls every other
representation "malformed".
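The shortest-form rule can be observed with Python's strict "utf-8" decoder, which rejects overlong sequences. The helper name below is made up for this sketch.

```python
def is_well_formed_utf8(octets: bytes) -> bool:
    """Return True if the octets decode under Python's strict UTF-8 codec."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = b"\x2f"      # '/' in its one-octet form
overlong = b"\xc0\xaf"  # a two-octet pattern that would also yield U+002F

print(is_well_formed_utf8(shortest), is_well_formed_utf8(overlong))
```

Since the overlong form is rejected, each code point has exactly one accepted octet sequence, so no multiplicity arises here.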
ISO-2022-* family members do not use ISO 2022's awesome
power to its full extent and thus rule out most possible
multiplicities.
Here's an example of a multiplicity that does occur, however.
All the following sequences of octets produce the same
sequence of <CCS, code value> in ISO-2022-JP:
ESC $ B 0x50 0x50
ESC $ B ESC $ B 0x50 0x50
ESC $ B ESC ( B ESC $ B 0x50 0x50
...
Here ESC $ B means that the following octets should be interpreted
as pairs coding characters in the JIS X 0208-1983 coded character
set. ESC ( B denotes that the following octets should be interpreted
as ASCII. The point is that redundant escape sequences
may be added quite freely. Of course it is easy to establish a
normalization transformation that will remove redundant escape
sequences, but ISO-2022-JP does not forbid them.
Hence the C1 definition should probably be silently dropped in
favour of C2.
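A normalization of the kind mentioned above can be sketched as follows. This is only an illustration under stated assumptions: it recognises the four three-octet designations of ISO-2022-JP, treats everything else as data, and drops any designation that is not followed by data; the function name is made up.

```python
# Designation escape sequences of ISO-2022-JP (three octets each).
DESIGNATIONS = {
    b"\x1b(B",  # ASCII
    b"\x1b(J",  # JIS X 0201 Roman
    b"\x1b$@",  # JIS X 0208-1978
    b"\x1b$B",  # JIS X 0208-1983
}
ASCII = b"\x1b(B"


def normalize(octets: bytes) -> bytes:
    """Remove escape sequences that do not change the active character set."""
    out = bytearray()
    emitted = ASCII  # ISO-2022-JP text starts in ASCII
    pending = ASCII  # last designation seen, possibly not yet written out
    i = 0
    while i < len(octets):
        if octets[i] == 0x1B:
            seq = octets[i:i + 3]
            if seq not in DESIGNATIONS:
                raise ValueError("unknown escape sequence")
            pending = seq
            i += 3
        else:
            # Write a designation only when data actually needs it.
            if pending != emitted:
                out += pending
                emitted = pending
            out.append(octets[i])
            i += 1
    return bytes(out)


variants = [
    b"\x1b$B\x50\x50",
    b"\x1b$B\x1b$B\x50\x50",
    b"\x1b$B\x1b(B\x1b$B\x50\x50",
]
normalized = {normalize(v) for v in variants}
print(len(normalized))
```

All three variants collapse to the first one, which shows both that the multiplicity exists and that it is easy to remove after the fact.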
It was [RFC 2130] that introduced the CES (3.3) definition.
It is all the funnier to see how [RFC 2130] itself does not
use it to full power and goes tautological:
... in MIME, the Coded Character Set and Character Encoding
Scheme are specified by the Charset parameter to the
Content-Type header field ...
Every CES is already associated with a set of CCS's (3.3).
Rigorously, it would be enough to say:
... in MIME, the Character Encoding Scheme is specified by
the Charset parameter to the Content-Type header field ...
or more verbosely
"charset", as defined by this document and as specified by the
Charset parameter to the Content-Type header field is a
synonym to a CES. As such the Charset parameter to the
Content-Type header completely defines how to map the result
of de-applying Transport Encoding Syntax to the binary
representation of the message body to a sequence of
<CCS, code-value> pairs.
Of course both variants are much less intuitive than
the original RFC's text.
IANA Charset Registration Procedures.
IAB Character Set Workshop held ...
... MIME)
ECMA-35 6th Edition. December 1994.
... ISO 2022)
IANA registers charset values according to RFC 2278)
CJK.INF Version 2.1
RE: modification to registration of charset ks_c_5601-1987
INTERNET-DRAFT
RE: codes:chars is many:one?
: codes:chars is many:one?script for generating this HTML from text |