Anton Tagunov http://tagunov.tripod.com tagunov@motor.ru
Moscow State University Scientific Computer Research Center http://srcc.msu.su
Apr 4 2002
Binary representation of textual data is governed by character set standards.
Globally unique names (like KS_C_5601-1987) refer
to specifications.
The goal of this survey is to classify named specifications
(JIS_C6220-1969-jp, JIS_C6220-1969-ro, JIS_X0208-1983, etc.).
To achieve this the current character set related terminology is surveyed (3) and abridged to a three category classification (2.2).
The categories of the (2.2) classification are however overlapping and section (5) advocates adoption of a new classification (2.1). The (2.1) classification differs from (2.2) only in that it has disjoint categories.
Sections (6), (7), (8) practice this classification on a large body of standards.
Section (8) 'CLASSIFICATION EXAMPLE. CJK STANDARDS' has
unintentionally become a brief reference on CJK standards.
Its (8.3) subsection is notably the point from which the whole
survey has grown: it emphasises a certain confusion surrounding
the most basic CJK standards. Please don't miss 8.3 :-)
Section (9) reflects the author's great desire to collect as much feedback as possible (tagunov@motor.ru).
Subsections (9.2), (9.3) also express a desire to fill any gaps and discrepancies that the readers are likely to notice, and point out the areas where readers' help is desirable. Let us consider this version 0.1 of the survey! ;-)
Section (4) discusses how [RFC 2130] undermines "coded character set" industry terminology and what are the best ways to get around that.
Appendixes B, C, D contain miscellaneous discussions excluded from the main body of the survey.
Have a good reading! :-)
This is the classification proposed by the current survey. It is derived from the classification introduced in (2.2) by making the categories disjoint.
See section (5) for an explanation of why this survey finds this classification most fruitful and advocates it over the (2.2) one.
DUAL      specifications that define a CCS and a CES
          (using this CCS and possibly other CCS's
          defined elsewhere to define the CES)
CES-ONLY  specifications that define a CES
          using CCS's defined elsewhere
          (and not specifying any CCS's of their own)
CCS-ONLY  specifications that define a CCS
          (and no CES's)
This classification is complete and disjoint: every character set specification fits into one of these categories, but none fits into two.
(See subsections (3.2), (3.3) for definitions of CES and CCS.)
This is the classification derived in section (3) from the
definitions found in ISO and RFC standards. These standards
operate with notions such as
"character set", "coded character set", "charset", "character encoding scheme".
The whole diversity of definitions given to these terms reduces to 3 categories:
CCS      Coded Character Set after [RFC 2130]
CES      Character Encoding Scheme after [RFC 2130]
ISO-CCS  Coded Character Set after ISO, cited by [RFC 1345]
This classification is complete but not disjoint. Here is how it relates to the (2.1) classification:
CES = DUAL + CES-ONLY
CCS = DUAL + CCS-ONLY
ISO-CCS = DUAL
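The relation between the two classifications can be sketched in a few lines of Python. The names `classify` and `legacy_labels` are hypothetical helpers introduced here purely for illustration; they do not come from any of the cited standards.

```python
def classify(defines_ccs: bool, defines_ces: bool) -> str:
    """Return the (2.1) category of a specification (illustrative helper)."""
    if defines_ccs and defines_ces:
        return "DUAL"
    if defines_ces:
        return "CES-ONLY"
    if defines_ccs:
        return "CCS-ONLY"
    raise ValueError("not a character set specification")


def legacy_labels(category: str) -> set:
    """Return every (2.2) label that applies to a (2.1) category."""
    return {
        "DUAL": {"CES", "CCS", "ISO-CCS"},
        "CES-ONLY": {"CES"},
        "CCS-ONLY": {"CCS"},
    }[category]


for flags in ((True, True), (False, True), (True, False)):
    category = classify(*flags)
    print(category, sorted(legacy_labels(category)))
```

Each specification gets exactly one (2.1) category, while a DUAL specification carries all three (2.2) labels at once, which is the overlap that the equations above express.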
The term "character" raises little disagreement. Both [RFC 2278] and [ECMA 35] define it as
a member of a set of elements used for the organization, control, or representation of data.
This is close to the definition in [UNICODE] and in other standards.
[RFC 2130], [RFC 2278] formulate
Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers
To avoid ambiguity we shall refer to this definition as 'CCS',
because this abbreviation seems not to be used in any other meaning.
Examples (after these RFC's): ISO 10646, US-ASCII, ISO-8859-*.
Please refer to Appendix B for a remark on CCS definition.
Specifications that match this definition are classified as CCS
in (2.2).
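As a small illustration of a CCS in this sense, Python strings are sequences of Unicode code points, so the built-in `ord` can stand in for the ISO 10646 / Unicode mapping from an abstract character to an integer. The wrapper name `code_value` is an assumption made for this sketch.

```python
import unicodedata


def code_value(character: str) -> int:
    """Map one abstract character to its integer in the Unicode / ISO 10646 CCS."""
    if len(character) != 1:
        raise ValueError("expected exactly one character")
    return ord(character)


for ch in ("A", "\u042f", "\u4e9c"):
    # Print the character name next to its integer code value.
    print(unicodedata.name(ch), hex(code_value(ch)))
```

Note that nothing here says how the integers are stored as octets; that is the job of a CES.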
[RFC 2130] and [RFC 2278] formulate:
Character Encoding Scheme (CES) is a mapping from a Coded
Character Set or several coded character sets to a set of
octets.
[RFC 2130] also contains another passage that we shall use as a CES
definition instead:
A definition of a character encoding scheme consists of:
... <CCS, code value> or to the error state "illegal octet sequence"
... CCS's registered by IANA or in text, of each CCS upon which this CES is based.
Please refer to Appendix C for a moderate discrepancy between these definitions.
Specifications that match these definitions are classified as CES
in (2.2).
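Read this way, a CES behaves like a function from octets to a sequence of <CCS, code value> pairs that may also fail with "illegal octet sequence". The following is a minimal sketch for the simplest case, a 7-bit CES over a single CCS; the function name `decode_us_ascii` and the use of `ValueError` for the error state are assumptions of this example.

```python
def decode_us_ascii(octets: bytes) -> list:
    """Toy CES: map octets to (CCS name, code value) pairs."""
    pairs = []
    for octet in octets:
        if octet > 0x7F:
            # The error state from the definition above.
            raise ValueError("illegal octet sequence")
        pairs.append(("US-ASCII", octet))
    return pairs


print(decode_us_ascii(b"Hi"))
```

For US-ASCII the code value and the octet coincide, which is why the CCS and the CES are so easily confused in simple single-byte standards.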
Following the latter definition we find two flavours of CES
specifications:
Those that use only CCS's defined elsewhere, CES-ONLY in (2.1).
And those that contain "specifications .. in text of ... CCS
upon which this CES is based". It is reasonable to assume that
every specification defines no more than one CCS of its own.
Hence the definition of category DUAL in the (2.1) classification.
The ISO-8859-* specifications seem both to define their own
96-character CCS and to reference another CCS (ASCII). This is
what (2.1)'s DUAL description means by
"using this CCS and possibly other CCS's defined elsewhere."
[RFC 1345] cites ISO definitions of a "coded character set":
The ISO definition of the term "coded character set" is as
follows: "A set of unambiguous rules that establishes a
character set and the one-to-one relationship between the
characters of the set and their coded representation."
As you can see, a specification defines an ISO 'coded character set'
if and only if it defines both a CCS and a CES for it. This is
exactly category DUAL in (2.1). Hence category ISO-CCS in (2.2)
is identical to DUAL in (2.1).
[RFC 1345] also has its own definition of a 'coded character set':
"A coded character set is a set of rules that unambiguously and completely determines which sequence of characters, if any, is represented by each possible sequence of ... bytes"
[RFC 2045] defines the term 'character set':
The term "character set" is used in MIME to refer to a method
of converting a sequence of octets into a sequence of
characters ...
from simple ... mappings such as US-ASCII to complex table
switching methods such as those that use ISO 2022's
techniques.
Even the newest [RFC 2278] that gives a CES definition parallels it
with:
The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters ...
All these definitions are essentially equivalent to
the CES definition (3.3).
The introduction of CES/CCS terminology by [RFC 2130]
helps a lot in understanding the character set world.
But it also has caused certain ambiguity:
'Coded Character Set' has been used for years in the
ISO meaning of DUAL (2.1) and introduction of a CCS
with a different meaning forces a terminology revision.
This section evaluates old and new terms one by one.
Totally unambiguous and fine.
Certainly the usage of the 3-letter abbreviation 'CCS' in a meaning
different from the full form 'coded character set'
is artificial. However this is probably the best we can do
now.
The full term 'coded character set' is currently caught in a
conflict between the older ISO definition and the newer one,
equivalent to CCS. Probably every speaker who chooses
to use it should first clarify which meaning he/she adheres
to.
The term 'character set' is quite ambiguous. It may be understood as
'coded character set' (in ISO meaning),
equivalent of CES
'coded character set' (in [RFC 2130] meaning),
equivalent of CCS
'charset' parameter of Content-Type MIME header,
also probably an equivalent of CES (4.5)
This survey's point of view is that it is best to use this
term only inside the phrase 'character set standards',
meaning the whole body of heterogeneous standards.
Many standards (including the MIME and HTTP ones) treat 'charset'
as a shorthand for 'character set' or 'coded character set'.
Both of these have become ambiguous.
But over the years 'charset' has been used to mean "what you may pass
as the 'charset' parameter of the Content-Type MIME or HTTP header".
In this meaning 'charset' is equivalent to CES.
Recognising this common usage [RFC 2278] has proposed to detach
'charset' from any other terms and use it on its own, effectively
as a synonym for CES. [RFC 2278]:
The term "charset" (see historical note below) is used here to refer to a method of converting a sequence of octets into a sequence of characters.
A related discussion on MIME charset parameter and [RFC 2130]
may also be found in the Appendix D.
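The "charset as CES" reading can be illustrated with Python's standard `email` package: the `charset` parameter of Content-Type selects the method that turns the body octets into characters. The message below is made up for this sketch, and KOI8-R is chosen only because it is one of the standards classified in this survey.

```python
from email import message_from_bytes

# "Privet" in Cyrillic, written with escapes to keep the source ASCII-only.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"

raw = (
    b"Content-Type: text/plain; charset=koi8-r\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n" + original.encode("koi8-r")
)

msg = message_from_bytes(raw)
charset = msg.get_content_charset()   # the MIME 'charset' parameter
octets = msg.get_payload(decode=True)  # body after undoing the transfer encoding
text = octets.decode(charset)          # the charset acts as the CES

print(charset, text == original)
```

The two-step shape (undo the transfer encoding first, then apply the charset) is the same split that [RFC 2130] draws between the transfer encoding layer and the CES.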
The author of this article will probably continue to use this term in exactly the same meaning as before.
A very general term and probably fine as it is. The terminology revision has not touched it.
May mean different things depending on context. In certain
circumstances it may be equivalent to CES,
in others to a CES combined with a 'Transfer Encoding Syntax'
[RFC 2130].
And it is not necessarily related to the binary representation of textual data.
This section will show that the current [RFC 2130] CES and CCS
definitions may not be narrow enough for certain purposes.
If a specification is classified as a CES,
it is not clear if that specification also defines a CCS.
If a specification is classified as a CCS,
it is not clear if that specification also defines a CES.
A CES specification is explicitly allowed to include or not include
a "private" CCS specification (3.3.2).
That is perfectly okay per se, but this survey calls for establishing
different family names for these two categories of CES specifications.
This survey's opinion is that adopting terminology based on the
DUAL, CES-ONLY, CCS-ONLY classification (2.1) would make it possible to
establish a better taxonomy for character set standards.
As an epigraph, [RFC 2130]:
... the MIME registry of character sets ... contains items
that may differ greatly in their applicability and semantics
in various Internet protocols.
[RFC 2278] says:
MIME ... and various other modern Internet protocols are
capable of using many different charsets. This registration
procedure exists ... to associate a specific name or names
with a given charset and to give an indication of whether
or not a given charset can be used in MIME text objects.
[IANA REG] is the registry established by this RFC.
[IANA REG] contains records both for DUAL (2.1) and CES-ONLY (2.1)
standards.
[IANA REG] does not contain records for CCS-ONLY (2.1) standards.
The probable reason is that the ISO "coded character set"
definition (3.4) does not allow creating a CCS-ONLY (2.1) standard,
but only a DUAL (2.1) one.
This explains why many of the standards (including those registered
at [IANA REG]) are DUAL (2.1) only formally, and function as
CCS-ONLY (2.1).
Naturally, the entries labeled MIME applicable (that have a
"preferred MIME name") are full weight CES's and are either
CES-ONLY's or DUALS both de facto and de jure. The above remark
is applicable to a subset of entries in the MIME registry that
do not have a "preferred MIME name".
The following sections (7) and (8) classify a number of standards, including those registered in [IANA REG].
Section (8.3) is specially devoted to standards
that are DUAL by their form but CCS-ONLY by their nature.
This section classifies some of the non-CJK and Unicode-related
standards against DUAL, CCS-ONLY, CES-ONLY classification (2.1).
Standards are referred to by their [IANA REG] preferred MIME name if they
have one.
DUAL: US-ASCII, ISO-8859-*, KOI8-R
CES-ONLY: UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4
CCS-ONLY: Unicode 3.2
Classification of CJK standards used to be a great source of
confusion, at least for the author of this survey (and was
the original incentive to write this survey.)
Note: this section references character specifications by [IANA REG] names if such names exist. For completeness (8.5) relates [IANA REG] names back to the names of official standards that contain the specifications.
Specifications like
ISO-2022-KR (reportedly unused)
EUC-KR
ISO-2022-JP
ISO-2022-JP-1
ISO-2022-JP-2
Shift_JIS (as defined by Appendix 1 of JIS X 0208:1997)
EUC-JP
GB2312 (8-bit EUC-style CES)
EUC-TW
ISO-2022-CN-EXT
easily classify as CES-ONLY (2.1).
JIS_X0201 [RFC 1345] is an 8-bit composition of two
7-bit DUAL's (2.1):
JIS_C6220-1969-ro (ISO646-JP)
JIS_C6220-1969-jp (katakana)
This standard defines its own CCS and a CES, hence it is classified as
DUAL.
Microsoft extension to Big5.
Erroneously referred to as Big5 by Microsoft products.
Adds characters to Big5 thus defining.
Microsoft extension to Shift_JIS.
Erroneously(?) referred to as Shift_JIS by Microsoft products.
The official Shift_JIS includes only the JIS X 0201 and JIS X 0208
repertoire, while Microsoft has always used Shift_JIS to
encode a wider repertoire.
As a historical predecessor Microsoft's variant
probably has more rights to the name, although it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.
Microsoft extension to GB2312.
Erroneously referred to as GB2312 by Microsoft products.
Backward compatible with GB2312.
Extends GB2312 with unified Han characters from ISO 10646-1:1993
not already present in GB_2312-80.
New characters are inserted at unused positions in the natural EUC-CN
0xA1A1 - 0xFEFE range and in the newly allocated 0x8140 - 0xFE7E range.
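The backward compatibility and the extension can both be observed with Python's built-in `gb2312` and `gbk` codecs. This is only a sketch: the helper `encodable` and the two sample characters are choices made for this example, and Python's codec tables stand in for the standards themselves.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if `text` can be encoded with the named Python codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "\u4e2d"   # U+4E2D, part of the GB_2312-80 repertoire
traditional = "\u570b"  # U+570B, a traditional form absent from GB_2312-80

# Backward compatibility: the shared character gets identical octets.
same_octets = simplified.encode("gb2312") == simplified.encode("gbk")

# The extension: only the wider codec can encode the second character.
only_in_gbk = encodable(traditional, "gbk") and not encodable(traditional, "gb2312")

print(same_octets, only_in_gbk)
```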
Microsoft extension to EUC-KR.
Also known as Unified Hangul Code, UHC.
Erroneously referred to as KS_C_5601-1987 by Microsoft products.
Adds 8822 pre-combined Hangul syllables to EUC-KR.
Uses the same extension technique as GBK.
China national standard.
Extends GBK.
Introduces 4-byte codes for characters.
Provides space for all assigned and unassigned Unicode 3.2
(BMP plus 16 extension planes) code points.
See also [DUERST].
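A quick way to see the 4-byte codes is Python's built-in `gb18030` codec. The sample characters below are arbitrary picks for this sketch: one from the BMP that GBK already covers and one from a supplementary plane.

```python
bmp_char = "\u4e2d"           # U+4E2D, already covered by GBK
supplementary = "\U00020000"  # U+20000, outside the BMP

two_byte = bmp_char.encode("gb18030")
four_byte = supplementary.encode("gb18030")

# Characters inherited from GBK keep their 2-byte codes,
# while the supplementary-plane character needs 4 bytes.
print(len(two_byte), len(four_byte))
```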
Single byte
JIS_C6220-1969-ro alias ISO646-JP
GB_1988-80 alias ISO646-CN
KSC5636 alias ISO646-KR
all serve as "raw material" for the CES-ONLY's listed in (8.1) by
defining 94-character CCS's, and also define ASCII-like 7-bit
CES's. The difference from ASCII is that Yen, Yuan and Won symbols
replace "$" (0x24) or "\" (0x5C), plus other minor changes,
as is natural for the ISO646-* family.
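How such a national variant differs from ASCII can be sketched as a decoding table with a few overrides. The helper names below are hypothetical; the two overrides used for ISO646-JP (YEN SIGN at 0x5C and OVERLINE at 0x7E) are the well-known JIS Roman substitutions and are stated here as this example's input rather than taken from the text above.

```python
def make_iso646_variant(replacements: dict) -> dict:
    """Build an octet -> character table: ASCII plus national replacements."""
    table = {code: chr(code) for code in range(0x80)}
    table.update(replacements)
    return table


def decode_7bit(octets: bytes, table: dict) -> str:
    """Decode octets with a 7-bit table, rejecting anything above 0x7F."""
    chars = []
    for octet in octets:
        if octet not in table:
            raise ValueError("illegal octet sequence")
        chars.append(table[octet])
    return "".join(chars)


ISO646_JP = make_iso646_variant({0x5C: "\u00a5", 0x7E: "\u203e"})

# The octet that ASCII reads as a backslash becomes a yen sign.
print(decode_7bit(bytes([0x5C, 0x31, 0x30, 0x30]), ISO646_JP))
```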
The following standards cited by [RFC 1345] (and registered by [IANA REG])
JIS_C6220-1969-jp alias JIS_C6220-1969 alias iso-ir-13
JIS_C6226-1983 alias JIS_X0208-1983 alias iso-ir-87
JIS_X0212-1990 alias iso-ir-159
GB_2312-80 alias iso-ir-58
KS_C_5601-1987 alias KS_C_5601-1989 alias iso-ir-149
alias KSC_5601
are in a peculiar position, because the ISO "coded character set"
definition (all these are ISO standards as the iso-ir-xyz alias
shows) does not allow creating a CCS-ONLY standard (2.1), but
only a DUAL (2.1). So each of these standards additionally defines
a 7-bit CES. We shall call this CES "raw" or "implied".
The natural ("cooked") role of these standards is to help build the
CES's listed in (8.1). (Hence we will call this CES "raw".)
The "raw" CES usage is so rare that we may suspect that it
was not even the standards creators' intention. (Hence we will
also call this CES "implied".)
For example JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that
encodes
(... ISO 6429) control characters at 0x00 - 0x1F
SPACE and DELETE at 0x20, 0x7F
... 0x21 - 0x7E.
It is not clear whether this CES is used at all, but it probably is not
(see also section (9)).
JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987
have even less usable "raw" 7-bit CES's. These are double-byte
7-bit CES's that have neither control characters nor SPACE and DELETE.
Naturally CES's that do not encode CR, LF and SPACE are only of
limited use.
They are known, however, to be used with fonts in the X Window System.
Other 94x94 standards should be in the same position
of implicitly defining mostly useless multibyte 7-bit CES's
without SPACE, BACKSPACE, CR, LF and other control characters.
These probably include
ISO-IR-165
GB/T 12345-90
CNS 11643-1992 defines 16 94x94 planes.
The first two planes have almost the same set of characters as Big5,
but at different code points. It is used only inside EUC-TW. It does
not define an implicit "raw" CES like those described in 8.3,
because only each of its planes might be vulnerable to this
"implicit" creation, not the standard as a whole.
See also (9.3).
This material probably does not belong here, but since
section (8) has already become a CJK standard listing,
for completeness here is an [IANA REG] name to original standard
name reference. (It is not clear why this info is missing from
[IANA REG].) Multiple standard names are given if a standard has
several names or several standard versions are identical in their
definition of the given CES. [IANA REG] aliases are marked with '='.
94 DUALs
[IANA REG]                                       original name(s)
JIS_C6220-1969-ro = ISO646-JP,                   JIS C 6220-1969
JIS_C6220-1969-jp = katakana = JIS_C6220-1969
GB_1988-80 = ISO646-CN                           GB 1988-80
KSC5636 = ISO646-KR                              KS C 5636-1993
                                                 KS C 5636-1989
94x94 DUALs
JIS_C6226-1983 = JIS_X0208-1983                  JIS C 6226-1983
                                                 JIS X 0208:1983
JIS_X0212-1990                                   JIS X 0212:1990
GB_2312-80                                       GB 2312-80
KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601       KS C 5601-1987
                                                 KS C 5601-1992
                                                 KS X 1001:1997
Extensions/revisions of the mentioned 94x94 DUAL's (2.1) not
registered at [IANA REG]:
JIS X 0208:1990
JIS X 0208:1997
KS X 1001:1998 (euro and one other char added)
The author of this survey would very much like to receive as much feedback on this article as possible.
Please send me all kinds of comments on this survey, your opinion on its topicality, all factual mistakes, any statements that you find controversial!
Your corrections will be incorporated into this document
as soon as possible or, if the author considers them
arguable, will probably form 'Appendix F. SPECIAL OPINIONS'.
The author is limited to surveying documents freely available online.
Specifically he has no access to ISO standards, or their
analogs, except for [ECMA 35].
The main sources of information on CJK have been [RFC 1345],
[CJK.INF] and the most helpful replies from Autrijus Tang and
Jungshik Shin on perl-unicode@perl.org.
Therefore contributions from readers who have access to documents that the author of this survey lacks may significantly improve it.
Issue #1.
Section (3.4) 'ISO CCS' would benefit from a direct reference
to some ISO standard that gives the ISO definition of "coded
character set".
Issue #2.
Section (3.5) 'OTHER DEFINITIONS REDUCING TO CES DEFINITION'
would benefit from listing more definitions of "character set"
and related identities.
Issue #3.
If for any standard (especially CJK) its most official
name (given by its original registration body) is not
listed or is misspelled (down to ':' vs '-' differences
and the wrong number of spaces) it would be highly appropriate
to correct this. Please do tell me.
Issue #4.
The author of this document will highly welcome
inclusion of other CJK standards into the classification,
section (8).
Also, if you feel there are strong reasons for inclusion
of non-CJK standards into section (7), please do tell me.
Issue #5.
Is "raw" 7-bit "implied" CES for JIS_C6220-1969-jp
used for any purpose? Has section (8.3) been wrong
in saying it isn't?
Issue #6.
How correct is section (8.4), which lists
CNS 11643-1992 as the only CCS-ONLY standard in the CJK world?
Issue #7.
What CES's are used with CCCII?
Does the CCCII standard specify a CES?
If yes, what is it called?
What is the full name of the standard?
Issue #8.
What CES's are used for ANSI Z39.64-1989?
Does the ANSI Z39.64-1989 standard specify a CES?
If yes, what is it called?
Help will be highly welcome!
Thanks to
Autrijus Tang
Jungshik Shin
and other posters on perl-unicode@perl.org for
detailed discussions of CJK standards!
Ken Lunde
for his super-informative [CJK.INF]
And special thanks to
Dan Kogai
for developing and maintaining the Perl Encode module that got me involved with character encoding issues.
To be continued :-)
It may be worth understanding the CCS definition in a special way:
CCS is a mapping from a set of abstract characters to a set of
integers, a set of integer pairs or a set of integer
triplets
Pairs naturally arise from the row-column codes of tables used to present character glyphs, and triplets from arrays of tables.
These multi-dimensional indexes easily map to integers.
But this is often done differently: for 94-character CCS's
we regularly use
0x41
Taking Ken Lunde's [CJK.INF] as an example of a document
discussing 94x94 CCS's we'll see two different notations:
06-85
0x21: 0x6161
Similar variations should be possible with 94x94x94 CCS's.
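The step from a row-column pair to a single integer can be sketched as follows for a 94x94 CCS. The helper names are hypothetical, and the convention used (add 0x20 to both the row and the cell, then pack them into a 16-bit value) is the usual 7-bit ISO 2022 style layout, assumed here rather than quoted from any one standard.

```python
def kuten_to_code(row: int, cell: int) -> int:
    """Pack a row-cell pair (both 1..94) into a 16-bit 7-bit-style code value."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return ((row + 0x20) << 8) | (cell + 0x20)


def code_to_kuten(code: int) -> tuple:
    """Inverse of kuten_to_code."""
    return (code >> 8) - 0x20, (code & 0xFF) - 0x20


# The row-cell notation 06-85 and its hexadecimal counterpart.
print("06-85 ->", hex(kuten_to_code(6, 85)))
print(hex(0x3021), "->", code_to_kuten(0x3021))
```

Setting the high bit of each byte of such a code value gives the EUC form of the same code point, which is one way the "raw" code values are reused inside the CES's of section (8.1).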
C1. [RFC 2130], [RFC 2278]:
Character Encoding Scheme is a mapping from a Coded Character Set (or several) to a set of octets.
C2. [RFC 2130]:
A definition of a character encoding scheme consists of:
... <CCS, code value> or to the error state "illegal octet sequence"
... CCS's registered by IANA or in text, of each CCS upon which this CES is based.
C3. [RFC 2278]:
The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters ... unconditional and unambiguous conversion in the other direction is not required
A notable difference is that C2 allows several octet sequences to
map to a single <CCS, code value> sequence while C1 does not.
We may of course say that C1 is dominating and outlaw multiple octet
sequences, but then a "charset" according to the C3 definition is not
automatically a CES, which breaks our neat classification. So the
author prefers to silently set aside the C1 definition (effectively
making C2 dominating).
In practice this issue is not that important because CES's try to
avoid associating several octet sequences with the same
<CCS, code value> sequence.
UTF-8 prescribes using the shortest possible byte sequence
to represent every Unicode code point, and calls every other
representation "malformed".
ISO-2022-* family members do not use ISO 2022's awesome
power to its full extent and thus rule out most possible
multiplicities.
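The UTF-8 shortest-form rule mentioned above is easy to observe with Python's built-in `utf-8` codec, which rejects over-long sequences. The helper name `is_well_formed_utf8` is introduced only for this sketch.

```python
def is_well_formed_utf8(octets: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = b"/"          # U+002F in its one-byte form
overlong = b"\xc0\xaf"   # the same code point padded out to two bytes

print(is_well_formed_utf8(shortest), is_well_formed_utf8(overlong))
```

Because the two-byte form is refused, each code point is left with a single legal octet sequence, so no multiplicity arises.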
Here is an example of multiplicity that does occur, however.
All the following sequences of octets produce the same
sequence of <CCS, code-value> pairs in ISO-2022-JP:
ESC $ B 0x50 0x50
ESC $ B ESC $ B 0x50 0x50
ESC $ B ESC ( B ESC $ B 0x50 0x50
...
Here ESC $ B means that the following octets should be interpreted
as pairs coding characters in the JIS X 0208-1983 coded character
set. ESC ( B denotes that the following octets should be interpreted
as ASCII. The point is that redundant escape sequences
may be added quite freely. Of course it is easy to establish
a normalization transformation that removes redundant escape
sequences, but ISO-2022-JP does not forbid them.
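The multiplicity can be reproduced with a toy decoder that covers only the two escape sequences discussed here. It is a deliberately simplified sketch, not an ISO-2022-JP implementation: the function name is hypothetical, only ESC $ B and ESC ( B are recognised, and code values are not range-checked.

```python
def decode_iso2022jp_subset(octets: bytes) -> list:
    """Map octets to (CCS name, code value) pairs for a two-escape subset."""
    ccs = "US-ASCII"  # initial state of the decoder
    pairs = []
    i = 0
    while i < len(octets):
        if octets[i] == 0x1B:  # ESC starts a designation sequence
            seq = octets[i:i + 3]
            if seq == b"\x1b$B":
                ccs = "JIS_X0208-1983"
            elif seq == b"\x1b(B":
                ccs = "US-ASCII"
            else:
                raise ValueError("illegal octet sequence")
            i += 3
        elif ccs == "US-ASCII":
            pairs.append((ccs, octets[i]))
            i += 1
        else:
            if i + 1 >= len(octets):
                raise ValueError("illegal octet sequence")
            # Two octets form one code value in the double-byte CCS.
            pairs.append((ccs, (octets[i] << 8) | octets[i + 1]))
            i += 2
    return pairs


variants = [
    b"\x1b$B\x50\x50",
    b"\x1b$B\x1b$B\x50\x50",
    b"\x1b$B\x1b(B\x1b$B\x50\x50",
]
results = [decode_iso2022jp_subset(v) for v in variants]
print(results[0], results[0] == results[1] == results[2])
```

All three octet sequences yield the same single pair, so the mapping from octets to pairs is many-to-one, as definition C2 permits.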
Hence the C1 definition should probably be silently dropped in
favour of C2.
It was [RFC 2130] that introduced the CES (3.3) definition.
It is all the more amusing to see that [RFC 2130] itself does not
use it to full power and goes tautological:
... in MIME, the Coded Character Set and Character Encoding
Scheme are specified by the Charset parameter to the
Content-Type header field ...
Every CES is already associated with a set of CCS's (3.3).
Rigorously, it would be enough to say:
... in MIME, the Character Encoding Scheme is specified by
the Charset parameter to the Content-Type header field ...
or more verbosely
"charset", as defined by this document and as specified by the
Charset parameter to the Content-Type header field is a
synonym for a CES. As such the Charset parameter to the
Content-Type header completely defines how to map the result
of de-applying the Transfer Encoding Syntax to the binary
representation of the message body to a sequence of <CCS,
code-value> pairs.
Of course both variants are much less intuitive than
the original RFC's text.
IANA Charset Registration Procedures.
IAB Character Set Workshop held ...
... MIME)
ECMA-35 6th Edition. December 1994.
... ISO 2022)
... IANA registers charset values according to RFC 2278)
CJK.INF Version 2.1
RE: modification to registration of charset ks_c_5601-1987
INTERNET-DRAFT
RE: codes:chars is many:one?