"CHARACTER SET" TERMINOLOGY SURVEY, VERSION 0.5

Anton Tagunov
http://tagunov.tripod.com
tagunov@motor.ru

Moscow State University
Scientific Computer Research Center
http://srcc.msu.su

TABLE OF CONTENTS
1.  INTRODUCTION
2.  CLASSIFICATIONS
    2.1 DUAL, CES-ONLY, CCS-ONLY
    2.2 CCS, CES, ISO-CCS
3.  CURRENT TERMINOLOGY SURVEY
    3.1 CHARACTER
    3.2 RFC 2130 CCS
    3.3 CES
        3.3.1 BASIC DEFINITION
        3.3.2 EXTENDED DEFINITION
    3.4 ISO-CCS
    3.5 OTHER DEFINITIONS REDUCING TO CES DEFINITION
4.  RECOMMENDED TERMINOLOGY USAGE
    4.1 CES
    4.2 CCS
    4.3 CODED CHARACTER SET
    4.4 CHARACTER SET
    4.5 CHARSET
    4.6 ENCODING
5.  ADVOCATING NEW TERMINOLOGY
6.  IANA CHARACTER SET REGISTRY
7.  CLASSIFICATION EXAMPLE. NON-CJK STANDARDS
    7.1 DUAL
    7.2 CES-ONLY
    7.3 CCS-ONLY
8.  CLASSIFICATION EXAMPLE. CJK STANDARDS.
    8.1 CES-ONLY
        8.1.1 MULTIBYTE
        8.1.2 SINGLE BYTE
    8.2 DUAL
        8.2.1 Big5
        8.2.2 CP950
        8.2.3 CP932
        8.2.4 CP936, GBK
        8.2.5 CP949
        8.2.6 GB 18030-2000
        8.2.7 ISO646-* SINGLE-BYTES
    8.3 CONTROVERSIAL: DUAL OR CCS-ONLY?
    8.4 CCS-ONLY: CNS 11643-1992
    8.5 STANDARDS' NAMES REFERENCE
9.  CALL FOR FEEDBACK AND CONTRIBUTION
    9.1 COMMENTS MORE THAN WELCOME
    9.2 EXTENDING THE LIMITS
    9.3 DOUBTS AND GAPS
Appendix A. THANKS
Appendix B. UPGRADING CCS DEFINITION
Appendix C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS
Appendix D. RFC 2130 ON MIME CHARSET
Appendix E. REFERENCES

1. INTRODUCTION

The binary representation of textual data is governed by
character set standards.

Globally unique names (like KS_C_5601-1987) refer
to specifications.

The goal of this survey is to classify named specifications
(JIS_C6220-1969-jp, JIS_C6220-1969-ro, JIS_X0208-1983, etc.).

To achieve this, the current character set related terminology
is surveyed (3) and reduced to a three-category classification (2.2).

The categories of the (2.2) classification, however, overlap,
and section (5) advocates adoption of a new classification (2.1).
The (2.1) classification differs from (2.2) only in that it has
disjoint categories.

Sections (6), (7), (8) apply this classification to a large body
of standards.

Section (8) 'CLASSIFICATION EXAMPLE. CJK STANDARDS' has
unintentionally become a brief reference on CJK standards.

Its subsection (8.3) is notably the point from which the whole
survey has grown: it emphasises a certain confusion surrounding
the most basic CJK standards. Please don't miss 8.3 :-)

Section (9) reflects the author's great desire to collect as
much feedback as possible (tagunov@motor.ru).

Subsections (9.2), (9.3) also express a desire to fill any
gaps and discrepancies that the readers are likely to notice,
and point out the areas where readers' help is desirable.
Let us consider this version 0.1 of the survey! ;-)

Section (4) discusses how [RFC 2130] undermines the "coded
character set" industry terminology and what the best ways
to get around that are.

Appendixes B, C, D contain miscellaneous discussions excluded
from the main body of the survey.

Have a good reading! :-)

2. CLASSIFICATIONS

2.1 DUAL, CES-ONLY, CCS-ONLY

This is the classification proposed by the current survey.
It is derived from classification introduced in (2.2) by
making the categories disjoint.

See section (5) for an explanation of why this survey finds
this classification the most fruitful and advocates it over
the (2.2) one.

  DUAL      specifications that define
            a CCS and a CES
            (using this CCS and
             possibly other CCS's
             defined elsewhere to define the CES)

  CES-ONLY  specifications that define
            a CES
            using CCS's defined elsewhere
            (and not specifying any CCS's
             on their own)

  CCS-ONLY  specifications that define
            a CCS
            (and no CES's)

This classification is complete and disjoint:
every character set specification fits into one of these categories,
but none fits into two.

(See subsections (3.2), (3.3) for definitions of CES and CCS.)
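
The classification can be sketched as a small decision function.
This is only an illustration: the names Category and classify and
the two boolean inputs are hypothetical and not taken from any
standard.

```python
from enum import Enum


class Category(Enum):
    DUAL = "DUAL"
    CES_ONLY = "CES-ONLY"
    CCS_ONLY = "CCS-ONLY"


def classify(defines_ccs: bool, defines_ces: bool) -> Category:
    """Place a specification into exactly one category of (2.1)."""
    if defines_ccs and defines_ces:
        return Category.DUAL
    if defines_ces:
        return Category.CES_ONLY
    if defines_ccs:
        return Category.CCS_ONLY
    # A text defining neither is not a character set specification.
    raise ValueError("not a character set specification")


print(classify(True, True).value)    # DUAL
print(classify(False, True).value)   # CES-ONLY
print(classify(True, False).value)   # CCS-ONLY
```

Because each input combination reaches exactly one return statement,
the categories cannot overlap.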

2.2 CCS, CES, ISO-CCS

This is the classification derived in section (3) from
definitions found in ISO and RFC standards. These standards
operate with notions such as

  "character set"
  "coded character set"
  "charset"
  "character encoding scheme"

The whole diversity of definitions given to these
terms reduces to 3 categories:

  CCS      Coded Character Set       after [RFC 2130]
  CES      Character Encoding Scheme after [RFC 2130]
  ISO-CCS  Coded Character Set       after ISO, cited
                                     by [RFC 1345]

This classification is complete but not disjoint.
Here is how it relates to the (2.1) classification:

  CES = DUAL + CES-ONLY
  CCS = DUAL + CCS-ONLY
  ISO-CCS = DUAL
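
These relations can be checked with plain set arithmetic. The sketch
below is illustrative only; the members are a few of the examples
from section (7), and the variable names are made up for this
example.

```python
# Disjoint categories of (2.1), filled with examples from section (7).
DUAL = {"US-ASCII", "KOI8-R"}
CES_ONLY = {"UTF-8", "UTF-16"}
CCS_ONLY = {"Unicode 3.2"}

# Overlapping categories of (2.2), derived exactly as stated above.
CES = DUAL | CES_ONLY
CCS = DUAL | CCS_ONLY
ISO_CCS = set(DUAL)

# The overlap of CES and CCS is precisely the DUAL category.
print(sorted(CES & CCS))
```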


3. CURRENT TERMINOLOGY SURVEY

3.1 CHARACTER

The term "character" raises little disagreement.
Both [RFC 2278] and [ECMA 35] define it as

  a member of a set of elements used for the organization,
  control, or representation of data.

This is close to the definitions in [UNICODE] and in other standards.

3.2 RFC 2130 CCS

[RFC 2130], [RFC 2278] formulate

  Coded Character Set (CCS) is a mapping from a set of abstract
  characters to a set of integers

To avoid ambiguity we shall refer to this definition as 'CCS',
because this abbreviation does not seem to be used in any other meaning.

Examples (after these RFC's): ISO 10646, US-ASCII, ISO-8859-*.

Please refer to Appendix B for a remark on the CCS definition.

Specifications that match this definition are classified as CCS
in (2.2).
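
As a minimal sketch, a CCS in this sense can be modelled as a plain
dictionary from abstract characters to integers. The fragment below
is illustrative: abstract characters are identified by their Unicode
names (an assumption of this example), and the dictionary name is
hypothetical.

```python
import unicodedata

# A fragment of the US-ASCII CCS: abstract character -> integer.
US_ASCII_FRAGMENT = {
    "LATIN CAPITAL LETTER A": 0x41,
    "LATIN SMALL LETTER A": 0x61,
    "DIGIT ZERO": 0x30,
}

for name, code in US_ASCII_FRAGMENT.items():
    # ISO 10646 assigns the same integers to these characters.
    assert ord(unicodedata.lookup(name)) == code
    print(f"{name} -> 0x{code:02X}")
```

Note that nothing here says how the integers are laid out as octets;
that is the job of a CES (3.3).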

3.3 CES

3.3.1 BASIC DEFINITION

[RFC 2130] and [RFC 2278] formulate:

  Character Encoding Scheme (CES) is a mapping from a Coded
  Character Set or several coded character sets to a set of
  octets.

3.3.2 EXTENDED DEFINITION

[RFC 2130] also contains another passage that we shall use as a CES
definition instead:

  A definition of a character encoding scheme consists of:
  - A description of an algorithm which transforms every
    possible sequence of octets to either a sequence of
    pairs <CCS, code value> or to the error state
    "illegal octet sequence"
  - Specifications, either by reference to CCS's registered by
    IANA or in text, of each CCS upon which this CES is based.

Please refer to Appendix C for a moderate discrepancy between these
definitions.

Specifications that match these definitions are classified as CES
in (2.2).
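
The extended definition can be illustrated with a toy decoder. This
is a sketch under stated assumptions, not the text of any standard:
the octet layout is modelled on EUC-KR (octets below 0x80 taken from
US-ASCII, pairs of octets in 0xA1..0xFE taken from KS_C_5601-1987),
and the names decode_euc_style and IllegalOctetSequence are
hypothetical.

```python
class IllegalOctetSequence(ValueError):
    """The error state named in the [RFC 2130] definition."""


def decode_euc_style(octets: bytes, g0: str = "US-ASCII",
                     g1: str = "KS_C_5601-1987") -> list:
    """Transform octets into a list of <CCS, code value> pairs."""
    pairs = []
    i = 0
    while i < len(octets):
        b = octets[i]
        if b < 0x80:
            # Single octet: code value in the first CCS.
            pairs.append((g0, b))
            i += 1
        elif (0xA1 <= b <= 0xFE and i + 1 < len(octets)
              and 0xA1 <= octets[i + 1] <= 0xFE):
            # Two octets: strip the high bits to get the 94x94 code value.
            pairs.append((g1, ((b & 0x7F) << 8) | (octets[i + 1] & 0x7F)))
            i += 2
        else:
            raise IllegalOctetSequence(f"illegal octet sequence at offset {i}")
    return pairs


print(decode_euc_style(b"A\xb0\xa1"))
```

The decoder itself contains no character tables: it only refers to
CCS's by name, which is what makes such a specification CES-ONLY
rather than DUAL.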

Following the latter definition we find two flavours of CES
specifications:

Those that use only CCS's defined elsewhere: CES-ONLY in (2.1).

And those that contain "specifications .. in text of ... CCS
upon which this CES is based". It is reasonable to assume that
every specification defines no more than one CCS on its own.
Hence the definition of category DUAL in the (2.1) classification.

The ISO-8859-* specifications seem both to define their own
96-character CCS and to reference another CCS (ASCII). This is
what (2.1)'s DUAL description means by
"using this CCS and possibly other CCS's defined elsewhere."

3.4 ISO-CCS

[RFC 1345] cites the ISO definition of a "coded character set":

  The ISO definition of the term "coded character set" is as
  follows: "A set of unambiguous rules that establishes a
  character set and the one-to-one relationship between the
  characters of the set and their coded representation."

As you can see, a specification defines an ISO 'coded character set'
if and only if it defines both a CCS and a CES for it. This is
exactly category DUAL in (2.1). Hence category ISO-CCS in (2.2)
is identical to DUAL in (2.1).

3.5 OTHER DEFINITIONS REDUCING TO CES DEFINITION

[RFC 1345] also has its own definition of a 'coded character set':

  "A coded character set is a set of rules that unambiguously
  and completely determines which sequence of characters, if
  any, is represented by each possible sequence of ... bytes"

[RFC 2045] defines the term 'character set':

  The term "character set" is used in MIME to refer to a method
  of converting a sequence of octets into a sequence of
  characters ...

  from simple ... mappings such as US-ASCII to complex table
  switching methods such as those that use ISO 2022's
  techniques.

Even the newest [RFC 2278], which gives a CES definition, parallels it
with:

  The term "charset" ... is used here to refer to a method of
  converting a sequence of octets into a sequence of
  characters ...

All these definitions are essentially equivalent to
the CES definition (3.3).
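
A short illustration of "a method of converting a sequence of octets
into a sequence of characters": one and the same octet yields
different characters under different charsets. The sketch uses
Python's built-in codecs as stand-ins for the registered charsets of
the same names.

```python
import unicodedata

octet = b"\xe4"  # one and the same octet

# Two charsets turn it into two different characters.
as_latin1 = octet.decode("iso-8859-1")
as_koi8r = octet.decode("koi8-r")

print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER A WITH DIAERESIS
print(unicodedata.name(as_koi8r))   # CYRILLIC CAPITAL LETTER DE
```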

4. RECOMMENDED TERMINOLOGY USAGE

The introduction of the CES/CCS terminology by [RFC 2130]
helps a lot in understanding the character set world.
But it has also caused a certain ambiguity:
'Coded Character Set' has been used for years in the
ISO meaning of DUAL (2.1), and the introduction of a CCS
with a different meaning forces a terminology revision.

This section evaluates old and new terms one by one.

4.1 CES

Totally unambiguous and fine.

4.2 CCS

Certainly, using the 3-letter abbreviation 'CCS' in a meaning
different from the full form 'coded character set'
is artificial. However, this is probably the best we can do
now.

4.3 CODED CHARACTER SET

The full term 'coded character set' is currently caught in a
conflict between the older ISO definition and the newer one,
equivalent to CCS. Probably every speaker who chooses
to use it should first clarify which meaning he/she adheres
to.

4.4 CHARACTER SET

The term 'character set' is quite ambiguous. It may be read as
- an abbreviation of 'coded character set' (in the ISO meaning),
  equivalent to CES
- an abbreviation of 'coded character set' (in the [RFC 2130] meaning),
  equivalent to CCS
- an expansion of the 'charset' parameter of the Content-Type MIME
  header, also probably equivalent to CES (4.5)

This survey's point of view is that it is best to use this
term only inside the phrase 'character set standards',
meaning the whole body of heterogeneous standards.

4.5 CHARSET

Many standards (including MIME and HTTP ones) treat 'charset'
as a shorthand for 'character set' or 'coded character set'.
Both of these have become ambiguous.

But over the years 'charset' has been used to mean "what you may pass
as the 'charset' parameter of the Content-Type MIME or HTTP header".
In this meaning 'charset' is equivalent to CES.
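
This usage can be sketched with Python's standard email package: the
'charset' parameter is read from the header and then used directly
as the name of the octets-to-characters conversion. The header value
and the body octets below are made up for the illustration.

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = "text/plain; charset=KOI8-R"

# The parameter value, lower-cased by the library.
charset = msg.get_content_charset()

# Hypothetical body octets, interpreted according to that charset.
body_octets = b"\xf4\xc5\xd3\xd4"
text = body_octets.decode(charset)

print(charset, len(text))
```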

Recognising this common usage, [RFC 2278] has proposed to detach
'charset' from any other terms and use it on its own, effectively
as a synonym for CES. [RFC 2278]:

  The term "charset" (see historical note below) is used here to
  refer to a method of converting a sequence of octets into a
  sequence of characters.

A related discussion of the MIME charset parameter and [RFC 2130]
may also be found in Appendix D.

The author of this article will probably continue to use this
term in exactly the same meaning as before.

4.6 ENCODING

A very general term and probably fine as it is.
The terminology revision has not touched it.

It may mean different things depending on context. In certain
circumstances it may be equivalent to CES,
in others to a CES combined with a 'Transfer Encoding Syntax'
[RFC 2130].
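
The two layers can be told apart in a few lines. In this sketch UTF-8
plays the role of the CES and base64 the role of the transfer
encoding; both choices are assumptions made for the example only.

```python
import base64

text = "na\u00efve"

# Inner layer, the CES: characters -> octets.
octets = text.encode("utf-8")
# Outer layer, the transfer encoding: octets -> 7-bit safe octets.
wire = base64.b64encode(octets)

# The receiver undoes the layers in the opposite order.
restored = base64.b64decode(wire).decode("utf-8")
print(restored == text)  # True
```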

And it is not necessarily related to the binary representation of
textual data.

5. ADVOCATING NEW TERMINOLOGY

This section will show that the current [RFC 2130] CES and CCS
definitions may not be narrow enough for certain purposes.

If a specification is classified as a CES,
it is not clear if that specification also defines a CCS.

If a specification is classified as a CCS,
it is not clear if that specification also defines a CES.

A CES specification is explicitly allowed to include or not include
a "private" CCS specification (3.3.2).

This is perfectly okay per se, but this survey calls for establishing
different family names for these two categories of CES specifications.

This survey's opinion is that adopting terminology based on the
DUAL, CES-ONLY, CCS-ONLY classification (2.1) would allow
a better taxonomy of character set standards to be established.

6. IANA CHARACTER SET REGISTRY

As an epigraph, [RFC 2130]:

  ... the MIME registry of character sets ... contains items
  that may differ greatly in their applicability and semantics
  in various Internet protocols.

[RFC 2278] says:
   MIME ... and various other modern Internet protocols are
   capable of using many different charsets. This registration
   procedure exists ... to associate a specific name or names
   with a given charset and to give an indication of whether
   or not a given charset can be used in MIME text objects.

[IANA REG] is the registry established by this RFC.

[IANA REG] contains records both for DUAL (2.1) and CES-ONLY (2.1)
standards.

[IANA REG] does not contain records for CCS-ONLY (2.1) standards.

The probable reason is that the ISO "coded character set"
definition (3.4) does not allow creating a CCS-ONLY (2.1) standard,
but only a DUAL (2.1) one.

This explains why many of the standards (including those registered
at [IANA REG]) are DUAL (2.1) only formally, and function as
CCS-ONLY (2.1).

Naturally, the entries labeled MIME applicable (that have a
"preferred MIME name") are full-weight CES's and are either
CES-ONLY's or DUALs both de facto and de jure. The above remark
applies to the subset of entries in the MIME registry that
do not have a "preferred MIME name".

The following sections (7) and (8) classify a number of standards,
including those registered in [IANA REG].

Section (8.3) is specially devoted to standards
that are DUAL by their form but CCS-ONLY by their nature.

7. CLASSIFICATION EXAMPLE. NON-CJK STANDARDS

This section classifies some of the non-CJK and Unicode-related
standards against the DUAL, CCS-ONLY, CES-ONLY classification (2.1).

Standards are referred to by their [IANA REG] preferred MIME name if
they have one.

7.1 DUAL

  US-ASCII, ISO-8859-*, KOI8-R

7.2 CES-ONLY

  UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4

7.3 CCS-ONLY

  Unicode 3.2

8. CLASSIFICATION EXAMPLE. CJK STANDARDS.

Classification of CJK standards used to be a great source of
confusion, at least for the author of this survey (and was
the original incentive to write it).

Note: this section references character specifications by
[IANA REG] names if such names exist. For completeness (8.5)
relates [IANA REG] names back to the names of official
standards that contain the specifications.

8.1 CES-ONLY

8.1.1 MULTIBYTE

Specifications like

  ISO-2022-KR (reportedly unused)
  EUC-KR
  ISO-2022-JP  ISO-2022-JP-1 ISO-2022-JP-2
  Shift_JIS   (as defined by Appendix 1 of JIS X 0208:1997)
  EUC-JP
  GB2312      (8-bit EUC-style CES)
  EUC-TW

  ISO-2022-CN-EXT

are easily classified as CES-ONLY (2.1).

8.1.2 SINGLE BYTE

JIS_X0201 [RFC 1345] is an 8-bit composition of two
7-bit DUALs (2.1):

  JIS_C6220-1969-ro (ISO646-JP)
  JIS_C6220-1969-jp (katakana)

8.2 DUAL

8.2.1 Big5

This standard defines its own CCS and a CES, hence it is classified
as DUAL.

8.2.2 CP950

Microsoft extension to Big5.
Erroneously referred to as Big5 by Microsoft products.
Adds characters to Big5, thus defining its own CCS.

8.2.3 CP932

Microsoft extension to Shift_JIS.
Erroneously(?) referred to as Shift_JIS by Microsoft products.

The official Shift_JIS includes only the JIS X 0201 and JIS X 0208
repertoire, while Microsoft has always used the name Shift_JIS for
an encoding of a wider repertoire.

As a historical predecessor, Microsoft's variant
probably has more rights to the name, although it may be objected
that Microsoft shouldn't have used JIS as part of the name
in the first place.

8.2.4 CP936, GBK

Microsoft extension to GB2312.
Erroneously referred to as GB2312 by Microsoft products.
Backward compatible with GB2312.
Extends GB2312 with unified Han characters from ISO 10646-1:1993
not already present in GB_2312-80.

New characters are inserted into unused positions in the EUC-CN natural
0xA1A1 - 0xFEFE range and into the newly allocated 0x8140-0xFE7E range.

8.2.5 CP949

Microsoft extension to EUC-KR.
Also known as Unified Hangul Code, UHC.
Erroneously referred to as KS_C_5601-1987 by Microsoft products.
Adds 8822 pre-combined Hangul syllables to EUC-KR.

Uses the same extension technique as GBK.

8.2.6 GB 18030-2000

Chinese national standard.
Extends GBK.
Introduces four-byte codes for characters.
Provides space for all assigned and unassigned Unicode 3.2
(BMP plus 16 extension planes) code points.

See also [DUERST].
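
The variable code length can be observed with Python's built-in
"gb18030" codec, used here as an approximation of the standard (an
assumption: the codec may follow a later revision than
GB 18030-2000). The three sample characters are chosen for the
illustration only.

```python
samples = ("A", "\u4e2d", "\U00010000")

for ch in samples:
    encoded = ch.encode("gb18030")
    # One octet for ASCII, two for a GB_2312-80 character,
    # four for a code point outside the BMP.
    print(f"U+{ord(ch):04X} -> {len(encoded)} octet(s): {encoded.hex()}")
```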

8.2.7 ISO646-* SINGLE-BYTES

The single-byte specifications

  JIS_C6220-1969-ro    alias ISO646-JP
  GB_1988-80           alias ISO646-CN
  KSC5636              alias ISO646-KR

all serve as "raw material" for the CES-ONLY's listed in (8.1) by
defining 94-character CCS's, and also define ASCII-like 7-bit
CES's. The difference from ASCII is Yen, Yuan, Won symbols
replacing "$" (0x24) or "\" (0x5C), and other minor changes,
as is natural for the ISO646-* family.
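
Such a 7-bit CES can be sketched as US-ASCII plus a handful of
overrides. The override positions below are assumptions taken from
commonly published ISO646-* tables, not from this survey, and the
names ISO646_OVERRIDES and decode_iso646 are hypothetical.

```python
# Positions where each variant is assumed to differ from US-ASCII.
ISO646_OVERRIDES = {
    "ISO646-JP": {0x5C: "\u00a5", 0x7E: "\u203e"},  # YEN SIGN, OVERLINE
    "ISO646-CN": {0x24: "\u00a5", 0x7E: "\u203e"},  # Yuan shares U+00A5
    "ISO646-KR": {0x5C: "\u20a9"},                  # WON SIGN
}


def decode_iso646(variant: str, octets: bytes) -> str:
    """Decode 7-bit octets using US-ASCII plus the variant's overrides."""
    overrides = ISO646_OVERRIDES[variant]
    chars = []
    for b in octets:
        if b > 0x7F:
            raise ValueError(f"not a 7-bit octet: 0x{b:02X}")
        chars.append(overrides.get(b, chr(b)))
    return "".join(chars)


# The same octet 0x5C reads differently in each variant.
for variant in ISO646_OVERRIDES:
    print(variant, ascii(decode_iso646(variant, b"\x5c")))
```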

8.3 CONTROVERSIAL: DUAL OR CCS-ONLY?

The following standards cited by [RFC 1345] (and registered by
[IANA REG])

  JIS_C6220-1969-jp    alias JIS_C6220-1969   alias iso-ir-13

  JIS_C6226-1983       alias JIS_X0208-1983   alias iso-ir-87
  JIS_X0212-1990                              alias iso-ir-159
  GB_2312-80                                  alias iso-ir-58
  KS_C_5601-1987       alias KS_C_5601-1989   alias iso-ir-149
                       alias KSC_5601


are in a peculiar position: the ISO "coded character set"
definition (all these are ISO standards, as the iso-ir-xyz alias
shows) does not allow creating a CCS-ONLY standard (2.1), but
only a DUAL (2.1) one. So each of these standards additionally defines
a 7-bit CES. We shall call this CES "raw" or "implied".

The natural ("cooked") role of these standards is to help build
the CES's listed in (8.1). (Hence we will call this CES "raw".)

The "raw" CES usage is so rare that we may suspect that it
was not even the intention of the standards' creators. (Hence we
will also call this CES "implied".)

For example, JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that
encodes
- regular (ISO 6429) control characters at 0x00 - 0x1F
- SPACE and DELETE                      at 0x20,  0x7F
- Katakana                              at 0x21 - 0x7E.
It is not clear whether this CES is used at all, but it probably is not
(see also section (9)).

JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987
have even less usable "raw" 7-bit CES's. These are double-byte
7-bit CES's that have neither control characters nor SPACE and DELETE.
Naturally, CES's that do not encode CR, LF and SPACE are only of
limited use.
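
A sketch of such a "raw" double-byte CES shows why it is so limited.
The function name decode_raw_94x94 is hypothetical, and the only
assumption is that both octets of every pair must lie in
0x21..0x7E.

```python
def decode_raw_94x94(octets: bytes, ccs: str = "GB_2312-80") -> list:
    """Decode a 'raw' 7-bit double-byte CES into <CCS, code value> pairs."""
    if len(octets) % 2:
        raise ValueError("illegal octet sequence: odd number of octets")
    pairs = []
    for i in range(0, len(octets), 2):
        row, cell = octets[i], octets[i + 1]
        # Only the 94 graphic positions are legal in either octet,
        # so SPACE (0x20), CR, LF and DELETE are all rejected.
        if not (0x21 <= row <= 0x7E and 0x21 <= cell <= 0x7E):
            raise ValueError(f"illegal octet sequence at offset {i}")
        pairs.append((ccs, (row << 8) | cell))
    return pairs


print(decode_raw_94x94(b"\x30\x21"))
```

A text containing a line break or a space between words cannot be
represented in this CES at all.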

They are known, however, to be used with fonts in the X Window System.

Other 94x94 standards should be in the same position
of implicitly defining mostly useless multibyte 7-bit CES's
without SPACE, BACKSPACE, CR, LF and other control characters.
These probably include

  ISO-IR-165
  GB/T 12345-90

8.4 CCS-ONLY: CNS 11643-1992

CNS 11643-1992 defines 16 94x94 planes.
The first two planes have almost the same set of characters as Big5,
but at different code points. It is used only inside EUC-TW. It does
not define an implicit "raw" CES like those described in (8.3),
because only each of its planes might be vulnerable to this
"implicit" creation, not the standard as a whole.

See also (9.3).

8.5 STANDARDS' NAMES REFERENCE

This material probably does not belong here, but since
section (8) has already become a CJK standard listing,
for completeness here is an [IANA REG] name to original standard
name reference. (It is not clear why this info is missing from
[IANA REG].) Multiple standard names are given if a standard has
several names or if several standard versions are identical in their
definition of the given CES. [IANA REG] aliases are marked with '='.

94 DUALs

  [IANA REG]                                   original name(s)

  JIS_C6220-1969-ro = ISO646-JP,               JIS C 6220-1969
  JIS_C6220-1969-jp = katakana =
  JIS_C6220-1969

  GB_1988-80        = ISO646-CN                GB 1988-80

  KSC5636           = ISO646-KR                KS C 5636-1993
                                               KS C 5636-1989
94x94 DUALs

  JIS_C6226-1983 = JIS_X0208-1983              JIS C 6226-1983
                                               JIS X 0208:1983
  JIS_X0212-1990                               JIS X 0212:1990
  GB_2312-80                                   GB 2312-80

  KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601   KS C 5601-1987
                                               KS C 5601-1992
                                               KS X 1001:1997

Extensions/revisions of the mentioned 94x94 DUAL's (2.1) not
registered at [IANA REG]:

  JIS X 0208:1990
  JIS X 0208:1997
  KS X 1001:1998  (euro and one other char added)


9.  CALL FOR FEEDBACK AND CONTRIBUTION

9.1 COMMENTS MORE THAN WELCOME

The author of this survey would very much like to receive
as much feedback on this article as possible.

Please send me all kinds of comments on this survey: your
opinion on its topicality, all factual mistakes,
any statements that you find controversial!

Your corrections will be incorporated into this document
as soon as possible or, if the author considers them
arguable, will probably form 'Appendix F. SPECIAL OPINIONS'.

9.2 EXTENDING THE LIMITS

The author is limited to surveying documents freely
available online.

Specifically he has no access to ISO standards, or their
analogs, except for [ECMA 35].

The main sources of information on CJK have been [RFC 1345],
[CJK.INF] and most helpful replies from Autrijus Tang and
Jungshik Shin on perl-unicode@perl.org.

Therefore contributions from readers who have access to
documents that the author of this survey does not have
access to may significantly improve it.

Issue #1.
Section (3.4) 'ISO-CCS' would benefit from a direct reference
to some ISO standard that gives the ISO definition of "coded
character set".

Issue #2.
Section (3.5) 'OTHER DEFINITIONS REDUCING TO CES DEFINITION'
would benefit from listing more definitions of "character set"
and related entities.

Issue #3.
If for any standard (especially CJK) its most official
name (given by its original registration body) is not
listed or is misspelled (down to ':' vs '-' differences
and a wrong number of spaces) it would be highly appropriate
to correct this. Please do tell me.

Issue #4.
The author of this document will highly welcome
the inclusion of other CJK standards into the classification,
section (8).

Also, if you feel there are strong reasons for inclusion
of non-CJK standards into section (7), please do tell me.

9.3 DOUBTS AND GAPS

Issue #5.
Is the "raw" 7-bit "implied" CES for JIS_C6220-1969-jp
used for any purpose? Has section (8.3) been wrong
in saying it probably isn't?

Issue #6.
How correct is section (8.4), which lists
CNS 11643-1992 as the only CCS-ONLY standard in the CJK world?

Issue #7.
What CES's are used with CCCII?
Does the CCCII standard specify a CES?
If yes, what is it called?
What is the full name of the standard?

Issue #8.
What CES's are used for ANSI Z39.64-1989?
Does the ANSI Z39.64-1989 standard specify a CES?
If yes, what is it called?

Help will be highly welcome!

APPENDIX A. THANKS

Thanks to

  Autrijus Tang
  Jungshik Shin

and other posters of perl-unicode@perl.org for
detailed discussions of CJK standards!

  Ken Lunde

for his super-informative [CJK.INF]

And special thanks to
  Dan Kogai

for developing and maintaining the Perl Encode module, which
got me started with the character encoding issues.

To be continued :-)

APPENDIX B. UPGRADING CCS DEFINITION

It may be worth understanding the CCS definition in an extended way:

  CCS is a mapping from a set of abstract characters to a set of
  integers, a set of integer pairs or a set of integer
  triplets

Pairs naturally arise from the row-column codes of tables used to
present character glyphs, and triplets from arrays of tables.

These multi-dimensional indexes easily map to integers.
But this is often done differently. For 94-character CCS's
we regularly use
- hexadecimal notation: 0x41

Taking Ken Lunde's [CJK.INF] as an example of a document
discussing 94x94 CCS's, we see two different notations:
- decimal notation, items dash separated,
  each counted from 1: 06-85
- hexadecimal notation, items glued together,
  each counted from 0x21: 0x6161

Similar variations should be possible with 94x94x94 CCS's.
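
The two notations and a plain integer index convert into each other
mechanically. The helper names below are hypothetical; the only
assumption is the one stated above, that row and cell run from 1 to
94 and that the hexadecimal form adds 0x20 to each.

```python
def kuten_to_hex(row: int, cell: int) -> int:
    """Pair counted from 1 -> glued hexadecimal code counted from 0x21."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return ((row + 0x20) << 8) | (cell + 0x20)


def hex_to_kuten(code: int) -> tuple:
    """Glued hexadecimal code -> (row, cell) pair counted from 1."""
    row, cell = (code >> 8) - 0x20, (code & 0xFF) - 0x20
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("both octets must be in 0x21..0x7E")
    return row, cell


def kuten_to_index(row: int, cell: int) -> int:
    """Flatten the pair into a single integer in 0..8835."""
    return (row - 1) * 94 + (cell - 1)


print(f"{kuten_to_hex(6, 85):#06x}")        # 0x2675
print("%02d-%02d" % hex_to_kuten(0x6161))  # 65-65
```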

APPENDIX C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS

C1. [RFC 2130], [RFC 2278]:
  Character Encoding Scheme is a mapping from a Coded Character
  Set (or several) to a set of octets.

C2. [RFC 2130]:
  A definition of a character encoding scheme consists of:
  - A description of an algorithm which transforms every
    possible sequence of octets to either a sequence of
    pairs <CCS, code value> or to the error state "illegal
    octet sequence"
  - Specifications, either by reference to CCS's registered by
    IANA or in text, of each CCS upon which this CES is based.
C3. [RFC 2278]:
   The term "charset" ... is used here to refer
   to a method of converting a sequence of octets into a
   sequence of characters... unconditional and unambiguous
   conversion in the other direction is not required

A notable difference is that C2 allows several octet sequences to
map to a single <CCS, code value> sequence while C1 does not.
We may of course say that C1 is dominating and outlaw multiple octet
sequences, but then a "charset" according to the C3 definition is not
automatically a CES, which breaks our neat classification. So the
author prefers to silently set aside the C1 definition (and
effectively make C2 dominating).

In practice this issue is not that important because CES's try to
avoid associating several octet sequences with the same
<CCS, code value> sequence.
UTF-8 prescribes using the shortest possible byte sequence
to represent every Unicode code point, and calls every other
representation "malformed".
ISO-2022-* family members do not use ISO 2022's awesome
power to its full extent and thus rule out most possible
multiplicities.
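
The UTF-8 rule can be observed with Python's strict UTF-8 decoder,
taken here as one example of a conforming implementation. The helper
name is_well_formed_utf8 is hypothetical.

```python
def is_well_formed_utf8(octets: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the octets."""
    try:
        octets.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


shortest = b"\x2f"       # U+002F SOLIDUS in its one-octet form
overlong = b"\xc0\xaf"   # the same code point padded to two octets

print(is_well_formed_utf8(shortest))  # True
print(is_well_formed_utf8(overlong))  # False
```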

Here is an example of a multiplicity that does occur, however.
All the following sequences of octets produce the same
sequence of <CCS, code value> pairs in ISO-2022-JP:
  ESC $ B  0x50 0x50
  ESC $ B  ESC $ B  0x50 0x50
  ESC $ B  ESC ( B  ESC $ B 0x50 0x50
  ...

Here ESC $ B means that the following octets should be interpreted
as pairs coding characters in the JIS X 0208-1983 coded character
set. ESC ( B denotes that the following octets should be interpreted
as ASCII. The point is that redundant escape sequences
may be added quite freely. Of course it is easy to establish a
normalization transformation that removes redundant escape
sequences, but ISO-2022-JP does not forbid them.
Hence the C1 definition should probably be silently dropped in
favour of C2.
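
The multiplicity can be reproduced with Python's "iso2022_jp" codec,
assumed here to behave like a typical ISO-2022-JP decoder. The
variable names are made up for this sketch.

```python
ESC = b"\x1b"
TO_JISX0208 = ESC + b"$B"   # ESC $ B
TO_ASCII = ESC + b"(B"      # ESC ( B

variants = [
    TO_JISX0208 + b"\x50\x50",
    TO_JISX0208 + TO_JISX0208 + b"\x50\x50",
    TO_JISX0208 + TO_ASCII + TO_JISX0208 + b"\x50\x50",
]

# Decode each octet sequence and count the distinct results.
decoded = [v.decode("iso2022_jp") for v in variants]
print(len(set(decoded)))
```

Three different octet sequences collapse into one character
sequence, so the mapping from octets to characters is many-to-one.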

APPENDIX D. RFC 2130 ON MIME CHARSET

It was [RFC 2130] that introduced the CES (3.3) definition.
It is all the more amusing to see that [RFC 2130] itself does not
use it to full power and becomes tautological:

  ... in MIME, the Coded Character Set and Character Encoding
  Scheme are specified by the Charset parameter to the
  Content-Type header field ...

Every CES is already associated with a set of CCS's (3.3).
Rigorously, it would be enough to say:

  ... in MIME, the Character Encoding Scheme is specified by
  the Charset parameter to the Content-Type header field ...

or, more verbosely,

  "charset", as defined by this document and as specified by the
  Charset parameter to the Content-Type header field, is a
  synonym for a CES. As such, the Charset parameter to the
  Content-Type header completely defines how to map the result
  of de-applying the Transfer Encoding Syntax to the binary
  representation of the message body to a sequence of <CCS,
  code value> pairs.

Of course both variants are much less intuitive than
the original RFC's text.

APPENDIX E. REFERENCES

[RFC 2278] IANA Charset Registration Procedures.
           N. Freed, J. Postel. January 1998.
           http://www.ietf.org/rfc/rfc2278.txt

[RFC 2130] The Report of the IAB Character Set Workshop held
           29 February - 1 March, 1996.
           C. Weider, C. Preston, K. Simonsen, H. Alvestrand,
           R. Atkinson, M. Crispin, P. Svanberg. April 1997.
           http://www.ietf.org/rfc/rfc2130.txt

[RFC 2045] Multipurpose Internet Mail Extensions (MIME)
           Part One: Format of Internet Message Bodies.
           N. Freed, N. Borenstein. November 1996.
           http://www.ietf.org/rfc/rfc2045.txt

[RFC 1345] Character Mnemonics and Character Sets.
           K. Simonsen. June 1992.
           http://www.ietf.org/rfc/rfc1345.txt

[ECMA 35]  Character Code Structure and Extension Techniques
           Standard ECMA-35
           6th Edition. December 1994.
           http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM
           (This is a freely accessible analog of ISO 2022)

[UNICODE]  The Online Edition of The Unicode Standard,
           Version 3.0.
           http://www.unicode.org/unicode/uni2book/u2.html

[UNICODE CHAPTER 3]
           The Online Edition of The Unicode Standard,
           Version 3.0. Chapter 3. Conformance.
           http://www.unicode.org/unicode/uni2book/ch03.pdf

[IANA REG] The Character Sets Registry
           (IANA registers charset values according to RFC 2278)
           http://www.iana.org/assignments/character-sets

[CJK.INF]  CJK.INF Version 2.1
           Online Companion to
           "Understanding Japanese Information Processing"
           Ken Lunde. July 12, 1996
           http://www.oreilly.com/people/authors/lunde/cjk_inf.html

[DUERST]   RE: modification to registration of charset ks_c_5601-1987
           Martin Duerst. Jun 13 2001
           Message in ietf-charsets archive
           http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

[Connoly]  Character Set Considered Harmful
           INTERNET-DRAFT
           May 2, 1995
           http://www.w3.org/MarkUp/html-spec/charset-harmful.html

[Lee]      RE: codes:chars is many:one?
           Message in www-archive@w3.org
           Liam Quin. Jan 30 2002
           http://lists.w3.org/Archives/Public/www-archive/2002Jan/0152.html