"CHARACTER SET" TERMINOLOGY SURVEY, VERSION 0.5

Anton Tagunov http://tagunov.tripod.com tagunov@motor.ru	Moscow State University Scientific Computer Research Center http://srcc.msu.su
Apr 4 2002	text version of this document

1. INTRODUCTION

The binary representation of textual data is governed by character set standards.

Globally unique names (like KS_C_5601-1987) refer to specifications.

The goal of this survey is to classify named specifications (JIS_C6220-1969-jp, JIS_C6220-1969-ro, JIS_X0208-1983, etc.).

To achieve this, the current character set related terminology is surveyed (3) and reduced to a three-category classification (2.2).

The categories of the (2.2) classification overlap, however, and section (5) advocates adoption of a new classification (2.1). The (2.1) classification differs from (2.2) only in that its categories are disjoint.

Sections (6), (7), (8) practice this classification on a large body of standards.

Section (8) 'CLASSIFICATION EXAMPLE. CJK STANDARDS' has unintentionally become a brief reference on CJK standards.

Its subsection (8.3) is notably the point from which the whole survey has grown: it highlights a certain confusion surrounding the most basic CJK standards. Please don't miss 8.3 :-)

Section (9) reflects the author's great desire to collect as much feedback as possible (tagunov@motor.ru).

Subsections (9.2), (9.3) also express a desire to fill any gaps and resolve any discrepancies that readers are likely to notice, and point out the areas where readers' help is desirable. Let us consider this version 0.1 of the survey! ;-)

Section (4) discusses how [RFC 2130] undermines the industry's "coded character set" terminology and what the best ways to get around that are.

Appendices B, C, D contain miscellaneous discussions excluded from the main body of the survey.

Have a good read! :-)

2. CLASSIFICATIONS

2.1 DUAL, CES-ONLY, CCS-ONLY

This is the classification proposed by the current survey. It is derived from the classification introduced in (2.2) by making the categories disjoint.

See section (5) for an explanation of why this survey finds this classification the most fruitful and advocates it over the (2.2) one.

DUAL      specifications that define
          a CCS and a CES
          (using this CCS and
           possibly other CCS's
           defined elsewhere to define the CES)

CES-ONLY  specifications that define
          a CES
          using CCS's defined elsewhere
          (and not specifying any CCS's
           on their own)

CCS-ONLY  specifications that define
          a CCS
          (and no CES's)

This classification is complete and disjoint: every character set specification fits into one of these categories, but none fits into two.

(See subsections (3.2), (3.3) for definitions of CES and CCS.)

2.2 CCS, CES, ISO-CCS

This is the classification derived in section (3) from definitions found in ISO and RFC standards. These standards operate with notions such as

"character set"
"coded character set"
"charset"
"character encoding scheme"

The whole diversity of definitions given to these terms reduces to 3 categories:

CCS      Coded Character Set       after [RFC 2130]
CES      Character Encoding Scheme after [RFC 2130]
ISO-CCS  Coded Character Set       after ISO, cited
                                   by [RFC 1345]

This classification is complete but not disjoint. Here is how it relates to the (2.1) classification:

CES = DUAL + CES-ONLY
CCS = DUAL + CCS-ONLY
ISO-CCS = DUAL
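
The three relations above can be written down as a small lookup. The following Python sketch is only an illustration: the names Category21 and categories_22 are made up for this example and do not come from any of the cited standards.

```python
from enum import Enum


class Category21(Enum):
    """The disjoint categories of classification (2.1)."""
    DUAL = "DUAL"
    CES_ONLY = "CES-ONLY"
    CCS_ONLY = "CCS-ONLY"


def categories_22(cat):
    """Return the set of (2.2) categories that a (2.1) category falls into."""
    result = set()
    if cat in (Category21.DUAL, Category21.CES_ONLY):
        result.add("CES")        # CES = DUAL + CES-ONLY
    if cat in (Category21.DUAL, Category21.CCS_ONLY):
        result.add("CCS")        # CCS = DUAL + CCS-ONLY
    if cat is Category21.DUAL:
        result.add("ISO-CCS")    # ISO-CCS = DUAL
    return result


for cat in Category21:
    print(cat.value, "->", sorted(categories_22(cat)))
```

A DUAL specification lands in all three (2.2) categories at once, which is exactly why (2.2) is not disjoint.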

3. CURRENT TERMINOLOGY SURVEY

3.1 CHARACTER

The term "character" raises little disagreement. Both [RFC 2278] and [ECMA 35] define it as

a member of a set of elements used for the organization, control, or representation of data.

This is close to the definition in [UNICODE] and in other standards.

3.2 RFC 2130 CCS

[RFC 2130], [RFC 2278] formulate

Coded Character Set (CCS) is a mapping from a set of abstract characters to a set of integers

To avoid ambiguity we shall refer to this definition as 'CCS', because this abbreviation does not seem to be used in any other meaning.

Examples (after these RFC's): ISO 10646, US-ASCII, ISO-8859-*.

Please refer to Appendix B for a remark on CCS definition.

Specifications that match this definition are classified as CCS in (2.2).

3.3 CES

3.3.1 BASIC DEFINITION

[RFC 2130] and [RFC 2278] formulate:

Character Encoding Scheme (CES) is a mapping from a Coded Character Set or several coded character sets to a set of octets.

3.3.2 EXTENDED DEFINITION

[RFC 2130] also contains another abstract that we shall use as a CES definition instead:

A definition of a character encoding scheme consists of:

A description of an algorithm which transforms every possible sequence of octets to either a sequence of pairs <CCS, code value> or to the error state "illegal octet sequence"
Specifications, either by reference to CCS's registered by IANA or in text, of each CCS upon which this CES is based.
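
To illustrate the first item, here is a minimal Python sketch of such an algorithm for a made-up EUC-style CES built on two CCS's. The function name, the CCS labels "7BIT-CCS" and "94x94-CCS" and the octet ranges are assumptions chosen for the example; they are not taken from [RFC 2130] or from any registered charset.

```python
class IllegalOctetSequence(ValueError):
    """The error state "illegal octet sequence" of the extended definition."""


def decode_toy_euc(octets):
    """Turn octets into a list of (CCS name, code value) pairs.

    Hypothetical layout: octets 0x00-0x7F are code values of a 7-bit CCS,
    and two octets in 0xA1-0xFE form one code value of a 94x94 CCS.
    """
    pairs = []
    i = 0
    while i < len(octets):
        b = octets[i]
        if b <= 0x7F:
            pairs.append(("7BIT-CCS", b))
            i += 1
        elif 0xA1 <= b <= 0xFE:
            if i + 1 >= len(octets) or not 0xA1 <= octets[i + 1] <= 0xFE:
                raise IllegalOctetSequence(f"bad trail octet at offset {i + 1}")
            # Strip the high bits to get the 7-bit row and column octets.
            code = ((b - 0x80) << 8) | (octets[i + 1] - 0x80)
            pairs.append(("94x94-CCS", code))
            i += 2
        else:
            raise IllegalOctetSequence(f"bad lead octet at offset {i}")
    return pairs


print(decode_toy_euc(b"A\xd6\xd0"))
```

Every input either yields a sequence of pairs or raises the error, which is the shape the extended definition asks for.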

Please refer to Appendix C for a moderate discrepancy between these definitions.

Specifications that match these definitions are classified as CES in (2.2).

Following the latter definition we find two flavours of CES specifications:

Those that use only CCS's defined elsewhere: CES-ONLY in (2.1).

And those that contain "specifications .. in text of ... CCS upon which this CES is based". It is reasonable to assume that every specification defines no more than one CCS on its own. Hence the definition of category DUAL in the (2.1) classification.

The ISO-8859-* specifications seem both to define their own 96-character CCS and to reference another CCS (ASCII). This is what (2.1)'s DUAL description means by "using this CCS and possibly other CCS's defined elsewhere."

3.4 ISO-CCS

[RFC 1345] cites ISO definitions of a "coded character set":

The ISO definition of the term "coded character set" is as follows: "A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their coded representation."

As you can see a specification defines ISO 'coded character set' if and only if it defines both a CCS and a CES for it. This is exactly category DUAL in (2.1). Hence category ISO-CCS in (2.2) is exactly identical to DUAL in (2.1).

3.5 OTHER DEFINITIONS REDUCING TO CES DEFINITION

[RFC 1345] also has its own definition of a 'coded character set':

"A coded character set is a set of rules that unambiguously and completely determines which sequence of characters, if any, is represented by each possible sequence of ... bytes"

[RFC 2045] defines the term 'character set':

The term "character set" is used in MIME to refer to a method of converting a sequence of octets into a sequence of characters ...

from simple ... mappings such as US-ASCII to complex table switching methods such as those that use ISO 2022's techniques.

Even the newest [RFC 2278] that gives a CES definition parallels it with:

The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters ...

All these definitions are essentially equivalent to the CES definition (3.3).

4. RECOMMENDED TERMINOLOGY USAGE

The introduction of CES/CCS terminology by [RFC 2130] helps a lot in understanding the character set world. But it has also caused a certain ambiguity: 'Coded Character Set' has been used for years in the ISO meaning of DUAL (2.1), and the introduction of a CCS with a different meaning forces a terminology revision.

This section evaluates old and new terms one by one.

4.1 CES

Totally unambiguous and fine.

4.2 CCS

Certainly the use of the 3-letter abbreviation 'CCS' in a meaning different from the full form 'coded character set' is artificial. However, this is probably the best we can do now.

4.3 CODED CHARACTER SET

The full term 'coded character set' is currently caught in a conflict between the older ISO definition and the newer one, equivalent to CCS. Probably every speaker who chooses to use it should first clarify which meaning he/she adheres to.

4.4 CHARACTER SET

The term 'character set' is quite ambiguous. It may be understood as

an abbreviation of 'coded character set' (in the ISO meaning), equivalent to CES
an abbreviation of 'coded character set' (in the [RFC 2130] meaning), equivalent to CCS
an expansion of the 'charset' parameter of the Content-Type MIME header, also probably equivalent to CES (4.5)

This survey's point of view is that it is best to use this term only inside the phrase 'character set standards', meaning the whole body of heterogeneous standards.

4.5 CHARSET

Many standards (including MIME and HTTP ones) treat 'charset' as a shorthand for 'character set' or 'coded character set'. Both of these have become ambiguous.

But over the years 'charset' has been used to mean "what you may pass as the 'charset' parameter of the Content-Type MIME or HTTP header". In this meaning 'charset' is equivalent to CES.

Recognising this common usage, [RFC 2278] has proposed to detach 'charset' from any other terms and use it on its own, effectively as a synonym for CES. [RFC 2278]:

The term "charset" (see historical note below) is used here to refer to a method of converting a sequence of octets into a sequence of characters.
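
The "charset as a method of converting octets into characters" reading can be tried out directly. The sketch below assumes Python's standard email package and codec registry as stand-ins for a MIME implementation; the message and its one-octet body are invented for the example.

```python
from email.message import Message

msg = Message()
msg["Content-Type"] = 'text/plain; charset="KOI8-R"'

# get_content_charset() returns the charset parameter, lower-cased.
charset = msg.get_content_charset()

# Octets of the message body (here the single KOI8-R octet 0xD1).
body_octets = b"\xd1"

# The charset name selects the octets-to-characters conversion.
text = body_octets.decode(charset)
print(charset, hex(ord(text)))
```

Nothing but the charset name is needed to pick the conversion, which is the sense in which 'charset' behaves like a CES.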

A related discussion of the MIME charset parameter and [RFC 2130] may also be found in Appendix D.

The author of this article will probably continue to use this term, in exactly the same meaning as before.

4.6 ENCODING

A very general term and probably fine as it is. The terminology revision has not touched it.

It may mean different things depending on context. In certain circumstances it may be equivalent to CES, in others to a CES combined with a 'Transfer Encoding Syntax' [RFC 2130].

And it is not necessarily related to the binary representation of textual data.

5. ADVOCATING NEW TERMINOLOGY

This section will show that the current [RFC 2130] CES and CCS definitions may not be narrow enough for certain purposes.

If a specification is classified as a CES, it is not clear if that specification also defines a CCS.

If a specification is classified as a CCS, it is not clear if that specification also defines a CES.

A CES specification is explicitly allowed to include or not include a "private" CCS specification (3.3.2).

It's perfectly okay per se, but this survey calls for establishing different family names for these two categories of CES specifications.

This survey's opinion is that adopting terminology based on the DUAL, CES-ONLY, CCS-ONLY classification (2.1) would allow a better taxonomy of character set standards to be established.

6. IANA CHARACTER SET REGISTRY

As an epigraph, [RFC 2130]:

... the MIME registry of character sets ... contains items that may differ greatly in their applicability and semantics in various Internet protocols.

[RFC 2278] says:

MIME ... and various other modern Internet protocols are capable of using many different charsets. This registration procedure exists ... to associate a specific name or names with a given charset and to give an indication of whether or not a given charset can be used in MIME text objects.

[IANA REG] is the registry established by this RFC.

[IANA REG] contains records both for DUAL (2.1) and CES-ONLY (2.1) standards.

[IANA REG] does not contain records for CCS-ONLY (2.1) standards.

The probable reason is that the ISO "coded character set" definition (3.4) does not allow creating a CCS-ONLY (2.1) standard, only a DUAL (2.1) one.

This explains why many of the standards (including those registered at [IANA REG]) are DUAL (2.1) only formally, and function as CCS-ONLY (2.1).

Naturally, the entries labeled MIME applicable (that have a "preferred MIME name") are full weight CES's and are either CES-ONLY's or DUALS both de facto and de jure. The above remark is applicable to a subset of entries in the MIME registry that do not have a "preferred MIME name".

The following sections (7) and (8) classify a number of standards, including those registered in [IANA REG].

Section (8.3) is specially devoted to standards that are DUAL by their form but CCS-ONLY by their nature.

7. CLASSIFICATION EXAMPLE. NON-CJK STANDARDS

This section classifies some of the non-CJK and Unicode-related standards against DUAL, CCS-ONLY, CES-ONLY classification (2.1).

Standards are referred to by their [IANA REG] preferred MIME name if they have one.

7.1 DUAL

US-ASCII, ISO-8859-*, KOI8-R

7.2 CES-ONLY

UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4

7.3 CCS-ONLY

Unicode 3.2
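
The split between a CCS-ONLY standard and the CES-ONLY schemes built on it can be seen with a single character. This is a sketch using Python's built-in codecs as stand-ins; the choice of character is arbitrary.

```python
ch = "\u044f"  # CYRILLIC SMALL LETTER YA

# CCS side: Unicode assigns this abstract character one integer.
code_value = ord(ch)
print(hex(code_value))

# CES side: several CES-ONLY schemes serialise that same integer differently.
for name in ("utf-8", "utf-16-be", "utf-32-be"):
    print(name, ch.encode(name).hex())

# A DUAL standard such as KOI8-R uses its own code value for the character.
print("koi8-r", ch.encode("koi8-r").hex())
```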

8. CLASSIFICATION EXAMPLE. CJK STANDARDS.

Classification of CJK standards used to be a great source of confusion, at least for the author of this survey (and was the original incentive to write this survey.)

Note: this section references character specifications by [IANA REG] names if such names exist. For completeness (8.5) relates [IANA REG] names back to the names of official standards that contain the specifications.

8.1 CES-ONLY

8.1.1 MULTIBYTE

Specifications like

ISO-2022-KR (reportedly unused)
EUC-KR
ISO-2022-JP  ISO-2022-JP-1 ISO-2022-JP-2
Shift_JIS   (as defined by Appendix 1 of JIS X 0208:1997)
EUC-JP
GB2312      (8-bit EUC-style CES)
EUC-TW

ISO-2022-CN-EXT

easily classify as CES-ONLY (2.1).

8.1.2 SINGLE BYTE

JIS_X0201 [RFC 1345] is an 8-bit composition of two 7-bit DUAL's (2.1):

JIS_C6220-1969-ro (ISO646-JP)
JIS_C6220-1969-jp (katakana)

8.2 DUAL

8.2.1 Big5

This standard defines its own CCS and a CES, hence it is classified as DUAL.

8.2.2 CP950

Microsoft extension to Big5. Erroneously referred to as Big5 by Microsoft products. Adds characters to Big5, thus defining its own CCS.

8.2.3 CP932

Microsoft extension to Shift_JIS. Erroneously(?) referred to as Shift_JIS by Microsoft products.

The official Shift_JIS includes only the JIS X 0201 and JIS X 0208 repertoire, while Microsoft has always used the name Shift_JIS for an encoding of a wider repertoire.
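
The difference in repertoire can be observed with one character. The sketch below assumes that Python's built-in "shift_jis" and "cp932" codecs are acceptable stand-ins for the official and the Microsoft variants; the sample characters are chosen for the example.

```python
ch = "\u2460"  # CIRCLED DIGIT ONE, an example of a vendor-added character

# Python's "cp932" codec follows the Microsoft variant.
cp932_octets = ch.encode("cp932")

# Python's "shift_jis" codec is restricted to the narrower repertoire.
try:
    ch.encode("shift_jis")
    in_shift_jis = True
except UnicodeEncodeError:
    in_shift_jis = False

# A character from JIS X 0208 encodes identically under both names.
same_for_kanji = "\u65e5".encode("cp932") == "\u65e5".encode("shift_jis")

print(cp932_octets.hex(), in_shift_jis, same_for_kanji)
```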

As a historical predecessor, Microsoft's variant probably has more right to the name, although it may be objected that Microsoft shouldn't have used JIS as part of the name in the first place.

8.2.4 CP936, GBK

Microsoft extension to GB2312. Erroneously referred to as GB2312 by Microsoft products. Backward compatible with GB2312. Extends GB2312 with unified Han characters from ISO 10646-1:1993 not already present in GB_2312-80.

New characters are inserted into unused positions in the natural EUC-CN 0x8181 - 0xFEFE range and into the newly allocated 0x8140 - 0xFE7E range.
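
Both properties, backward compatibility and extension, can be checked with two characters. This is a sketch that assumes Python's built-in "gb2312" and "gbk" codecs as stand-ins for the EUC-style GB2312 CES and for GBK; the sample characters are chosen for the example.

```python
simplified = "\u4e2d"    # present in GB_2312-80
traditional = "\u9ad4"   # a traditional form, absent from GB_2312-80

# Backward compatibility: same octets under both codecs.
compatible = simplified.encode("gb2312") == simplified.encode("gbk")

# Extension: only the GBK codec can encode the second character.
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
gbk_octets = traditional.encode("gbk")

print(compatible, in_gb2312, gbk_octets.hex())
```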

8.2.5 CP949

Microsoft extension to EUC-KR. Also known as Unified Hangul Code, UHC. Erroneously referred to as KS_C_5601-1987 by Microsoft products. Adds 8822 pre-combined Hangul syllables to EUC-KR.

Uses the same extension technique as GBK.

8.2.6 GB 18030-2000

China national standard. Extends GBK. Introduces 4 byte codings for characters. Provides space for all assigned and unassigned Unicode 3.2 (BMP plus 16 extension planes) code points.
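
The variable length of the encoding is easy to observe. The following sketch assumes Python's built-in "gb18030" codec as a stand-in for the standard; the three sample characters are chosen for the example.

```python
samples = {
    "ascii": "A",
    "gbk_range": "\u4e2d",
    "supplementary": "\U00010000",  # first code point beyond the BMP
}

# Length in octets of each sample under the gb18030 codec.
lengths = {name: len(ch.encode("gb18030")) for name, ch in samples.items()}
print(lengths)

# The two-byte part is shared with GBK.
shared = samples["gbk_range"].encode("gb18030") == samples["gbk_range"].encode("gbk")
print(shared)
```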

8.2.7 ISO646-* SINGLE-BYTES

Single byte

JIS_C6220-1969-ro    alias ISO646-JP
GB_1988-80           alias ISO646-CN
KSC5636              alias ISO646-KR

all serve as "raw material" for the CES-ONLY's listed in (8.1) by defining 94-character CCS's, and they also define ASCII-like 7-bit CES's. The differences from ASCII are the Yen, Yuan and Won symbols replacing "$" (0x24) or "\" (0x5C), and other minor changes, as is natural for the ISO646-* family.
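
The currency-sign replacements can be sketched as a table of overrides on top of ASCII. The function name decode_iso646 and the table are made up for this example. Only the currency-sign positions are modelled; the other minor differences of these standards are deliberately left out, so this is not a full decoder.

```python
# Currency-sign replacements only; the other minor differences are left out.
CURRENCY_OVERRIDES = {
    "ISO646-JP": {0x5C: "\u00a5"},  # YEN SIGN replaces "\"
    "ISO646-CN": {0x24: "\u00a5"},  # YUAN (same Unicode character) replaces "$"
    "ISO646-KR": {0x5C: "\u20a9"},  # WON SIGN replaces "\"
}


def decode_iso646(octets, variant):
    """Decode 7-bit octets as ASCII, then apply the variant's overrides."""
    overrides = CURRENCY_OVERRIDES[variant]
    out = []
    for b in octets:
        if b > 0x7F:
            raise ValueError("not a 7-bit octet: 0x%02X" % b)
        out.append(overrides.get(b, chr(b)))
    return "".join(out)


for variant in CURRENCY_OVERRIDES:
    print(variant, decode_iso646(b"$100 \\100", variant))
```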

8.3 CONTROVERSIAL: DUAL OR CCS-ONLY?

The following standards cited by [RFC 1345] (and registered by [IANA REG])

JIS_C6220-1969-jp    alias JIS_C6220-1969   alias iso-ir-13

JIS_C6226-1983       alias JIS_X0208-1983   alias iso-ir-87
JIS_X0212-1990                              alias iso-ir-159
GB_2312-80                                  alias iso-ir-58
KS_C_5601-1987       alias KS_C_5601-1989   alias iso-ir-149
                     alias KSC_5601

are in a peculiar position. The ISO "coded character set" definition (all these are ISO standards, as the iso-ir-xyz aliases show) does not allow creating a CCS-ONLY standard (2.1), only a DUAL (2.1) one. So each of these standards additionally defines a 7-bit CES. We shall call this CES "raw" or "implied".

The natural ("cooked") role of these standards is to help build the CES's listed in (8.1). (Hence we call this CES "raw".)

The "raw" CES is used so rarely that we may suspect it was not even the intention of the standards' creators. (Hence we also call this CES "implied".)

For example JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that encodes

regular (ISO 6429) control characters at 0x00 - 0x1F
SPACE and DELETE at 0x20, 0x7F
Katakana at 0x21 - 0x7E.

It is not clear if this CES is used at all, but it is probably not (see also section (9)).

JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987 have even less usable "raw" 7-bit CES's. These CES's are double-byte 7-bit CES's that have neither control characters nor SPACE and DELETE. Naturally, CES's that do not encode CR, LF and SPACE are only of limited use.

They are known, however, to be used with fonts in the X Window System.

Other 94x94 standards should be in the same position of implicitly defining mostly useless multibyte 7-bit CES's without SPACE, BACKSPACE, CR, LF and other control characters. These probably include

ISO-IR-165
GB/T 12345-90

8.4 CCS-ONLY: CNS 11643-1992

CNS 11643-1992 defines 16 94x94 planes. The first two planes have almost the same set of characters as Big5, but at different code points. It is used only inside EUC-TW. It does not define an implicit "raw" CES like those described in (8.3), because only each of its planes, not the standard as a whole, might be vulnerable to this "implicit" creation.

8.5 STANDARDS' NAMES REFERENCE

This material probably does not belong here, but since section (8) has already become a listing of CJK standards, for completeness here is a reference from [IANA REG] names to original standard names. (It is not clear why this information is missing from [IANA REG].) Multiple standard names are given if a standard has several names or if several versions of a standard are identical in their definition of the given CES. [IANA REG] aliases are marked with '='.

94 DUALs

[IANA REG]                                   original name(s)

JIS_C6220-1969-ro = ISO646-JP,               JIS C 6220-1969
JIS_C6220-1969-jp = katakana =
                    JIS_C6220-1969

GB_1988-80        = ISO646-CN                GB 1988-80

KSC5636           = ISO646-KR                KS C 5636-1993
                                             KS C 5636-1989

94x94 DUALs

JIS_C6226-1983 = JIS_X0208-1983              JIS C 6226-1983
                                             JIS X 0208:1983
JIS_X0212-1990                               JIS X 0212:1990
GB_2312-80                                   GB 2312-80

KS_C_5601-1987 = KS_C_5601-1989 = KSC_5601   KS C 5601-1987
                                             KS C 5601-1992
                                             KS X 1001:1997

Extensions/revisions of the mentioned 94x94 DUAL's (2.1) not registered at [IANA REG]:

JIS X 0208:1990
JIS X 0208:1997
KS X 1001:1998  (euro and one other char added)

9. CALL FOR FEEDBACK AND CONTRIBUTION

9.1 COMMENTS MORE THAN WELCOME

The author of this survey would very much like to receive as much feedback on this article as possible.

Please send me all kinds of comments on this survey: your opinion on its topicality, all factual mistakes, any statements that you find controversial!

Your corrections will be incorporated into this document as soon as possible or, if the author considers them arguable, will probably form 'Appendix F. SPECIAL OPINIONS'.

9.2 EXTENDING THE LIMITS

The author is limited to surveying documents freely available online.

Specifically he has no access to ISO standards, or their analogs, except for [ECMA 35].

The main sources of information on CJK have been [RFC 1345], [CJK.INF] and the most helpful replies from Autrijus Tang and Jungshik Shin on perl-unicode@perl.org.

Therefore contributions from readers who have access to documents that the author of this survey does not have access to may significantly improve it.

Issue #1. Section (3.4) 'ISO CCS' would benefit from a direct reference to some ISO standard that gives the ISO definition of "coded character set".

Issue #2. Section (3.5) 'OTHER DEFINITIONS REDUCING TO CES DEFINITION' would benefit from listing more definitions of "character set" and related identities.

Issue #3. If for any standard (especially CJK) its most official name (given by its original registration body) is not listed or is misspelled (down to ':' vs '-' differences and a wrong number of spaces), it would be highly appropriate to correct this. Please do tell me.

Issue #4. The author of this document would very much welcome the inclusion of other CJK standards into the classification, section (8).

Also, if you feel there are strong reasons for inclusion of non-CJK standards into section (7), please do tell me.

9.3 DOUBTS AND GAPS

Issue #5. Is "raw" 7-bit "implied" CES for JIS_C6220-1969-jp used for any purpose? Has section (8.3) been wrong in saying it isn't?

Issue #6. How correct is section (8.4), which lists CNS 11643-1992 as the only CCS-ONLY standard in the CJK world?

Issue #7. What CES's are used with CCCII? Does the CCCII standard specify a CES? If yes, what is it called? What is the full name of the standard?

Issue #8. What CES's are used for ANSI Z39.64-1989? Does the ANSI Z39.64-1989 standard specify a CES? If yes, what is it called?

Help will be highly welcome!

Appendix A. THANKS

Thanks to

Autrijus Tang
Jungshik Shin

and other posters on perl-unicode@perl.org for detailed discussions of CJK standards!

Ken Lunde

for his super-informative [CJK.INF]

And special thanks to

Dan Kogai

for developing and maintaining the Perl Encode module, which got me involved with character encoding issues.

To be continued :-)

Appendix B. UPGRADING CCS DEFINITION

It may be worth understanding the CCS definition in a slightly extended way:

CCS is a mapping from a set of abstract characters to a set of integers, a set of integer pairs or a set of integer triplets

Pairs naturally arise from the row-column codes of tables used to present character glyphs, and triplets from arrays of tables.

These multi-dimensional indexes easily map to integers. But this is often done differently: for 94-character CCS's we regularly use

hexadecimal notation: 0x41

Taking Ken Lunde's [CJK.INF] as an example of a document discussing 94x94 CCS's, we see two different notations:

decimal notation, items dash-separated, counting from 1: 06-85
hexadecimal notation, items glued together, each counted from 0x21: 0x6161

Similar variations should be possible with 94x94x94 CCS's.
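
The two notations for 94x94 CCS's differ only by an offset, so converting between them is mechanical. Here is a minimal Python sketch; the function names kuten_to_hex and hex_to_kuten are made up for this example.

```python
def kuten_to_hex(row, cell):
    """Convert 1-based row-cell numbers (1..94) to the two-octet code value."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    # Counting from 0x21 means adding 0x20 to each 1-based index.
    return ((row + 0x20) << 8) | (cell + 0x20)


def hex_to_kuten(code):
    """Inverse conversion: two-octet code value to (row, cell)."""
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("each octet must be in 0x21..0x7E")
    return hi - 0x20, lo - 0x20


print("%02d-%02d -> 0x%04X" % (6, 85, kuten_to_hex(6, 85)))
print("0x%04X -> %02d-%02d" % ((0x6161,) + hex_to_kuten(0x6161)))
```

Under this conversion 06-85 corresponds to 0x2675 and 0x6161 corresponds to 65-65.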

Appendix C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS

C1. [RFC 2130], [RFC 2278]:

Character Encoding Scheme is a mapping from a Coded Character Set (or several) to a set of octets.

C2. [RFC 2130]:

A definition of a character encoding scheme consists of:

A description of an algorithm which transforms every possible sequence of octets to either a sequence of pairs <CCS, code value> or to the error state "illegal octet sequence"
Specifications, either by reference to CCS's registered by IANA or in text, of each CCS upon which this CES is based.

C3. [RFC 2278]:

The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters... unconditional and unambiguous conversion in the other direction is not required

A notable difference is that C2 allows several octet sequences to map to a single <CCS, code value> sequence while C1 does not. We may of course say that C1 dominates and outlaw multiple octet sequences, but then a "charset" according to the C3 definition is not automatically a CES, which breaks our neat classification. So for now the author prefers to silently reverse the C1 definition (and effectively make C2 dominant).

In practice this issue is not that important, because CES's try to avoid associating several octet sequences with the same <CCS, code value> sequence. UTF-8 prescribes using the shortest possible byte sequence to represent every Unicode code point, and calls every other representation "malformed". ISO-2022-* family members do not use ISO 2022's awesome power to its full extent and thus rule out most possible multiplicities.

Here is an example of a multiplicity that does occur, however. All the following sequences of octets produce the same sequence of <CCS, code value> pairs in ISO-2022-JP:

ESC $ B  0x50 0x50
ESC $ B  ESC $ B  0x50 0x50
ESC $ B  ESC ( B  ESC $ B 0x50 0x50
...

Here ESC $ B means that the following octets should be interpreted as pairs coding characters in the JIS X 0208-1983 coded character set. ESC ( B denotes that the following octets should be interpreted as ASCII. The point is that redundant escape sequences may be added quite freely. Of course it is easy to establish a normalization transformation that removes redundant escape sequences, but ISO-2022-JP does not forbid them. Hence the C1 definition should probably be silently dropped in favour of C2.
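
The multiplicity can be reproduced with an existing decoder. The sketch below assumes Python's built-in "iso2022_jp" codec as a stand-in for an ISO-2022-JP implementation; the constant names are made up for the example.

```python
ESC = b"\x1b"
TO_JISX0208 = ESC + b"$B"   # ESC $ B
TO_ASCII = ESC + b"(B"      # ESC ( B
PAYLOAD = bytes([0x50, 0x50])

sequences = [
    TO_JISX0208 + PAYLOAD,
    TO_JISX0208 + TO_JISX0208 + PAYLOAD,
    TO_JISX0208 + TO_ASCII + TO_JISX0208 + PAYLOAD,
]

# A trailing ESC ( B is appended so that every stream ends in the ASCII state.
decoded = [(seq + TO_ASCII).decode("iso2022_jp") for seq in sequences]

print([text.encode("unicode_escape") for text in decoded])
print(len(set(decoded)) == 1)
```

Three different octet sequences decode to one and the same character sequence, which is the situation C2 permits and C1 does not.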

Appendix D. RFC 2130 ON MIME CHARSET

It was [RFC 2130] that introduced the CES (3.3) definition. It is all the more amusing to see that [RFC 2130] itself does not use it to its full power and turns tautological:

... in MIME, the Coded Character Set and Character Encoding Scheme are specified by the Charset parameter to the Content-Type header field ...

Every CES is already associated with a set of CCS's (3.3). Rigorously, it would be enough to say:

... in MIME, the Character Encoding Scheme is specified by the Charset parameter to the Content-Type header field ...

or, more verbosely,

"charset", as defined by this document and as specified by the Charset parameter to the Content-Type header field, is a synonym for a CES. As such, the Charset parameter to the Content-Type header completely defines how to map the octets obtained by undoing the Transfer Encoding Syntax of the message body to a sequence of <CCS, code value> pairs.

Of course both variants are much less intuitive than the original RFC's text.

Appendix E. REFERENCES

[RFC 2278]: IANA Charset Registration Procedures.
N. Freed, J. Postel. January 1998.
http://www.ietf.org/rfc/rfc2278.txt

[RFC 2130]: The Report of the IAB Character Set Workshop held
29 February - 1 March, 1996.
C. Weider, C. Preston, K. Simonsen, H. Alvestrand,
R. Atkinson, M. Crispin, P. Svanberg. April 1997.
http://www.ietf.org/rfc/rfc2130.txt

[RFC 2045]: Multipurpose Internet Mail Extensions (MIME)
Part One: Format of Internet Message Bodies.
N. Freed, N. Borenstein. November 1996.
http://www.ietf.org/rfc/rfc2045.txt

[RFC 1345]: Character Mnemonics and Character Sets.
K. Simonsen. June 1992.
http://www.ietf.org/rfc/rfc1345.txt

[ECMA 35]: Character Code Structure and Extension Techniques
Standard ECMA-35
6th Edition. December 1994.
http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM
(This is a freely accessible analog of ISO 2022)

[UNICODE]: The Online Edition of The Unicode Standard,
Version 3.0.
http://www.unicode.org/unicode/uni2book/u2.html

[UNICODE CHAPTER 3]: The Online Edition of The Unicode Standard,
Version 3.0. Chapter 3. Conformance.
http://www.unicode.org/unicode/uni2book/ch03.pdf

[IANA REG]: The Character Sets Registry
(IANA registers charset values according to RFC 2278)
http://www.iana.org/assignments/character-sets

[CJK.INF]: CJK.INF Version 2.1
Online Companion to
"Understanding Japanese Information Processing"
Ken Lunde. July 12, 1996
http://www.oreilly.com/people/authors/lunde/cjk_inf.html

[DUERST]: RE: modification to registration of charset ks_c_5601-1987
Martin Duerst. Jun 13 2001
Message in ietf-charsets archive
http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

[Connolly]: Character Set Considered Harmful
INTERNET-DRAFT
May 2, 1995
http://www.w3.org/MarkUp/html-spec/charset-harmful.html

[Lee]: RE: codes:chars is many:one?
Message in www-archive@w3.org
Liam Quin. Jan 30 2002
http://lists.w3.org/Archives/Public/www-archive/2002Jan/0152.html

TABLE OF CONTENTS

1. INTRODUCTION

2. CLASSIFICATIONS

2.1 DUAL, CES-ONLY, CCS-ONLY

2.2 CCS, CES, ISO-CCS

3. CURRENT TERMINOLOGY SURVEY

3.1 CHARACTER

3.2 RFC 2130 CCS

3.3 CES

3.3.1 BASIC DEFINITION

3.3.2 EXTENDED DEFINITION

3.4 ISO-CCS

3.5 OTHER DEFINITIONS REDUCING TO CES DEFINITION

4. RECOMMENDED TERMINOLOGY USAGE

4.1 CES

4.2 CCS

4.3 CODED CHARACTER SET

4.4 CHARACTER SET

4.5 CHARSET

4.6 ENCODING

5. ADVOCATING NEW TERMINOLOGY

6. IANA CHARACTER SET REGISTRY

7. CLASSIFICATION EXAMPLE. NON-CJK STANDARDS

7.1 DUAL

7.2 CES-ONLY

7.3 CCS-ONLY

8. CLASSIFICATION EXAMPLE. CJK STANDARDS.

8.1 CES-ONLY

8.1.1 MULTIBYTE

8.1.2 SINGLE BYTE

8.2 DUAL

8.2.1 Big5

8.2.2 CP950

8.2.3 CP932

8.2.4 CP936, GBK

8.2.5 CP949

8.2.6 GB 18030-2000

8.2.7 ISO646-* SINGLE-BYTES

8.3 CONTROVERSIAL: DUAL OR CCS-ONLY?

8.4 CCS-ONLY: CNS 11643-1992

8.5 STANDARDS' NAMES REFERENCE

9. CALL FOR FEEDBACK AND CONTRIBUTION

9.1 COMMENTS MORE THAN WELCOME

9.2 EXTENDING THE LIMITS

9.3 DOUBTS AND GAPS

Appendix A. THANKS

Appendix B. UPGRADING CCS DEFINITION

Appendix C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS

Appendix D. RFC 2130 ON MIME CHARSET

Appendix E. REFERENCES
