"CHARACTER SET" TERMINOLOGY SURVEY, VERSION 0.5

Anton Tagunov
http://tagunov.tripod.com
tagunov@motor.ru

Moscow State University
Scientific Computer Research Center
http://srcc.msu.su

TABLE OF CONTENTS

1. INTRODUCTION
2. CLASSIFICATIONS
   2.1 DUAL, CES-ONLY, CCS-ONLY
   2.2 CCS, CES, ISO-CCS
3. CURRENT TERMINOLOGY SURVEY
   3.1 CHARACTER
   3.2 RFC 2130 CCS
   3.3 CES
       3.3.1 BASIC DEFINITION
       3.3.2 EXTENDED DEFINITION
   3.4 ISO-CCS
   3.5 OTHER DEFINITIONS REDUCING TO CES DEFINITION
4. RECOMMENDED TERMINOLOGY USAGE
   4.1 CES
   4.2 CCS
   4.3 CODED CHARACTER SET
   4.4 CHARACTER SET
   4.5 CHARSET
   4.6 ENCODING
5. ADVOCATING NEW TERMINOLOGY
6. IANA CHARACTER SET REGISTRY
7. CLASSIFICATION EXAMPLE. NON-CJK STANDARDS
   7.1 DUAL
   7.2 CES-ONLY
   7.3 CCS-ONLY
8. CLASSIFICATION EXAMPLE. CJK STANDARDS.
   8.1 CES-ONLY
       8.1.1 MULTIBYTE
       8.1.2 SINGLE BYTE
   8.2 DUAL
       8.2.1 Big5
       8.2.2 CP950
       8.2.3 CP932
       8.2.4 CP936, GBK
       8.2.5 CP949
       8.2.6 GB 18030-2000
       8.2.7 ISO646-* SINGLE-BYTES
   8.3 CONTROVERSIAL: DUAL OR CCS-ONLY?
   8.4 CCS-ONLY: CNS 11643-1992
   8.5 STANDARDS' NAMES REFERENCE
9. CALL FOR FEEDBACK AND CONTRIBUTION
   9.1 COMMENTS MORE THAN WELCOME
   9.2 EXTENDING THE LIMITS
   9.3 DOUBTS AND GAPS
Appendix A. THANKS
Appendix B. UPGRADING CCS DEFINITION
Appendix C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS
Appendix D. RFC 2130 ON MIME CHARSET
Appendix E. REFERENCES

1. INTRODUCTION

The binary representation of textual data is governed by character set
standards. Globally unique names (like KS_C_5601-1987) refer to
specifications. The goal of this survey is to classify named
specifications (JIS_C6220-1969-jp, JIS_C6220-1969-ro, JIS_X0208-1983,
etc.). To achieve this, the current character set terminology is
surveyed (3) and reduced to a three-category classification (2.2). The
categories of the (2.2) classification overlap, however, and section (5)
advocates adopting a new classification (2.1). The (2.1) classification
differs from (2.2) only in that its categories are disjoint.

Sections (6), (7), (8) apply this classification to a large body of
standards. Section (8) 'CLASSIFICATION EXAMPLE. CJK STANDARDS' has
unintentionally become a brief reference on CJK standards. Its
subsection (8.3) is notably the point from which the whole survey has
grown: it highlights a certain confusion surrounding the most basic CJK
standards. Please don't miss 8.3 :-)

Section (9) reflects the author's great desire to collect as much
feedback as possible (tagunov@motor.ru). Subsections (9.2), (9.3) also
express a desire to fill any gaps and discrepancies that readers are
likely to notice, and point out the areas where readers' help is
desirable. Let us consider this version 0.1 of the survey! ;-)

Section (4) discusses how [RFC 2130] undermines the industry's "coded
character set" terminology and what the best ways to get around that
are.

Appendixes B, C, D contain miscellaneous discussions excluded from the
main body of the survey.

Have a good reading! :-)

2. CLASSIFICATIONS

2.1 DUAL, CES-ONLY, CCS-ONLY

This is the classification proposed by this survey. It is derived from
the classification introduced in (2.2) by making the categories
disjoint. See section (5) for an explanation of why this survey finds
this classification the most fruitful and advocates it over the (2.2)
one.

   DUAL      specifications that define a CCS and a CES (using this CCS
             and possibly other CCS's defined elsewhere to define the
             CES)

   CES-ONLY  specifications that define a CES using CCS's defined
             elsewhere (and not specifying any CCS's of their own)

   CCS-ONLY  specifications that define a CCS (and no CES's)

This classification is complete and disjoint: every character set
specification fits into one of these categories, but none fits into
two. (See subsections (3.2), (3.3) for definitions of CES and CCS.)

2.2 CCS, CES, ISO-CCS

This is the classification derived in section (3) from definitions
found in ISO and RFC standards.
These standards operate with notions such as "character set", "coded
character set", "charset" and "character encoding scheme". The whole
diversity of definitions given to these terms reduces to 3 categories:

   CCS      Coded Character Set after [RFC 2130]
   CES      Character Encoding Scheme after [RFC 2130]
   ISO-CCS  Coded Character Set after ISO, cited by [RFC 1345]

This classification is complete but not disjoint. Here is how it
relates to the (2.1) classification:

   CES     = DUAL + CES-ONLY
   CCS     = DUAL + CCS-ONLY
   ISO-CCS = DUAL

3. CURRENT TERMINOLOGY SURVEY

3.1 CHARACTER

The term "character" raises little disagreement. Both [RFC 2278] and
[ECMA 35] define it as

   a member of a set of elements used for the organization, control, or
   representation of data.

This is close to the definition in [UNICODE] and in other standards.

3.2 RFC 2130 CCS

[RFC 2130], [RFC 2278] formulate:

   Coded Character Set (CCS) is a mapping from a set of abstract
   characters to a set of integers

To avoid ambiguity we shall refer to this definition as 'CCS', because
this abbreviation seems not to be used in any other meaning. Examples
(after these RFC's): ISO 10646, US-ASCII, ISO-8859-*. Please refer to
Appendix B for a remark on the CCS definition. Specifications that
match this definition are classified as CCS in (2.2).

3.3 CES

3.3.1 BASIC DEFINITION

[RFC 2130] and [RFC 2278] formulate:

   Character Encoding Scheme (CES) is a mapping from a Coded Character
   Set or several coded character sets to a set of octets.

3.3.2 EXTENDED DEFINITION

[RFC 2130] also contains another passage that we shall use as a CES
definition instead:

   A definition of a character encoding scheme consists of:
   - A description of an algorithm which transforms every possible
     sequence of octets to either a sequence of pairs or to the error
     state "illegal octet sequence"
   - Specifications, either by reference to CCS's registered by IANA or
     in text, of each CCS upon which this CES is based.

Please refer to Appendix C for a moderate discrepancy between these
definitions.
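
The extended definition can be made concrete with a small sketch. The Python below is not taken from any of the cited standards: it is a toy CES, loosely modelled on a two-escape subset of ISO-2022-JP, whose function name, exception name and CCS labels are all chosen here for illustration. It maps every octet sequence either to a list of (CCS, integer) pairs or to the error state.

```python
ESC = 0x1B


class IllegalOctetSequence(ValueError):
    """Stands for the error state "illegal octet sequence"."""


def decode_toy_ces(octets: bytes) -> list:
    """Map an octet sequence to a list of (CCS name, integer) pairs.

    Hypothetical toy CES: only ESC ( B (switch to US-ASCII) and
    ESC $ B (switch to JIS_X0208-1983) are recognised.
    """
    pairs = []
    ccs = "US-ASCII"  # assumed initial state
    i = 0
    while i < len(octets):
        if octets[i] == ESC:
            designator = octets[i + 1:i + 3]
            if designator == b"(B":
                ccs = "US-ASCII"
            elif designator == b"$B":
                ccs = "JIS_X0208-1983"
            else:
                raise IllegalOctetSequence(f"unknown escape at offset {i}")
            i += 3
        elif ccs == "US-ASCII":
            if octets[i] > 0x7F:
                raise IllegalOctetSequence(f"8-bit octet at offset {i}")
            pairs.append((ccs, octets[i]))
            i += 1
        else:
            # 94x94 CCS: two octets, each in 0x21..0x7E, glued into one integer
            cell = octets[i:i + 2]
            if len(cell) != 2 or not all(0x21 <= b <= 0x7E for b in cell):
                raise IllegalOctetSequence(f"bad double-byte code at offset {i}")
            pairs.append((ccs, (cell[0] << 8) | cell[1]))
            i += 2
    return pairs


plain = decode_toy_ces(b"\x1b$B\x50\x50")
redundant = decode_toy_ces(b"\x1b$B\x1b(B\x1b$B\x50\x50")
print(plain == redundant)  # True
```

The last two calls also illustrate the discrepancy that Appendix C discusses: two different octet sequences decode to the same sequence of pairs, so the algorithm is not a one-to-one mapping.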
Specifications that match these definitions are classified as CES in
(2.2). Following the latter definition we find two flavours of CES
specifications: those that use only CCS's defined elsewhere (CES-ONLY
in (2.1)), and those that contain "specifications .. in text of ... CCS
upon which this CES is based". It is reasonable to assume that every
specification defines no more than one CCS of its own. Hence the
definition of category DUAL in the (2.1) classification. The ISO-8859-*
specifications seem both to define their own 96-character CCS and to
reference another CCS (ASCII). This is what (2.1)'s DUAL description
means by "using this CCS and possibly other CCS's defined elsewhere."

3.4 ISO-CCS

[RFC 1345] cites the ISO definition of a "coded character set":

   The ISO definition of the term "coded character set" is as follows:
   "A set of unambiguous rules that establishes a character set and the
   one-to-one relationship between the characters of the set and their
   coded representation."

As you can see, a specification defines an ISO 'coded character set' if
and only if it defines both a CCS and a CES for it. This is exactly
category DUAL in (2.1). Hence category ISO-CCS in (2.2) is identical to
DUAL in (2.1).

3.5 OTHER DEFINITIONS REDUCING TO CES DEFINITION

[RFC 1345] also has its own definition of a 'coded character set':

   "A coded character set is a set of rules that unambiguously and
   completely determines which sequence of characters, if any, is
   represented by each possible sequence of ... bytes"

[RFC 2045] defines the term 'character set':

   The term "character set" is used in MIME to refer to a method of
   converting a sequence of octets into a sequence of characters ...
   from simple ... mappings such as US-ASCII to complex table switching
   methods such as those that use ISO 2022's techniques.

Even the newest [RFC 2278], which gives a CES definition, parallels it
with:

   The term "charset" ...
   is used here to refer to a method of converting a sequence of octets
   into a sequence of characters ...

All these definitions are essentially equivalent to the CES definition
(3.3).

4. RECOMMENDED TERMINOLOGY USAGE

The introduction of CES/CCS terminology by [RFC 2130] helps a lot in
understanding the character set world. But it has also caused a certain
ambiguity: 'Coded Character Set' has been used for years in the ISO
meaning of DUAL (2.1), and the introduction of a CCS with a different
meaning forces a terminology revision. This section evaluates old and
new terms one by one.

4.1 CES

Totally unambiguous and fine.

4.2 CCS

Certainly, using the 3-letter abbreviation 'CCS' in a meaning different
from the full form 'coded character set' is artificial. However this is
probably the best we can do now.

4.3 CODED CHARACTER SET

The full term 'coded character set' is currently caught in a conflict
between the older ISO definition and the newer one, equivalent to CCS.
Probably every speaker who chooses to use it should first clarify which
meaning he/she adheres to.

4.4 CHARACTER SET

The term 'character set' is quite ambiguous. It may be understood as

   - an abbreviation of 'coded character set' (in the ISO meaning), an
     equivalent of CES
   - an abbreviation of 'coded character set' (in the [RFC 2130]
     meaning), an equivalent of CCS
   - an expansion of the 'charset' parameter of the Content-Type MIME
     header, also probably an equivalent of CES (4.5)

This survey's point of view is that it is best to use this term only
inside the phrase 'character set standards', meaning the whole body of
heterogeneous standards.

4.5 CHARSET

Many standards (including MIME and HTTP ones) treat 'charset' as a
shorthand for 'character set' or 'coded character set'. Both of these
have become ambiguous. But over the years 'charset' has been used to
mean "what you may pass as the 'charset' parameter of the Content-Type
MIME or HTTP header". In this meaning 'charset' is equivalent to CES.
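
As an illustration of 'charset' acting as a CES name, the sketch below uses Python's standard email package to read the charset parameter of a Content-Type value and then uses that name alone to pick the octets-to-characters conversion. The helper name decode_body and the fallback to us-ascii are assumptions made for this example, not something the cited RFCs prescribe in this form.

```python
from email.message import Message


def decode_body(content_type: str, body: bytes) -> str:
    """Decode body using only the charset parameter of a Content-Type value.

    Hypothetical helper: falls back to us-ascii when no charset is given.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased name, or None
    if charset is None:
        charset = "us-ascii"
    return body.decode(charset)


octets = "Привет".encode("koi8_r")
print(decode_body("text/plain; charset=KOI8-R", octets))
print(decode_body("text/plain; charset=windows-1251", octets))
```

The same octets yield different character sequences under the two charset values, which is exactly the sense in which the parameter names a method of converting octets into characters.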

Recognising this common usage, [RFC 2278] has proposed to detach
'charset' from any other terms and use it on its own, effectively as a
synonym for CES. [RFC 2278]:

   The term "charset" (see historical note below) is used here to refer
   to a method of converting a sequence of octets into a sequence of
   characters.

A related discussion on the MIME charset parameter and [RFC 2130] may
also be found in Appendix D. The author of this article will probably
continue to use this term in exactly the same meaning as before.

4.6 ENCODING

A very general term and probably fine as it is. The terminology
revision has not touched it. It may mean different things depending on
context. In certain circumstances it may be equivalent to CES, in
others to a CES combined with a 'Transfer Encoding Syntax' [RFC 2130].
And it is not necessarily related to the binary representation of
textual data.

5. ADVOCATING NEW TERMINOLOGY

This section will show that the current [RFC 2130] CES and CCS
definitions may not be narrow enough for certain purposes. If a
specification is classified as a CES, it is not clear whether that
specification also defines a CCS. If a specification is classified as a
CCS, it is not clear whether that specification also defines a CES. A
CES specification is explicitly allowed to include or not include a
"private" CCS specification (3.3.2). This is perfectly okay per se, but
this survey calls for establishing different family names for these two
categories of CES specifications. This survey's opinion is that
adopting terminology based on the DUAL, CES-ONLY, CCS-ONLY
classification (2.1) would allow a better taxonomy of character set
standards to be established.

6. IANA CHARACTER SET REGISTRY

As an epigraph, [RFC 2130]:

   ... the MIME registry of character sets ... contains items that may
   differ greatly in their applicability and semantics in various
   Internet protocols.

[RFC 2278] says:

   MIME ... and various other modern Internet protocols are capable of
   using many different charsets.
   This registration procedure exists ... to associate a specific name
   or names with a given charset and to give an indication of whether
   or not a given charset can be used in MIME text objects.

[IANA REG] is the registry established by this RFC. [IANA REG] contains
records both for DUAL (2.1) and CES-ONLY (2.1) standards. [IANA REG]
does not contain records for CCS-ONLY (2.1) standards. The probable
reason is that the ISO "coded character set" definition (3.4) does not
allow creating a CCS-ONLY (2.1) standard, only a DUAL (2.1) one. This
explains why many of the standards (including those registered at
[IANA REG]) are DUAL (2.1) only formally, and function as CCS-ONLY
(2.1). Naturally, the entries labeled as MIME applicable (those that
have a "preferred MIME name") are full-fledged CES's and are either
CES-ONLY's or DUAL's both de facto and de jure. The remark above
applies to a subset of those entries in the registry that do not have a
"preferred MIME name".

The following sections (7) and (8) classify a number of standards,
including those registered in [IANA REG]. Section (8.3) is specially
devoted to standards that are DUAL by their form but CCS-ONLY by their
nature.

7. CLASSIFICATION EXAMPLE. NON-CJK STANDARDS

This section classifies some of the non-CJK and Unicode-related
standards against the DUAL, CCS-ONLY, CES-ONLY classification (2.1).
Standards are referred to by their [IANA REG] preferred MIME name if
they have one.

7.1 DUAL

   US-ASCII, ISO-8859-*, KOI8-R

7.2 CES-ONLY

   UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4

7.3 CCS-ONLY

   Unicode 3.2

8. CLASSIFICATION EXAMPLE. CJK STANDARDS.

Classification of CJK standards used to be a great source of confusion,
at least for the author of this survey (and was the original incentive
to write this survey).

Note: this section references character specifications by [IANA REG]
names if such names exist. For completeness (8.5) relates [IANA REG]
names back to the names of the official standards that contain the
specifications.

8.1 CES-ONLY

8.1.1 MULTIBYTE

Specifications like

   ISO-2022-KR (reportedly unused)
   EUC-KR
   ISO-2022-JP
   ISO-2022-JP-1
   ISO-2022-JP-2
   Shift_JIS (as defined by Appendix 1 of JIS X 0208:1997)
   EUC-JP
   GB2312 (8-bit EUC-style CES)
   EUC-TW
   ISO-2022-CN-EXT

easily classify as CES-ONLY (2.1).

8.1.2 SINGLE BYTE

JIS_X0201 [RFC 1345] is an 8-bit composition of two 7-bit DUAL's (2.1):

   JIS_C6220-1969-ro (ISO646-JP)
   JIS_C6220-1969-jp (katakana)

8.2 DUAL

8.2.1 Big5

This standard defines its own CCS and a CES, hence it is classified as
DUAL.

8.2.2 CP950

Microsoft extension to Big5. Erroneously referred to as Big5 by
Microsoft products. Adds characters to Big5, thus defining its own CCS.

8.2.3 CP932

Microsoft extension to Shift_JIS. Erroneously(?) referred to as
Shift_JIS by Microsoft products. The official Shift_JIS includes only
the JIS X 0201 and JIS X 0208 repertoire, while Microsoft has always
used Shift_JIS to encode a wider repertoire. As a historical
predecessor, Microsoft's variant probably has a stronger claim to the
name, although it may be objected that Microsoft shouldn't have used
JIS as part of the name in the first place.

8.2.4 CP936, GBK

Microsoft extension to GB2312. Erroneously referred to as GB2312 by
Microsoft products. Backward compatible with GB2312. Extends GB2312
with unified Han characters from ISO 10646-1:1993 not already present
in GB_2312-80. New characters are inserted into unused positions in the
EUC-CN natural 0x8181 - 0xFEFE range and into the newly allocated
0x8140 - 0xFE7E range.

8.2.5 CP949

Microsoft extension to EUC-KR. Also known as Unified Hangul Code, UHC.
Erroneously referred to as KS_C_5601-1987 by Microsoft products. Adds
8822 pre-combined Hangul syllables to EUC-KR. Uses the same extension
technique as GBK.

8.2.6 GB 18030-2000

Chinese national standard. Extends GBK. Introduces 4-byte codes for
characters. Provides space for all assigned and unassigned Unicode 3.2
(BMP plus 16 extension planes) code points. See also [DUERST].
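
The layering GB2312 -> GBK -> GB 18030 can be observed with the codecs shipped with Python. This is only a sketch: it relies on Python's "gb2312", "gbk" and "gb18030" codec names, and the helper encoded_lengths is invented for this example.

```python
def encoded_lengths(text: str, codec: str = "gb18030") -> list:
    """Return (code point, number of octets) for each character of text."""
    return [(ord(ch), len(ch.encode(codec))) for ch in text]


# An ASCII letter, a Han character present in GB_2312-80,
# and a character from a supplementary (non-BMP) plane.
sample = "A\u4e2d\U00020000"
for code_point, n in encoded_lengths(sample):
    print(f"U+{code_point:04X} -> {n} octet(s)")

# A character already present in GB_2312-80 keeps its two-octet code
# in all three encodings.
same = "\u4e2d".encode("gb2312") == "\u4e2d".encode("gbk") == "\u4e2d".encode("gb18030")
print(same)  # True
```

The supplementary-plane character has no code in GBK at all and takes four octets in GB 18030, which is where the "4-byte codes" of this section come in.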

8.2.7 ISO646-* SINGLE-BYTES

The single-byte

   JIS_C6220-1969-ro alias ISO646-JP
   GB_1988-80        alias ISO646-CN
   KSC5636           alias ISO646-KR

all serve as "raw material" for the CES-ONLY's listed in (8.1) by
defining 94-character CCS's, and they also define ASCII-like 7-bit
CES's. The difference from ASCII is the Yen, Yuan and Won symbols
replacing "$" (0x24) or "\" (0x5C), and other minor changes, as is
natural for the ISO646-* family.

8.3 CONTROVERSIAL: DUAL OR CCS-ONLY?

The following standards cited by [RFC 1345] (and registered by
[IANA REG])

   JIS_C6220-1969-jp  alias JIS_C6220-1969 alias iso-ir-13
   JIS_C6226-1983     alias JIS_X0208-1983 alias iso-ir-87
   JIS_X0212-1990     alias iso-ir-159
   GB_2312-80         alias iso-ir-58
   KS_C_5601-1987     alias KS_C_5601-1989 alias iso-ir-149
                      alias KSC_5601

are in a peculiar position: the ISO "coded character set" definition
(all these are ISO standards, as the iso-ir-xyz alias shows) does not
allow creating a CCS-ONLY standard (2.1), only a DUAL one (2.1). So
each of these standards additionally defines a 7-bit CES. We shall call
this CES "raw" or "implied". The natural ("cooked") role of these
standards is to help build the CES's listed in (8.1). (Hence we will
call this CES "raw".) Usage of the "raw" CES is so rare that we may
suspect it was not even the intention of the standards' creators.
(Hence we will also call this CES "implied".)

For example, JIS_C6220-1969-jp's "raw" CES is a 7-bit CES that encodes

   - regular (ISO 6429) control characters at 0x00 - 0x1F
   - SPACE and DELETE at 0x20, 0x7F
   - Katakana at 0x21 - 0x7E.

It is not clear whether this CES is used at all, but it is probably not
(see also section (9)).

JIS_C6226-1983 (aka JIS_X0208-1983), GB_2312-80 and KS_C_5601-1987 have
even less usable "raw" 7-bit CES's. These are double-byte 7-bit CES's
that have neither control characters nor SPACE and DELETE. Naturally,
CES's that do not encode CR, LF and SPACE are of only limited use. They
are known, however, to be used with fonts in the X Window System.
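
The relation between a "raw" 7-bit double-byte code and the "cooked" CES's of (8.1) can be sketched in Python. The sketch assumes the usual EUC convention that a 94x94 code is carried with the high bit of both octets set, and it relies on Python's "euc_jp" and "iso2022_jp" codecs; the helper name raw_7bit_from_euc is invented here.

```python
def raw_7bit_from_euc(ch: str, euc_codec: str) -> bytes:
    """Recover the "raw" 7-bit double-byte code of ch from its EUC form.

    Assumes ch belongs to the 94x94 set that the EUC codec carries as
    two octets in 0xA1..0xFE; anything else is rejected.
    """
    euc = ch.encode(euc_codec)
    if len(euc) != 2 or any(b < 0xA1 for b in euc):
        raise ValueError("not a two-octet 94x94 code in this EUC")
    return bytes(b & 0x7F for b in euc)  # clear the high bit of each octet


raw = raw_7bit_from_euc("\u4e9c", "euc_jp")  # U+4E9C, JIS X 0208 row 16, cell 1
print(raw.hex())  # 3021

# ISO-2022-JP carries the same two octets between escape sequences.
print("\u4e9c".encode("iso2022_jp") == b"\x1b$B" + raw + b"\x1b(B")  # True
```

Both CES's wrap the same pair of 7-bit values; the "raw" CES is what remains when the wrapping is removed, which also shows why it has no room for SPACE or control characters.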

Other 94x94 standards should be in the same position of implicitly
defining mostly useless multibyte 7-bit CES's without SPACE, BACKSPACE,
CR, LF and other control characters. These probably include

   ISO-IR-165
   GB/T 12345-90

8.4 CCS-ONLY: CNS 11643-1992

CNS 11643-1992 defines 16 94x94 planes. The first two planes have
almost the same set of characters as Big5, but at different code
points. It is used only inside EUC-TW. It does not define an implicit
"raw" CES like those described in 8.3, because only its individual
planes, not the standard as a whole, might be vulnerable to this
"implicit" creation. See also (9.3).

8.5 STANDARDS' NAMES REFERENCE

This material probably does not belong here, but since section (8) has
already become a listing of CJK standards, for completeness here is a
reference from [IANA REG] names to original standard names. (It is not
clear why this information is missing from [IANA REG].) Multiple
standard names are given if a standard has several names or if several
versions of a standard are identical in their definition of the given
CES. [IANA REG] aliases are marked with '='.

94 DUALs

   [IANA REG]                                  original name(s)

   JIS_C6220-1969-ro = ISO646-JP,              JIS C 6220-1969
   JIS_C6220-1969-jp = katakana
                     = JIS_C6220-1969
   GB_1988-80 = ISO646-CN                      GB 1988-80
   KSC5636 = ISO646-KR                         KS C 5636-1993
                                               KS C 5636-1989

94x94 DUALs

   JIS_C6226-1983 = JIS_X0208-1983             JIS C 6226-1983
                                               JIS X 0208:1983
   JIS_X0212-1990                              JIS X 0212:1990
   GB_2312-80                                  GB 2312-80
   KS_C_5601-1987 = KS_C_5601-1989             KS C 5601-1987
                  = KSC_5601                   KS C 5601-1992
                                               KS X 1001:1997

Extensions/revisions of the mentioned 94x94 DUAL's (2.1) not registered
at [IANA REG]:

   JIS X 0208:1990
   JIS X 0208:1997
   KS X 1001:1998 (euro and one other char added)

9. CALL FOR FEEDBACK AND CONTRIBUTION

9.1 COMMENTS MORE THAN WELCOME

The author of this survey would very much like to receive as much
feedback on this article as possible. Please send me all kinds of
comments on this survey: your opinion on its topicality, all factual
mistakes, any statements that you find controversial!
Your corrections will be incorporated into this document as soon as
possible or, if the author considers them arguable, will probably form
'Appendix F. SPECIAL OPINIONS'.

9.2 EXTENDING THE LIMITS

The author is limited to surveying documents freely available online.
Specifically, he has no access to ISO standards or their analogs,
except for [ECMA 35]. The main sources of information on CJK have been
[RFC 1345], [CJK.INF] and the most helpful replies from Autrijus Tang
and Jungshik Shin on perl-unicode@perl.org. Therefore contributions
from readers who have access to documents that the author of this
survey does not have may significantly improve it.

Issue #1. Section (3.4) 'ISO-CCS' would benefit from a direct reference
to some ISO standard that gives the ISO definition of "coded character
set".

Issue #2. Section (3.5) 'OTHER DEFINITIONS REDUCING TO CES DEFINITION'
would benefit from listing more definitions of "character set" and
related entities.

Issue #3. If for any standard (especially a CJK one) its most official
name (given by its original registration body) is not listed or is
misspelled (down to ':' vs '-' differences and a wrong number of
spaces), it would be highly appropriate to correct this. Please do tell
me.

Issue #4. The author of this document will highly welcome the inclusion
of other CJK standards into the classification, section (8). Also, if
you feel there are strong reasons for including non-CJK standards in
section (7), please do tell me.

9.3 DOUBTS AND GAPS

Issue #5. Is the "raw" 7-bit "implied" CES for JIS_C6220-1969-jp used
for any purpose? Has section (8.3) been wrong in saying it isn't?

Issue #6. How correct is section (8.4), which lists CNS 11643-1992 as
the only CCS-ONLY standard in the CJK world?

Issue #7. What CES's are used with CCCII? Does the CCCII standard
specify a CES? If yes, what is it called? What is the full name of the
standard?

Issue #8. What CES's are used for ANSI Z39.64-1989? Does the ANSI
Z39.64-1989 standard specify a CES?
If yes, what is it called?

Help will be highly welcome!

APPENDIX A. THANKS

Thanks to

   Autrijus Tang, Jungshik Shin and other posters of
   perl-unicode@perl.org for detailed discussions of CJK standards!

   Ken Lunde for his super-informative [CJK.INF]

And special thanks to Dan Kogai for developing and maintaining the Perl
Encode module, which has put me in touch with the character encoding
issues.

To be continued :-)

APPENDIX B. UPGRADING CCS DEFINITION

It may be worthwhile to understand the CCS definition in a special way:

   CCS is a mapping from a set of abstract characters to a set of
   integers, a set of integer pairs or a set of integer triplets

Pairs naturally arise from the row-column codes of the tables used to
present character glyphs, and triplets from arrays of tables. These
multi-dimensional indexes easily map to integers. But this is often
done differently: for 94-character CCS's we regularly use

   - hexadecimal notation: 0x41

Taking Ken Lunde's [CJK.INF] as an example of a document discussing
94x94 CCS's, we see two different notations:

   - decimal notation, items dash-separated, each counted from 1: 06-85
   - hexadecimal notation, items glued together, each counted from
     0x21: 0x6161

Similar variations should be possible with 94x94x94 CCS's.

APPENDIX C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS

C1. [RFC 2130], [RFC 2278]:

   Character Encoding Scheme is a mapping from a Coded Character Set
   (or several) to a set of octets.

C2. [RFC 2130]:

   A definition of a character encoding scheme consists of:
   - A description of an algorithm which transforms every possible
     sequence of octets to either a sequence of pairs or to the error
     state "illegal octet sequence"
   - Specifications, either by reference to CCS's registered by IANA or
     in text, of each CCS upon which this CES is based.

C3. [RFC 2278]:

   The term "charset" ... is used here to refer to a method of
   converting a sequence of octets into a sequence of characters...
   unconditional and unambiguous conversion in the other direction is
   not required

A notable difference is that C2 allows several octet sequences to map
to a single sequence of pairs while C1 does not. We may of course say
that C1 is dominating and outlaw multiple octet sequences, but then a
"charset" according to the C3 definition is not automatically a CES,
which breaks our neat classification. So far the author prefers to
silently revise the C1 definition (and effectively make C2 dominating).

In practice this issue is not that important, because CES's try to
avoid associating several octet sequences with the same sequence of
pairs. UTF-8 prescribes using the shortest possible byte sequence to
represent every Unicode code point, and calls every other
representation "malformed". ISO-2022-* family members do not use ISO
2022's awesome power to its full extent and thus rule out most possible
multiplicities.

Here is an example of a multiplicity that occurs nevertheless. All the
following sequences of octets produce the same sequence of pairs in
ISO-2022-JP:

   ESC $ B 0x50 0x50
   ESC $ B ESC $ B 0x50 0x50
   ESC $ B ESC ( B ESC $ B 0x50 0x50
   ...

Here ESC $ B means that the following octets should be interpreted as
pairs coding characters in the JIS X 0208-1983 coded character set.
ESC ( B denotes that the following octets should be interpreted as the
ASCII charset. The point is that redundant escape sequences may be
added quite freely. Of course it is easy to establish a normalization
transformation that will remove redundant escape sequences, but
ISO-2022-JP does not forbid them. Hence the C1 definition should
probably be silently dropped in favour of C2.

APPENDIX D. RFC 2130 ON MIME CHARSET

It was [RFC 2130] that introduced the CES (3.3) definition. It is all
the more amusing to see that [RFC 2130] itself does not use it to full
power and goes tautological:

   ... in MIME, the Coded Character Set and Character Encoding Scheme
   are specified by the Charset parameter to the Content-Type header
   field ...

Every CES is already associated with a set of CCS's (3.3). Rigorously,
it would be enough to say:

   ... in MIME, the Character Encoding Scheme is specified by the
   Charset parameter to the Content-Type header field ...

or more verbosely

   "charset", as defined by this document and as specified by the
   Charset parameter to the Content-Type header field, is a synonym for
   a CES. As such the Charset parameter to the Content-Type header
   completely defines how to map the result of de-applying the Transfer
   Encoding Syntax to the binary representation of the message body to
   a sequence of pairs.

Of course both variants are much less intuitive than the original RFC's
text.

APPENDIX E. REFERENCES

[RFC 2278]
   IANA Charset Registration Procedures.
   N. Freed, J. Postel. January 1998.
   http://www.ietf.org/rfc/rfc2278.txt

[RFC 2130]
   The Report of the IAB Character Set Workshop held 29 February -
   1 March, 1996.
   C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson,
   M. Crispin, P. Svanberg. April 1997.
   http://www.ietf.org/rfc/rfc2130.txt

[RFC 2045]
   Multipurpose Internet Mail Extensions (MIME) Part One: Format of
   Internet Message Bodies.
   N. Freed, N. Borenstein. November 1996.
   http://www.ietf.org/rfc/rfc2045.txt

[RFC 1345]
   Character Mnemonics and Character Sets.
   K. Simonsen. June 1992.
   http://www.ietf.org/rfc/rfc1345.txt

[ECMA 35]
   Character Code Structure and Extension Techniques.
   Standard ECMA-35, 6th Edition. December 1994.
   http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM
   (This is a freely accessible analog of ISO 2022)

[UNICODE]
   The Online Edition of The Unicode Standard, Version 3.0.
   http://www.unicode.org/unicode/uni2book/u2.html

[UNICODE CHAPTER 3]
   The Online Edition of The Unicode Standard, Version 3.0.
   Chapter 3. Conformance.
   http://www.unicode.org/unicode/uni2book/ch03.pdf

[IANA REG]
   The Character Sets Registry
   (IANA registers charset values according to RFC 2278)
   http://www.iana.org/assignments/character-sets

[CJK.INF]
   CJK.INF Version 2.1. Online Companion to "Understanding Japanese
   Information Processing".
   Ken Lunde. July 12, 1996.
   http://www.oreilly.com/people/authors/lunde/cjk_inf.html

[DUERST]
   RE: modification to registration of charset ks_c_5601-1987
   Martin Duerst. Jun 13 2001.
   Message in ietf-charsets archive
   http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

[Connoly]
   Character Set Considered Harmful.
   INTERNET-DRAFT. May 2, 1995.
   http://www.w3.org/MarkUp/html-spec/charset-harmful.html

[Lee]
   RE: codes:chars is many:one?
   Message in www-archive@w3.org
   Liam Quin. Jan 30 2002.
   http://lists.w3.org/Archives/Public/www-archive/2002Jan/0152.html