"CHARACTER SET" TERMINOLOGY SURVEY

VERSION 0.96

Anton Tagunov
http://tagunov.tripod.com
tagunov@motor.ru
Moscow State University Scientific Computer Research Center
http://srcc.msu.su

TABLE OF CONTENTS

1. INTRODUCTION
2. BASIS
  2.1 CCS
  2.2 CES
  2.3 CCS != "Coded Character Set"
3. OTHER TERMINOLOGY
  3.1 ISO Coded Character Set
  3.2 "charset" != "character set"
  3.3 ASN.1
  3.4 Other definitions
4. CLASSIFICATION
  4.1 CCS-CES
  4.2 pure CES
  4.3 pure CCS
  4.4 pure CCS?
5. CALL FOR FEEDBACK
Appendix A. THANKS
Appendix B. UPGRADING the CCS DEFINITION
Appendix C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS
Appendix D. ISO vs CCS-CES
Appendix E. "character set" MEANINGS
Appendix F. REFERENCES

1. INTRODUCTION

"The Report of the IAB Character Set Workshop ... 1996" [RFC 2130] introduced the terms

- CCS (2.1) and
- CES (2.2).

Building on these definitions (2), this survey

- reviews character set related terminology in general (3),
- practices classifying a number of character set standards (4),
- comments on the "raw" encodings inadvertently planted by the ISO 2022 framework (Appendix D and Companion 1.3, cjk.html#A1.3).

2. BASIS

The basis, the system of coordinates adopted by this survey, consists of two [RFC 2130] definitions: CCS and CES.

2.1 CCS

[RFC 2130]:

    Coded Character Set (CCS) is a mapping from a set of abstract characters to a set of integers.

See also Appendix B.

2.2 CES

[RFC 2130]:

    Character Encoding Scheme (CES) is a mapping from a Coded Character Set or several coded character sets to a set of octets.

[RFC 2130]:

    A definition of a character encoding scheme consists of:

    - A description of an algorithm which transforms every possible sequence of octets to either a sequence of pairs or to the error state "illegal octet sequence"
    - Specifications, either by reference to CCS's registered by IANA or in text, of each CCS upon which this CES is based.

'Encoding' is used synonymously with 'CES'.

See also Appendix C.
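As a rough illustration of the two definitions above, the following Python sketch treats Unicode as the CCS and three of Python's bundled codecs as CES's. The sample character, the codec names and the helper decode_or_error are assumptions made for this example only; none of them comes from [RFC 2130]. Note that the "iso-8859-1" codec is really a CES over the ISO 8859-1 CCS, whose integers happen to coincide with the first 256 Unicode code points.

```python
# Sketch only: ord() stands in for a CCS lookup (Unicode is the CCS here)
# and Python's built-in codecs stand in for CES's.

ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, an abstract character

# CCS: abstract character -> integer
code_point = ord(ch)

# CES: integer(s) -> octets; several CES's can encode the same integer
octets = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "iso-8859-1")}


def decode_or_error(data: bytes, ces: str) -> str:
    """Hypothetical helper: map octets back to characters, or report
    the error state named in the CES definition."""
    try:
        return data.decode(ces)
    except UnicodeDecodeError:
        return "illegal octet sequence"


print(hex(code_point))
for name, data in octets.items():
    print(name, data.hex(), decode_or_error(data, name) == ch)
print(decode_or_error(b"\xff", "utf-8"))
```

The same integer comes out as one, two or two different octets depending on the CES, while a lone 0xFF octet is rejected by the UTF-8 decoder.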

2.3 CCS != "Coded Character Set"

The 'CCS' acronym is used differently from its expanded form:

- 'CCS' seems to be used solely to mean (2.1)
- 'Coded Character Set' may also be used in the ISO meaning (3.1)

3. OTHER TERMINOLOGY

3.1 ISO Coded Character Set

[ISO 8859-14], [ECMA 35] define:

    'coded character set; code': A set of unambiguous rules that establishes a character set and the one-to-one relationship between the characters of the set and their bit combinations.

Equivalent to CCS (2.1) + CES (2.2).

See also Appendix D.

3.2 "charset" != "character set"

This section discusses the MIME usage of the 'charset' term.

[RFC 2277]:

    This document uses the term "charset" to mean a set of rules for mapping from a sequence of octets to a sequence of characters, such as the combination of a coded character set and a character encoding scheme; this is also what is used as an identifier in MIME "charset=" parameters, and registered in the IANA charset registry ... (Note that this is NOT a term used by other standards bodies, such as ISO).

Equivalent to CES (2.2), that is, to 'encoding'.

'charset' has been _divorced_ from 'character set' to leave room for CCS (2.1).

3.3 ASN.1

This [ISO 8859-14] excerpt shows that ASN.1 defines analogs of CCS and CES:

    7.2 Identification according to ISO/IEC 8824-1 (ASN.1)

    In the terminology of ISO/IEC 8824-1 the character set of this part of ISO/IEC 8859 and the corresponding coded representation are distinct, and are known as the "character abstract syntax" and the "character transfer syntax" respectively

3.4 Other definitions

[UNICODE GLOSSARY]:

- 'Character Set': A collection of elements used to represent textual information.
- 'Character Repertoire': The collection of characters included in a character set.
- 'Coded Character Set': A character set in which each character is assigned a numeric code point.

It is not clear whether 'Character Set' is equivalent here to 'Character Repertoire' or to 'Coded Character Set' or, if it is distinct from both, what the difference is.
Appendix E summarizes the ambiguities we have discussed so far.

4. CLASSIFICATION

This section classifies a number of character set standards (Appendix E) into pure-CCS, pure-CES and CCS-CES categories. Standards are referred to by their [IANA REG] preferred MIME name where applicable. [COMPANION] goes into deeper detail on CJK standards and explains the reasons behind the proposed classification.

4.1 CCS-CES

ISO-8859-* standards are dual: they specify both a CCS and a CES for it:

    ISO/IEC 8859 consists of several parts. Each part specifies a set of up to 191 graphic characters and the coded representation of these characters by means of a single 8-bit byte
    [ISO 8859-14]

This duality is both a result of the ISO Coded Character Set definition (3.1) and a natural property of such straightforward schemes as US-ASCII and ISO-8859-1 which consist of a small set of characters and a simple one-to-one mapping from single octets to single characters (from [RFC 2278]).

These include

    US-ASCII
    ISO-8859-*
    KOI8-*
    JIS_C6220-1969-ro [RFC 1345]
    GB_1988-80 [RFC 1345]
    KSC5636 [RFC 1345]

and others. See also (Companion 1.2, cjk.html#A1.2).

4.2 pure CES

The following [IANA REG] registered MIME charsets (3.2) are classified by this survey as pure CES's:

    UTF-8
    UTF-16
    UTF-16BE
    UTF-16LE
    ISO-10646-UCS-2
    ISO-10646-UCS-4
    ISO-2022-KR
    EUC-KR
    ISO-2022-JP
    ISO-2022-JP-1
    ISO-2022-JP-2
    Shift_JIS
    Windows-31J
    EUC-JP
    GB2312 (aka EUC-CN, CN-GB)
    EUC-TW
    ISO-2022-CN-EXT

See also Appendix D and (Companion 1.1, cjk.html#A1.1).

4.3 pure CCS

The only non-CJK pure CCS known to the author is Unicode 3.2.

4.4 pure CCS?

[COMPANION] (Companion 1.3, cjk.html#A1.3) considers the following standards CCS's by nature, but CCS-CES's due to the _"raw"_ effect described in Appendix D:

    JIS_C6220-1969-jp alias JIS_C6220-1969
    JIS_C6226-1983 alias JIS_X0208-1983
    JIS_X0212-1990
    GB_2312-80
    KS_C_5601-1987

See [RFC 1345]. Refer to [COMPANION] for details.

5.
CALL FOR FEEDBACK

The author of this survey would very much like to receive as much feedback on this article as possible. Please send me all kinds of comments on this survey: your opinion on its topicality, any factual mistakes, any statements that you find controversial!

Sincerely yours, "Anton Tagunov" :-)

APPENDIX A. THANKS

Thanks to

- Autrijus Tang, Jungshik Shin and other posters of perl-unicode@perl.org for detailed discussions of character set standards,
- Ken Lunde for his super-informative [CJK.INF],
- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK, for supplying the [ISO 8859-14] link.

Special thanks to Dan Kogai for developing and maintaining the Perl Encode module and for putting me on the track of character encoding issues!

To be continued :-)

APPENDIX B. UPGRADING the CCS DEFINITION

It may be worthwhile to read the CCS definition in an extended way:

    CCS is a mapping from a set of abstract characters to a set of integers, a set of integer pairs or a set of integer triplets

Pairs naturally arise from the row-column codes of the tables used to present character glyphs, and triplets from arrays of such tables.

These multi-dimensional indexes easily map to integers. But this is often done differently: for 94-character CCS's we regularly use

- hexadecimal notation: 0x41

Taking Ken Lunde's [CJK.INF] as an example of a document discussing 94x94 CCS's, we see two different notations:

- decimal notation, items dash-separated, each counted from 1: 06-85
- hexadecimal notation, items glued together, each counted from 0x21: 0x6161

Similar variations should be possible with 94x94x94 CCS's.

If we upgrade the CCS definition this way, it will also be necessary to note that the fact that a certain CCS standard enumerates characters in paired hex notation (0x6161) should not by itself mean that this standard implies a _"raw"_ CES (see Appendix D). To mark a certain standard as CCS-CES it should be necessary to explicitly define both the CCS's and the CES's in accordance with (2.2).
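
The relation between the two 94x94 notations, and the claim that pairs map easily to integers, can be sketched in Python. The function names below are hypothetical and are not taken from [CJK.INF]; the arithmetic only assumes what is stated above, namely that each item of the decimal notation counts from 1 and each item of the hexadecimal notation counts from 0x21.

```python
from typing import Tuple

# Sketch only: hypothetical helpers for a 94x94 CCS.


def rowcell_to_hex(row: int, cell: int) -> int:
    """Decimal pair such as 06-85 -> glued hex form such as 0x2675."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    return ((row + 0x20) << 8) | (cell + 0x20)


def hex_to_rowcell(code: int) -> Tuple[int, int]:
    """Glued hex form such as 0x6161 -> decimal pair (row, cell)."""
    row, cell = (code >> 8) - 0x20, (code & 0xFF) - 0x20
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("each item must be in 0x21..0x7E")
    return row, cell


def rowcell_to_index(row: int, cell: int) -> int:
    """One possible mapping of a pair to a single integer, 0..94*94-1."""
    return (row - 1) * 94 + (cell - 1)


print(hex(rowcell_to_hex(6, 85)))   # 06-85 written in the hex notation
print(hex_to_rowcell(0x6161))       # 0x6161 written as a decimal pair
print(rowcell_to_index(94, 94))     # last position of the table
```

Nothing in these conversions involves octets: they only rename positions inside the CCS, which is why a hex pair alone does not amount to a CES.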

APPENDIX C. A SLIGHT DISCREPANCY BETWEEN CES DEFINITIONS

C1. [RFC 2130], [RFC 2278]:

    Character Encoding Scheme is a mapping from a Coded Character Set (or several) to a set of octets.

C2. [RFC 2130]:

    A definition of a character encoding scheme consists of:

    - A description of an algorithm which transforms every possible sequence of octets to either a sequence of pairs or to the error state "illegal octet sequence"
    - Specifications, either by reference to CCS's registered by IANA or in text, of each CCS upon which this CES is based.

C3. [RFC 2278]:

    The term "charset" ... is used here to refer to a method of converting a sequence of octets into a sequence of characters... unconditional and unambiguous conversion in the other direction is not required

A notable difference is that C2 allows several octet sequences to map to a single sequence of pairs while C1 does not.

We may of course say that C1 is dominating and outlaw multiple octet sequences, but then a "charset" according to the C3 definition is not automatically a CES, which breaks our neat classification. So far the author prefers to silently reverse the C1 definition (and effectively make C2 dominating).

In practice this issue is not that important, because CES's try to avoid associating several octet sequences with the same sequence of pairs. UTF-8 prescribes using the shortest possible byte sequence to represent every Unicode code point and calls every other representation "malformed". ISO-2022-* family members do not use ISO 2022's awesome power to its full extent and thus rule out most possible multiplicities.

Here is an example of a multiplicity that does occur, however. All the following sequences of octets produce the same sequence of pairs in ISO-2022-JP:

    ESC $ B 0x50 0x50
    ESC $ B ESC $ B 0x50 0x50
    ESC $ B ESC ( B ESC $ B 0x50 0x50
    ...

Here ESC $ B means that the following octets should be interpreted as pairs coding characters in the JIS X 0208-1983 coded character set.
ESC ( B denotes that the following octets should be interpreted as ASCII.

The point is that redundant escape sequences may be added quite freely. Of course it is easy to establish a normalization transformation that removes redundant escape sequences, but ISO-2022-JP does not forbid them. Hence the C1 definition should probably be silently dropped in favour of C2.

APPENDIX D. ISO vs CCS-CES

The ISO character set framework has one fundamental concept: the ISO Coded Character Set (3.1). This appendix discusses the benefits of a "Character Set Calculus" based on two concepts, CCS (2.1) and CES (2.2).

As a result of the ISO "coded character set" definition (3.1)

- some standards built in accordance with this definition also define useless and awkward _"raw"_ CES's while their sole intent is to define CCS's. See (4.4), [COMPANION] (Companion 1.3, cjk.html#A1.3)
- "coded character sets" produced in accordance with ISO 2022 [ECMA 35] are considered to create a new CCS-CES from a set of pre-existing CCS-CES's. This includes creating a new CCS from a set of pre-existing CCS's.

(2.1) - (2.2) on the other hand allow

- several CES's (UTF-8, UTF-16BE, ...) to _encode_ the _same_ CCS (ISO 10646), not _reincarnate_ it
- a single CCS (JIS X 0208) encoded by several CES's (ISO-2022-JP, EUC-JP) to be treated as _the same_ CCS

Look how [RFC 2130] mandates a clear CCS, CES separation for future charset registrations:

    Most 'charsets' have a well defined CCS and CES, they should ... be teased apart for the registration.

Currently this has been done for two entries in the [IANA REG]:

    Name: Shift_JIS
    Source: ... The CCS's are JIS X0201:1997 and JIS X0208:1997

    Name: Windows-31J
    Source: ... NEC special characters ... NEC selection of IBM extensions ... and IBM extensions ... The CCS's are JIS X0201:1997, JIS X0208:1997, and these extensions.

It depends entirely on the point of view whether one thinks that, for example, EUC-JP defines a new CCS or reuses ASCII and JIS X 0208.
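
Either way, the octets themselves are easy to inspect. The following Python sketch assumes that Python's bundled "euc_jp" and "iso2022_jp" codecs are acceptable stand-ins for the EUC-JP and ISO-2022-JP CES's; the sample kanji and the variable names are chosen for illustration only. It extracts the pair of 7-bit values that each CES carries for the same character.

```python
# Sketch only: Python's codecs stand in for the two CES's.

ch = "\u6f22"  # a kanji that is part of JIS X 0208

euc = ch.encode("euc_jp")
iso = ch.encode("iso2022_jp")

# EUC-JP carries a JIS X 0208 pair with the high bit set on each octet.
pair_from_euc = bytes(b & 0x7F for b in euc)

# ISO-2022-JP carries the same pair in 7 bits, bracketed by the
# ESC $ B and ESC ( B escape sequences.
ESC_JISX0208 = b"\x1b$B"
ESC_ASCII = b"\x1b(B"
pair_from_iso = iso.replace(ESC_JISX0208, b"").replace(ESC_ASCII, b"")

print(euc.hex(), iso)
print(pair_from_euc.hex(), pair_from_iso.hex())

# ASCII characters pass through EUC-JP as single unchanged octets.
print("A".encode("euc_jp"))
```

Both CES's yield the same pair for the kanji and leave the ASCII letter untouched, which is what one would expect if the underlying CCS's are shared rather than redefined.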

The ISO (3.1) point of view is that EUC-JP defines a new CCS ('character set' in its wording).

The [RFC 2130] (2.2) point of view is that 'EUC-JP' encodes two pre-existing CCS's: ASCII and JIS X 0208.

The author's opinion is that it is optimal to keep to a minimum the number of entities involved in our "Character Set Calculus". We cannot reduce the number of encodings used in the world, but we can reduce the number of "secondary supplements" such as CCS's :)

That's why the CCS, CES framework (2) has been chosen as the basis for, and is advocated by, this survey.

APPENDIX E. "character set" MEANINGS

[RFC 2130]:

    The term 'Character Set' means many things to many people.

Truly so: both 'Coded Character Set' and 'Character Set' may mean

- ISO coded character set (3.1) = CCS-CES (2.1), (2.2)
- MIME charset (3.2) = encoding = CES (2.2)
- CCS (2.1)

(3.1) and (3.2) are very close. Rigorously, (3.1) is a subset of (3.2): every ISO 'coded character set' is a CES.

'Coded Character Set' means 'ISO coded character set' in ISO standards, for example [ISO 8859-14], [ECMA 35].

'Coded Character Set', 'Character Set' mean 'encoding' in all RFC's prior to [RFC 2130], for example [RFC 1345] (the second definition), [RFC 2045].

'Coded Character Set' means CCS in [RFC 2130], [RFC 2278].

'Character Set' means CCS in the ISO standards, [ISO 8859-14] for example, and probably in the [UNICODE GLOSSARY]. The latter is so short on the subject that in it 'Character Set' may also mean

- unordered set of characters (3.4)

However, it is the high degree of overloading of the 'character set' term that allows us to say 'character set standards' meaning the whole body of diverse standards :-)

APPENDIX F. REFERENCES

[RFC 2278] IANA Charset Registration Procedures. N. Freed, J. Postel. January 1998.
http://www.ietf.org/rfc/rfc2278.txt

[RFC 2277] IETF Policy on Character Sets and Languages. H. Alvestrand.
January 1998.
http://www.ietf.org/rfc/rfc2277.txt

[RFC 2130] The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996. C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin, P. Svanberg. April 1997.
http://www.ietf.org/rfc/rfc2130.txt

[ISO 8859-14] Final proof for Latin alphabet No. 8 (Celtic). Michael Everson. Sept 1998.
http://www.evertype.com/standards/iso8859/8859-14-en.pdf
(ISO standards are not easy to find online. This one is a fortunate exception :-)

[RFC 1345] Character Mnemonics and Character Sets. K. Simonsen. June 1992.
http://www.ietf.org/rfc/rfc1345.txt

[ECMA 35] Character Code Structure and Extension Techniques. Standard ECMA-35, 6th Edition. December 1994.
http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM
(This is a freely accessible analog of ISO 2022)

[UNICODE GLOSSARY] Unicode Glossary.
http://www.unicode.org/glossary/

[IANA REG] The Character Sets Registry (IANA registers charset values according to RFC 2278).
http://www.iana.org/assignments/character-sets

[DUERST] RE: modification to registration of charset ks_c_5601-1987. Martin Duerst. Jun 13 2001. Message in ietf-charsets archive.
http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html

[COMPANION] CJK CHARACTER SET STANDARDS CLASSIFICATION. April 2002.