USING CJK CODED CHARACTER SETS IN RAW ENCODING:
A COMMON SOURCE OF CONFUSION
Common Confusion With CJK Character Set Naming

Anton Tagunov 


Moscow State University
Scientific Computer Research Center
http://tagunov.tripod.com
tagunov@motor.ru


Abstract

This is a technical note on using "raw" encodings for
ISO 2022 compliant _coded character sets_
(especially CJK). It points out that these _coded
character sets_ may be used on their own, outside
the ISO 2022 or EUC frameworks. Perl is used
as an example.

-----------------------------------

The author would highly value every kind of
feedback on this article. Reports of errors and
inaccuracies, even the smallest, are of particular
interest (a pedantic reply would be more than
welcome :) A special area of interest is the
usage of the terms _character set_, _coded character
set_ and encoding. The author feels
that his usage of these terms can probably
be improved, and any remarks in this area
are welcome.

The author of this small note, being a postgraduate
student, would very much value publication of this
note and/or the forthcoming article in any kind of
on-line (or not on-line :-) periodical. The forthcoming
article is


"PROCESSING CHARACTER INFORMATION CODED IN MULTIPLE CODED
CHARACTER SETS.

(CONSISTENT SIMPLE STRING COLLATION, MATCHING TO CHARACTER
CLASSES, MERGING DATA. PERL AS A PROTOTYPE)"

(see the P.S. of this message for an abstract)

Help with this is requested, and
every kind of advice on possible publication is eagerly
accepted. Any kind of rework of this and the next article
is possible.

-----------------------------------

There is a certain confusion around the
JIS X 0201/0208/0212, GB 1988/2312, GB/T 18030 (and similar)
standards.

It is well known that all of them
provide 94 or 94x94 _coded character sets_
usable within the ISO 2022 and EUC frameworks.
The ISO-2022-JP, ISO-2022-CN, ISO-2022-KR, EUC-JP, EUC-CN,
EUC-TW, Shift_JIS (and other) encodings make use of
them.

But this is not the only use of these standards. In practice
they are also used on their own, as "raw" encodings.

RFC 1345 describes a number of (mostly ISO registered) 
coded character sets including

JIS_C6220-1969-jp = iso-ir-13  = katakana = x0201-7     (7-bit)
JIS_C6220-1969-ro = iso-ir-14  = jp = ISO646-JP         (7-bit)
GB_1988-80        = iso-ir-57  = cn = ISO646-CN         (7-bit)
JIS_C6226-1983    = iso-ir-87  = JIS_X0208-1983 = x0208 (2 x 7-bit)
JIS_X0212-1990    = iso-ir-159 = x0212                  (2 x 7-bit)
GB_2312-80        = iso-ir-58  = chinese                (2 x 7-bit)

JIS_X0201         = X0201                               (8-bit)


All these are JIS and GB standards used "raw": they
correspond to 94 or 94x94 _character sets_ plugged into the
GL (0x20-0x7F) region.

JIS_X0201 is special: it plugs two 94-character sets,
one into GL and one into GR (0xA0-0xFF).
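
The GL/GR split of JIS_X0201 can be illustrated with a minimal
sketch. It is written in Python 3 only because its standard codecs
make the octets easy to inspect; it is not the Perl interface
discussed in this note. It assumes that the single-byte range
0xA1-0xDF of Python's "shift_jis" codec carries the JIS X 0201
katakana set in GR. The helper name katakana7_to_gr is made up
for this example.

```python
# Sketch: the 7-bit katakana set (iso-ir-13, alias x0201-7) lives in GL.
# The 8-bit JIS_X0201 holds the same set in GR, i.e. with bit 8 set.
# Assumption: Python's "shift_jis" codec decodes the single bytes
# 0xA1-0xDF as JIS X 0201 katakana, so it serves as a stand-in decoder.

def katakana7_to_gr(raw7: bytes) -> bytes:
    """Move 7-bit katakana octets (0x21-0x5F) from GL into GR."""
    for b in raw7:
        if not 0x21 <= b <= 0x5F:
            raise ValueError(f"octet 0x{b:02X} is outside the katakana set")
    return bytes(b | 0x80 for b in raw7)

raw7 = bytes([0x31, 0x32, 0x33])   # three octets of 7-bit katakana
gr = katakana7_to_gr(raw7)         # the same set plugged into GR
text = gr.decode("shift_jis")      # halfwidth katakana A, I, U

print(gr.hex(), [f"U+{ord(c):04X}" for c in text])
```

The only difference between the two forms is the high bit of
each octet; the character repertoire is the same.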

GB/T 18030 has no "raw" coded character set for it in this RFC,
but, for instance, Perl 5.8.0 provides translation to/from a
coded character set named "GB 18030". The latter seems to be
a 7-bit, two-bytes-per-character encoding plugging GB/T 18030
into GL.

My assumption is that whenever we meet the name of a 94 or 94x94
_character set_ (pluggable into ISO 2022/EUC)
used as the name of an encoding, we should
understand that the _character set_ is in its "raw" encoding,
plugging itself into GL (0x20-0x7F).
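
To make this assumption concrete, here is a minimal sketch in
Python 3 (used only as a convenient stand-in; Perl's own interface
is discussed later). It relies on the fact that EUC-CN plugs
GB 2312 into GR, so clearing the high bit of every octet gives
the raw GL form. It assumes that Python's "gb2312" codec implements
EUC-CN. The function names to_raw and from_raw are hypothetical.

```python
# Sketch: "raw" GB_2312-80 obtained from EUC-CN and back.
# Assumption: Python's "gb2312" codec is EUC-CN, i.e. GB 2312 in GR.
# Only characters of the 94x94 set are handled; ASCII is rejected,
# because a raw two-byte encoding has no room for a second set in GL.

def to_raw(text: str) -> bytes:
    euc = text.encode("gb2312")
    if any(b < 0xA1 or b > 0xFE for b in euc):
        raise ValueError("text contains characters outside GB 2312")
    return bytes(b & 0x7F for b in euc)      # GR -> GL

def from_raw(raw: bytes) -> str:
    if len(raw) % 2 or any(b < 0x21 or b > 0x7E for b in raw):
        raise ValueError("not a raw 94x94 byte sequence")
    return bytes(b | 0x80 for b in raw).decode("gb2312")   # GL -> GR

raw = to_raw("\u4e2d\u6587")   # the two characters of "Chinese language"
print(raw)                     # every octet is in 0x21-0x7E
print(from_raw(raw))
```

Note that the raw octets are indistinguishable from printable
ASCII unless the reader knows which encoding is in use.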


I do not know how widely such "raw" encodings for
_character sets_ are used, but at least there exist fonts 
for X Window that seem to be designed to display data in 
such encodings:

-cc-song-medium-r-normal-jiantizi-40-400-75-75-c-400-gb2312.1980-0
-cc-song-medium-r-normal-jiantizi-48-480-75-75-c-480-gb2312.1980-0
-isas-fangsong ti-medium-r-normal--16-160-72-72-c-160-gb2312.1980-0
-isas-song ti-medium-r-normal--16-160-72-72-c-160-gb2312.1980-0
-isas-song ti-medium-r-normal--24-240-72-72-c-240-gb2312.1980-0

And I see
no reason to call these "raw" encodings for _character sets_
a wrong way to store data: after all, it is an efficient one.

= Terminology

The term _(coded) character set_ in this note follows
[ECMA 35], a freely accessible analog of ISO 2022:
"A set of unambiguous rules that establishes a character set
and one-to-one relationship between characters of the set
and their bit representation." This is essentially a rewording
of the [RFC 1345] definition of the same term.

It should be noted, however, that while all ISO 2022
compliant coded character sets are defined as having
0x20-0x7F code points only, they may be shifted, at need,
to the 0xA0-0xFF range. This is what possibly happens when
these coded character sets are used within the ISO 2022 and EUC
frameworks. In the raw encodings discussed here, however, all
these charsets occupy their natural 0x20-0x7F range (or, almost
always, only 0x21-0x7E, being 94-character character sets).

The term 'encoding' is used for the same thing as
_character set_, but it is understood that an encoding is
somehow prepared from 'raw' coded character sets, possibly
by one of the following recipes.

If we have an ISO 2022 conforming 94-character
character set, we may cook from it
- a raw encoding, by plugging the character set into the GL
  (0x21-0x7E) region
- an ISO 2022 family encoding, by possibly mixing this _character
  set_ with other ISO 2022 conforming _character sets_, possibly
  shifting them to the 0xA1-0xFE region (plugging into GR)
  and adding ESC codes for designation and invocation as it is
  done in ISO 2022
- an EUC family encoding, by possibly mixing it with another
  _character set_ and plugging one of them into GL and the other
  into GR

The main point of this note is that raw encodings exist and
may be used, for example, for input/output from/to Perl 5.8
programs.
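
The three recipes can be compared side by side for a single
JIS X 0208 character. The sketch below is in Python 3, not Perl,
and assumes that Python's "euc_jp" and "iso2022_jp" codecs are
acceptable stand-ins for the EUC and ISO 2022 families. Python
is not assumed to offer a raw JIS X 0208 codec, so the raw form
is derived by hand from the EUC-JP octets.

```python
# Sketch: one JIS X 0208 character in three encodings.
# Assumption: Python's "euc_jp" and "iso2022_jp" codecs stand in for
# the EUC and ISO 2022 families; the raw form is derived by hand.

ch = "\u3042"                          # HIRAGANA LETTER A

euc = ch.encode("euc_jp")              # set plugged into GR
raw = bytes(b & 0x7F for b in euc)     # set plugged into GL, nothing else
iso = ch.encode("iso2022_jp")          # GL octets wrapped in ESC sequences

for name, data in (("raw", raw), ("EUC-JP", euc), ("ISO-2022-JP", iso)):
    print(f"{name:12} {data.hex()}")
```

The raw octets reappear unchanged inside the ISO-2022-JP output,
between the escape sequence that designates the set and the one
that switches back to ASCII.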

I use the term "plug" a _character set_ into GL (GR) to denote
that octets in the 0x20-0x7F (0xA0-0xFF) range are mapped to the
characters in the corresponding _character set_. This is
similar to what would happen if, in the ISO 2022 framework,
we designated the _character set_ as G0 (G1) and then
invoked G0 (G1) into GL (GR).
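
"Plugging" can also be written down as plain arithmetic on the
row and cell numbers (1 to 94) by which the 94x94 standards index
their characters. The following Python 3 sketch is only an
illustration: the function name plug is made up for this note,
and Python's "gb2312" codec is assumed to be EUC-CN, used here
merely to check the GR result.

```python
# Sketch: "plugging" a 94x94 set into GL or GR, starting from the
# row/cell numbers (1-94) that the standards use to index characters.
# The function name plug() is made up for this note.

def plug(row: int, cell: int, region: str = "GL") -> bytes:
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be in 1..94")
    base = {"GL": 0x20, "GR": 0xA0}[region]
    return bytes([base + row, base + cell])

gl = plug(16, 1, "GL")      # raw encoding: 0x30 0x21
gr = plug(16, 1, "GR")      # what EUC-CN would carry: 0xB0 0xA1
print(gl.hex(), gr.hex(), f"U+{ord(gr.decode('gb2312')):04X}")
```

Row 16, cell 1 is the first Chinese character of GB 2312; the
two octet pairs differ only in the high bit.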

= Perl naming troubles

Although the forthcoming Perl 5.8 ships with an Encode module
that supports conversion between the internal Unicode representation
and all the "raw" encodings mentioned in this note, the names
expected for these encodings in the current development
releases of Perl are open to criticism.

Please consult the following table:

  IANA:          JIS_X0201        X0201
  RFC 1345:      JIS_X0201        X0201
  Perl:          JIS 0201

  IANA:          JIS_C6226-1983   JIS_X0208-1983   x0208  iso-ir-87
  RFC 1345:      JIS_C6226-1983   JIS_X0208-1983   x0208  iso-ir-87
  Perl:          JIS 0208

  IANA:          JIS_X0212-1990   x0212  iso-ir-159
  RFC 1345:      JIS_X0212-1990   x0212  iso-ir-159
  Perl:          JIS 0212

  IANA:          GB_1988-80     cn  iso-ir-57            (7-bit)
  RFC 1345:      GB_1988-80     cn  iso-ir-57  ISO646-CN (7-bit)
  Perl:          GB 1988    (8-bit; also includes
                             JIS_C6220-1969-jp == katakana
                             at 0xA1-0xFE; is this okay???)

  IANA:          GB_2312-80     chinese   iso-ir-58
  RFC 1345:      GB_2312-80     chinese   iso-ir-58
  Perl:          GB 2312

= Questions still vague to the author

RFC 1345, while specifying _coded character sets_ like
JIS_X0208-1983 and GB_2312-80, makes a vague comment:

  If the coded character set is
  a 96-character set, it is tabled with the relevant GL set 
  (normally ISO-IR-6) and with ISO 6429 as C0 and C1 (12).  
  If it is a 94-character set, it is tabled with the C0 set 
  of ISO 6429. If it is a double-octet coded character set, 
  it is tabled without control character sets and accompanying 
  one-octet coded character sets, and the two-octet code is 
  tabled as a G0 set.
  
What does this say about the possibility and rules of encoding
control characters (like CR and LF, for instance) with two-byte
7-bit 94x94 _coded character sets_ when
they are used in their raw encoding? Is it at all possible
to encode the control characters? As single or double bytes?
Readers' help in updating this subsection is more than
welcome.

Thanks to

  Autrijus Tang
  Brian McGuirk
  Alex Potter

for helping to improve this document

References

[ECMA 35] http://www.ecma.ch/ecma1/STAND/ECMA-035.HTM

[RFC 1345] Character Mnemonics and Character Sets. K. Simonsen.
           June 1992.
           http://www.ietf.org/rfc/rfc1345.txt
          
          
P.S.
Another article 

PROCESSING CHARACTER INFORMATION CODED IN MULTIPLE CODED
CHARACTER SETS.

(CONSISTENT SIMPLE STRING COLLATION, MATCHING TO CHARACTER
CLASSES, MERGING DATA. PERL AS A PROTOTYPE)

Abstract

Software components sometimes have to cooperatively use and
combine textual data coded in different character sets.
Comparison for equality, simple lexicographical ordering, automatic
and explicit conversions, and matching characters against character
classes ([[:space:]], [[:alpha:]]..) for textual data
coded in different coded character sets are the questions
discussed in this article.

is being prepared by the same author. He also hopes that
possible criticism of this article will help to polish
the wording of the next one.