A Character Codes

Several Kanji codesets are in use in Japan. Each is a character codeset that contains codes for several thousand Kanji and non-Kanji characters. You must therefore specify, in the locale name, the Kanji codeset that suits the data you want to process. This appendix describes the Tru64 UNIX Japanese character sets and the following supported codesets:

7-Bit JIS Kanji

8-Bit JIS Kanji

Japanese Extended UNIX Code (EUC)

DEC Kanji

Super DEC Kanji

Shift JIS

UTF-8

A.1 Character Sets

A set of correspondences between characters and codes is called a character set. The Japanese Industrial Standard (JIS) specifies the character sets listed in Table A-1. Each character in these character sets is represented by a 1- or 2-byte character code. However, only the low-order seven bits of each byte are used; the most significant bit (MSB) is always 0 (zero).

Table A-1: Classification of Character Sets Based on the JIS

Class	JIS Standard Number	Character Code
JIS Roman Characters	JIS X 0201 LH	0xxxxxxx
JIS Katakana Characters	JIS X 0201 RH	0xxxxxxx
JIS Kanji Characters	JIS X 0208	0xxxxxxx 0xxxxxxx
JIS Supplemental Kanji Characters	JIS X 0212	0xxxxxxx 0xxxxxxx

In addition to the character sets based on the Japanese Industrial Standard (JIS), the ASCII character set, which contains 1-byte character codes defined by the American National Standards Institute (ANSI), is used widely in Japan. The ASCII character set is almost the same as JIS X 0201 LH, except for the graphic characters assigned to the two codes listed in Table A-2.

Table A-2: ASCII Versus JIS X 0201 LH

Character Code	ASCII	JIS X 0201 LH
0x5c	Backslash (\)	Yen Sign (¥)
0x7e	Tilde (~)	Overline (‾)
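The two differences in Table A-2 can be illustrated with a minimal Python sketch. The function name `decode_jis_x_0201_lh` and the table `LH_DIFFERENCES` are hypothetical names chosen for this example; the Unicode code points used for the yen sign and overline are an assumption about how the result should be displayed, not part of the JIS definition.

```python
# Codes whose graphic character differs between ASCII and JIS X 0201 LH
# (Table A-2). The Unicode targets are an assumption for display purposes.
LH_DIFFERENCES = {
    0x5C: "\u00a5",  # yen sign instead of backslash
    0x7E: "\u203e",  # overline instead of tilde
}


def decode_jis_x_0201_lh(data: bytes) -> str:
    """Interpret 7-bit bytes as JIS X 0201 LH rather than ASCII."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError(f"not a 7-bit code: {byte:#04x}")
        # Every code other than 0x5c and 0x7e matches ASCII.
        chars.append(LH_DIFFERENCES.get(byte, chr(byte)))
    return "".join(chars)
```

The same byte string therefore displays differently depending on which of the two character sets the terminal or font assumes.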

A set of two or more of these character sets is called a codeset. Within a codeset, however, the character codes that belong to different character sets overlap, so a code alone does not identify a character uniquely. For this reason, some form of code extension is required to distinguish the character sets included in the codeset.

The International Organization for Standardization (ISO) specifies ISO 2022, which defines the system of character codes common to ISO member countries. The Japanese version of ISO 2022 is the Japanese Industrial Standard JIS X 0202. The following are the main codesets that were code-extended based on JIS X 0202:

7-Bit JIS Kanji

8-Bit JIS Kanji

Japanese EUC

The following codesets were code-extended specific to Japan, but are not based on JIS X 0202:

DEC Kanji

Super DEC Kanji

Shift JIS

Table A-3 lists the character sets available for each of the codesets. In the table, Yes means that the codeset contains characters included in each character set.

Table A-3: Character Sets Available for Each of the Codesets

Codeset	ASCII or JIS X 0201 LH	JIS X 0201 RH	JIS X 0208	JIS X 0212
7-Bit JIS	Yes	Yes	Yes	Yes
8-Bit JIS	Yes	Yes	Yes	Yes
Japanese EUC	Yes	Yes	Yes	Yes
DEC Kanji	Yes	No	Yes	No
Super DEC Kanji	Yes	Yes	Yes	Yes
Shift JIS	Yes	Yes	Yes	No

For details about each JIS standard, refer to the appropriate JIS table.

In ISO/IEC 10646, the International Organization for Standardization has defined the Universal Character Set (UCS), which represents all characters in the world by a single codeset. The Japanese version of the UCS is the Japanese Industrial Standard JIS X 0221.

Tru64 UNIX supports UTF-8 encoding based on ISO/IEC 10646.

A.1.1 7-Bit JIS Kanji Codeset

In the 7-Bit JIS Kanji codeset, all the characters included in the JIS X 0201 LH, 0201 RH, 0208, and 0212 character sets are encoded into 7-bit character codes. Throughout this document, the 7-Bit JIS Kanji codeset is called the 7-Bit JIS Codeset.

Table A-4: 7-Bit JIS Codeset Character Codes

Character Set	Character Code
JIS X 0201 LH	0xxxxxxx
JIS X 0201 RH	0xxxxxxx
JIS X 0208	0xxxxxxx 0xxxxxxx
JIS X 0212	0xxxxxxx 0xxxxxxx
C0 Control Character	000xxxxx

The characters encoded as listed in Table A-4 overlap the codes of the other characters, and so, cannot be identified uniquely. For the 7-Bit JIS Codeset, the respective characters are identified by means of the code extension specified in JIS X 0202.

The codes are processed as follows:

All the codes appearing after the Kanji in sequence ("ESC $ B" by default) are processed as JIS X 0208 Kanji characters.

All the codes appearing after the Kanji out sequence ("ESC ( B" by default) are processed as ASCII characters.

All the codes appearing after the Supplemental Kanji in sequence ("ESC $ ( D" by default) are processed as JIS X 0212 Supplemental Kanji characters.

All the codes appearing after the Kanji out sequence ("ESC ( B" by default) are processed as ASCII characters.

All the codes from SO (0x0e) to SI (0x0f) are processed as JIS X 0201 RH Katakana characters.

You also have the option to use the Kana in sequence ("ESC ( I" by default), instead of SO (0x0e) and SI (0x0f), to represent JIS X 0201 RH Katakana characters.
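The processing rules above amount to a small state machine. The following Python sketch shows one way to split a 7-Bit JIS byte string into (character set, code) pairs using the default sequences. The function name `split_jis7` is hypothetical, the escape sequences are written without the spaces used in the text, and the choice to return to the previously active character set at SI is an assumption of this sketch rather than something this section specifies.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Default sequences (the bytes that follow ESC) and the state they select:
# (character set name, bytes per character).
SEQUENCES = {
    b"$B": ("JIS X 0208", 2),     # Kanji in
    b"(B": ("ASCII", 1),          # Kanji out
    b"$(D": ("JIS X 0212", 2),    # Supplemental Kanji in
    b"(I": ("JIS X 0201 RH", 1),  # Kana in
}


def split_jis7(data: bytes):
    """Return (character set, code bytes) pairs for a 7-Bit JIS string."""
    state = ("ASCII", 1)
    saved = None  # state to restore at SI (assumption of this sketch)
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte == ESC:
            for seq, new_state in SEQUENCES.items():
                if data.startswith(seq, i + 1):
                    state = new_state
                    i += 1 + len(seq)
                    break
            else:
                raise ValueError(f"unknown escape sequence at offset {i}")
        elif byte == SO:
            saved, state = state, ("JIS X 0201 RH", 1)
            i += 1
        elif byte == SI:
            state, saved = saved or ("ASCII", 1), None
            i += 1
        elif byte < 0x20:
            # Remaining C0 control characters pass through unchanged.
            out.append(("C0", data[i:i + 1]))
            i += 1
        else:
            name, width = state
            out.append((name, data[i:i + width]))
            i += width
    return out
```

For example, `split_jis7(b"A\x1b$B\x30\x21\x1b(B!")` yields an ASCII character, one 2-byte JIS X 0208 code, and another ASCII character.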

Note

The 7-Bit JIS Codeset can be used for the terminal codes, but cannot be used to specify a locale name.

For details about the 7-Bit JIS Codeset, see JIS7(5).

A.1.2 8-Bit JIS Kanji Codeset

In the 8-Bit JIS Kanji codeset, all the characters included in the JIS X 0201 LH, 0208, and 0212 character sets are encoded into 7-bit character codes and all the characters included in the JIS X 0201 RH character set are encoded into 8-bit character codes. Throughout this document, the 8-Bit JIS Kanji codeset is called the 8-Bit JIS Codeset. Table A-5 summarizes encoding of the 8-Bit JIS Codeset characters.

Table A-5: 8-Bit JIS Codeset Character Codes

Character Set	Character Code
JIS X 0201 LH	0xxxxxxx
JIS X 0201 RH	1xxxxxxx
JIS X 0208	0xxxxxxx 0xxxxxxx
JIS X 0212	0xxxxxxx 0xxxxxxx
C0 Control Character	000xxxxx

The codes are processed as follows:

All the codes appearing after the Kanji in sequence ("ESC $ B" by default) are processed as JIS X 0208 Kanji characters.

All the codes appearing after the Kanji out sequence ("ESC ( B" by default) are processed as ASCII characters.

All the codes appearing after the Supplemental Kanji in sequence ("ESC $ ( D" by default) are processed as JIS X 0212 Supplemental Kanji characters.

All the codes appearing after the Kanji out sequence ("ESC ( B" by default) are processed as ASCII characters.

Data with a 1 in the MSB is processed as a JIS X 0201 RH Katakana character.

SI and SO, used for the 7-Bit JIS Codeset, are ignored.

Note

The 8-Bit JIS Codeset can be used for the terminal codes, but cannot be used to specify a locale name.

For details about the 8-Bit JIS Codeset, see JIS8(5).

A.1.3 Japanese EUC Codeset

The Extended UNIX Code (EUC) is a coding scheme that was extended by AT&T Bell Laboratories and made available worldwide. The Japanese EUC codeset is the result of applying the EUC to Japanese. The extension is based on the UNIX Internal Coding Scheme for Japanese that was defined in the UNIX System Japanese Features Proposal, proposed in 1985 by the AT&T Japanese UNIX System Consultative Committee.

The EUC has four codesets:

CS0 -- Any character belonging to CS0 is encoded so that the MSB is always set to 0 (zero). In any EUC worldwide, CS0 is defined as the ASCII character set.

CS1 -- Any character belonging to CS1 is encoded so that the MSB of each byte of the code is always set to 1. In the Japanese EUC Codeset, CS1 is JIS X 0208.

CS2 -- Any character belonging to CS2 is encoded so that, following the SS2 code (0x8e), the MSB of each byte of the code is always set to 1. In the Japanese EUC Codeset, CS2 is JIS X 0201 RH.

CS3 -- Any character belonging to CS3 is encoded so that, following the SS3 code (0x8f), the MSB of each byte of the code is always set to 1. In the Japanese EUC Codeset, CS3 is JIS X 0212.

Table A-6 summarizes encoding of the Japanese EUC Codeset characters.

Table A-6: Japanese EUC Codeset Character Codes

Code Set	Corresponding Character Set	Encoding
CS0	ASCII	0xxxxxxx
CS1	JIS X 0208	1xxxxxxx 1xxxxxxx
CS2	JIS X 0201 RH	SS2 1xxxxxxx
CS3	JIS X 0212	SS3 1xxxxxxx 1xxxxxxx
C0 Control Character		000xxxxx
C1 Control Character		100xxxxx
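As an illustration of Table A-6, the following Python sketch splits a Japanese EUC byte string into its code sets by looking only at the first byte of each character. The function name `split_eucjp` is hypothetical, and the restriction of CS1 lead bytes to 0xa1 through 0xfe is an assumption of this sketch (the table itself shows only the MSB).

```python
SS2, SS3 = 0x8E, 0x8F


def split_eucjp(data: bytes):
    """Return (code set, code bytes) pairs for a Japanese EUC byte string."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            name, width = "CS0", 1   # ASCII, including C0 controls
        elif first == SS2:
            name, width = "CS2", 2   # SS2 + 1 byte of JIS X 0201 RH
        elif first == SS3:
            name, width = "CS3", 3   # SS3 + 2 bytes of JIS X 0212
        elif first <= 0x9F:
            name, width = "C1", 1    # other C1 control characters
        elif 0xA1 <= first <= 0xFE:
            name, width = "CS1", 2   # 2 bytes of JIS X 0208
        else:
            raise ValueError(f"invalid lead byte {first:#04x} at offset {i}")
        chunk = data[i:i + width]
        # Every byte after the first must also have the MSB set.
        if len(chunk) < width or any(b < 0x80 for b in chunk[1:]):
            raise ValueError(f"malformed character at offset {i}")
        out.append((name, chunk))
        i += width
    return out
```

Because SS2 and SS3 are tested before the general C1 range, a single pass over the lead bytes is enough to find every character boundary.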

The following fields of the undefined JIS X 0208 and 0212 fields can be assigned as the user-defined character fields by the vendor or user:

85th to 94th Ku fields of the undefined JIS X 0208 fields

78th to 94th Ku fields of the undefined JIS X 0212 fields

Note

JIS reserves the 78th to 84th fields of the undefined JIS X 0212 fields for future extension. Thus, it is possible for the JIS to assign characters to these fields that will collide with user-defined characters.

For details about the Japanese EUC Codeset, see eucJP(5).

A.1.4 DEC Kanji Codeset

The DEC Kanji Codeset consists of the following character sets:

ASCII or JIS X 0201 LH

DEC Kanji Character Set 1983

User-defined fields in DEC Extended Kanji Character Set

An ASCII character or JIS X 0201 LH is represented by a 1-byte character code. The MSB of each byte is always 0 (zero).

DEC Kanji Character Set 1983 consists of a total of 6,877 characters, which include non-Kanji characters and Level-1 and Level-2 Kanji characters (see Table A-7). One character is represented by 2 bytes, and the MSB of each byte is always 1. The Ku-Ten numbers of this character set are the same as JIS X 0208-1983.

Table A-7: DEC Kanji Character Set 1983

Ku Number	Assignment	Characters
1st to 8th	Non-Kanji characters including special characters, digits, Roman characters, Hiragana characters, Katakana characters, Greek characters, Russian characters, and ruled lines	524
9th to 15th	JIS-reserved fields
16th to 47th	Level-1 Kanji characters	2,965
48th to 84th	Level-2 Kanji characters	3,388
85th to 94th	JIS-reserved fields

In the DEC Kanji Codeset, codes of 2,914 user-defined characters (31 Ku of 94 characters each) can be assigned to user-defined fields in the DEC Extended Kanji Character Set. Any character belonging to this extended character set contains a 1 in the MSB of the first byte and a 0 (zero) in the MSB of the second byte. The 32nd to 94th Ku fields are reserved by Compaq. Therefore, exceptional or user-defined characters can be assigned only to the 1st to the 31st Ku fields in the DEC Extended Kanji Character Set.

Table A-8 summarizes DEC Kanji Codeset character codes.

Table A-8: DEC Kanji Codeset Character Codes

Character Set	Character Code
ASCII or JIS X 0201 LH	0xxxxxxx
JIS X 0208	1xxxxxxx 1xxxxxxx
DEC Extended Kanji Character Set	1xxxxxxx 0xxxxxxx
C0 Control Character	000xxxxx
C1 Control Character	100xxxxx
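Table A-8 implies that a DEC Kanji reader must inspect the MSB of the second byte to tell a JIS X 0208 character from a DEC Extended Kanji character. The following Python sketch shows that decision; the function name `split_dec_kanji` is hypothetical, and treating every byte from 0xa0 upward as the first byte of a 2-byte character is a simplifying assumption.

```python
def split_dec_kanji(data: bytes):
    """Return (character set, code bytes) pairs for a DEC Kanji string."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            # MSB clear: 1-byte code (C0 controls are included here).
            out.append(("ASCII or JIS X 0201 LH", data[i:i + 1]))
            i += 1
        elif first < 0xA0:
            # 100xxxxx: C1 control character.
            out.append(("C1", data[i:i + 1]))
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError(f"truncated character at offset {i}")
            second = data[i + 1]
            # The MSB of the second byte selects the character set.
            if second & 0x80:
                name = "JIS X 0208"
            else:
                name = "DEC Extended Kanji Character Set"
            out.append((name, data[i:i + 2]))
            i += 2
    return out
```

Note that the second byte of a DEC Extended Kanji character falls in the ASCII range, so software that scans for ASCII bytes without tracking character boundaries can misread it.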

For details about the DEC Kanji Codeset, see the Kanji Code Table and deckanji(5).

A.1.5 Super DEC Kanji Codeset

The Super DEC Kanji Codeset is an extended version of the DEC Kanji Codeset that enables CS2 and CS3 of the Japanese EUC to be handled.

The Super DEC Kanji Codeset consists of the following character sets:

ASCII or JIS X 0201 LH

DEC Kanji Character Set 1983

DEC Extended Kanji Character Set

JIS X 0201 RH

JIS X 0212-1990

The ASCII (or JIS X 0201 LH) character set, DEC Kanji Character Set 1983, and DEC Extended Kanji Character Set are encoded in the same manner as in the DEC Kanji Codeset.

Any character belonging to JIS X 0201 RH is represented by the 1-byte character data that follows SS2 (0x8e). The MSB of this byte is always 1.

Any character belonging to JIS X 0212-1990 is represented by the 2-byte character data that follows SS3 (0x8f). The MSB of each of these bytes is always 1.

Table A-9 summarizes the Super DEC Kanji Codeset character codes.

Table A-9: Super DEC Kanji Codeset Character Codes

Character Set	Character Code
ASCII or JIS X 0201 LH	0xxxxxxx
JIS X 0208	1xxxxxxx 1xxxxxxx
DEC Extended Kanji Character Set	1xxxxxxx 0xxxxxxx
JIS X 0201 RH	SS2 1xxxxxxx
JIS X 0212	SS3 1xxxxxxx 1xxxxxxx
C0 Control Character	000xxxxx
C1 Control Character	100xxxxx

In the Super DEC Kanji Codeset, exceptional or user-defined characters can be assigned to every field in the DEC Extended Kanji Character Set and to the undefined fields in JIS X 0208 and JIS X 0212 (see Table A-10).

Table A-10: Super DEC Kanji User-Defined Fields

Character Set	Assignable Ku Fields
JIS X 0208	85th to 94th
JIS X 0212	78th to 94th
DEC Extended Kanji Character Set	1st to 94th

For details about the Super DEC Kanji Codeset, see sdeckanji(5).

A.1.6 Shift JIS Codeset

The Shift JIS Codeset is widely used in the personal computer world. Table A-11 summarizes encoding of its character sets.

Table A-11: Encoding of Shift JIS Codes

Shift JIS Character Set	Encoding
C0 Control Characters	0x00 to 0x1f
Space	0x20
JIS X 0201 LH	0x21 to 0x7e
Del	0x7f
Undefined	0x80
First byte of JIS X 0208 Kanji character	0x81 to 0x9f
Undefined	0xa0
JIS X 0201 RH	0xa1 to 0xdf
First byte of JIS X 0208 Kanji character	0xe0 to 0xef
First byte of user-defined character	0xf0 to 0xfc
Undefined	0xfd to 0xff

Table A-12 summarizes the correspondence between the codes and Kanji characters and the correspondence between the user-defined characters and Ku-Ten numbers.

Table A-12: Correspondence Between Kanji and User-Defined Characters and Ku-Ten Numbers

Kanji and User-Defined Characters	Ku-Ten Numbers
0x8140 . . . 0x819e	JIS X 0208, 1st Ku, 1st Ten to 1st Ku, 94th Ten (except 0x817f)
0x819f . . . 0x81fc	JIS X 0208, 2nd Ku, 1st Ten to 2nd Ku, 94th Ten
0x8240 . . . 0x829e	JIS X 0208, 3rd Ku, 1st Ten to 3rd Ku, 94th Ten (except 0x827f)
0x829f . . . 0x82fc	JIS X 0208, 4th Ku, 1st Ten to 4th Ku, 94th Ten
. . .	. . .
0x9f9f . . . 0x9ffc	JIS X 0208, 62nd Ku, 1st Ten to 62nd Ku, 94th Ten
0xe040 . . . 0xe09e	JIS X 0208, 63rd Ku, 1st Ten to 63rd Ku, 94th Ten (except 0xe07f)
. . .	. . .
0xef40 . . . 0xef9e	JIS X 0208, 93rd Ku, 1st Ten to 93rd Ku, 94th Ten (except 0xef7f)
0xef9f . . . 0xeffc	JIS X 0208, 94th Ku, 1st Ten to 94th Ku, 94th Ten
0xf040 . . . 0xf0fc	User-defined characters (except 0xf07f)
. . .	. . .
0xfc40 . . . 0xfcfc	User-defined characters (except 0xfc7f)

In the Shift JIS Codeset, user-defined characters can be assigned to 0xf040 to 0xfcfc, except for codes whose second byte is 0x00 to 0x3f, 0x7f, or 0xfd to 0xff.
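The correspondence in Table A-12 follows a regular pattern: each first byte covers two consecutive Ku, with second bytes 0x40 to 0x9e (skipping 0x7f) for the odd Ku and 0x9f to 0xfc for the even Ku. The following Python sketch derives Ku-Ten numbers from that pattern. The function name `sjis_to_kuten` is hypothetical, and the sketch handles only the JIS X 0208 ranges, not the user-defined area.

```python
def sjis_to_kuten(code: bytes):
    """Convert a 2-byte Shift JIS Kanji code to its (Ku, Ten) numbers."""
    if len(code) != 2:
        raise ValueError("expected exactly two bytes")
    first, second = code
    # Each first byte covers a pair of Ku; 0xa0 to 0xdf are skipped.
    if 0x81 <= first <= 0x9F:
        pair = first - 0x81
    elif 0xE0 <= first <= 0xEF:
        pair = first - 0xC1
    else:
        raise ValueError(f"not a JIS X 0208 first byte: {first:#04x}")
    if not 0x40 <= second <= 0xFC or second == 0x7F:
        raise ValueError(f"invalid second byte: {second:#04x}")
    if second >= 0x9F:
        # Upper half of the second-byte range: even Ku.
        return 2 * pair + 2, second - 0x9E
    if second >= 0x80:
        # Odd Ku, after the gap at 0x7f.
        return 2 * pair + 1, second - 0x40
    return 2 * pair + 1, second - 0x3F
```

For instance, `sjis_to_kuten(b"\x81\x40")` gives Ku 1, Ten 1, and `sjis_to_kuten(b"\xef\xfc")` gives Ku 94, Ten 94, matching the first and last JIS X 0208 rows of the table.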

For details about the Shift JIS Codeset, see shiftjis(5).

A.1.7 UTF-8 Codeset

The UTF-8 codeset encodes the UCS-4 (4-octet) and UCS-2 (2-octet) data defined in ISO/IEC 10646 into a single codeset to represent the universal characters. UTF-8 is a UCS interchange format that uses variable-length codes and represents any ASCII code by one byte. In addition, no byte of a multibyte code falls in the ASCII range, so an ASCII code never appears inside a multibyte character.

Table A-13: UCS-4 Encoding Ranges and UTF-8 Bit Assignments

Encoding Range (Hex)	UTF-8 Octet String (Binary)
0000 0000-0000 007F	0xxxxxxx
0000 0080-0000 07FF	110xxxxx 10xxxxxx
0000 0800-0000 FFFF	1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF	111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF	1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
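The bit assignments in Table A-13 can be turned into an encoder directly. The following Python sketch is one way to do so; the function name `ucs4_to_utf8` is hypothetical, and the sketch covers the full 31-bit range of the table, including the 5- and 6-octet forms.

```python
# (upper limit of range, marker bits of the first octet, trailing octets)
RANGES = (
    (0x7FF, 0xC0, 1),
    (0xFFFF, 0xE0, 2),
    (0x1FFFFF, 0xF0, 3),
    (0x3FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
)


def ucs4_to_utf8(code: int) -> bytes:
    """Encode one UCS-4 value as a UTF-8 octet string."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS-4 range")
    if code <= 0x7F:
        return bytes([code])  # ASCII is represented by one byte
    for limit, marker, trailing in RANGES:
        if code <= limit:
            # The first octet carries the marker and the highest bits.
            first = marker | (code >> (6 * trailing))
            # Each trailing octet is 10xxxxxx with the next six bits.
            rest = [0x80 | ((code >> (6 * k)) & 0x3F)
                    for k in range(trailing - 1, -1, -1)]
            return bytes([first] + rest)
    raise AssertionError("unreachable")
```

Every octet of a multibyte result has its MSB set, which is why ASCII bytes never occur inside a multibyte character. For values that Python's built-in encoder accepts, the output can be compared with `chr(code).encode("utf-8")`.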

For details about UTF-8, see Unicode(5).