A    Character Codes

In Japan, several Kanji codesets are in use. Each is a character codeset that contains character codes representing several thousand Kanji and non-Kanji characters. You must therefore explicitly specify, in the locale name, the Kanji codeset suitable for the data you want to process. This chapter describes the Tru64 UNIX Japanese character sets and the following supported codesets: 7-Bit JIS, 8-Bit JIS, Japanese EUC, DEC Kanji, Super DEC Kanji, Shift JIS, and UTF-8.

A.1    Character Sets

A set of correspondences between characters and codes is called a character set. The Japanese Industrial Standard (JIS) specifies the character sets listed in Table A-1. Each character in these character sets is represented by a 1- or 2-byte character code. However, only the low-order seven bits of each byte are used; the most significant bit (MSB) is always 0 (zero).

Table A-1:  Classification of Character Sets Based on the JIS

Class                                JIS Standard Number         Character Code
JIS Roman Characters                 JIS X 0201 LH               0xxxxxxx
JIS Katakana Characters              JIS X 0201 RH               0xxxxxxx
JIS Kanji Characters                 JIS X 0208 [Footnote 8]     0xxxxxxx 0xxxxxxx
JIS Supplemental Kanji Characters    JIS X 0212                  0xxxxxxx 0xxxxxxx

In addition to the character sets based on the Japanese Industrial Standard (JIS), the ASCII character set, which contains 1-byte character codes defined by the American National Standards Institute (ANSI), is widely used in Japan. The ASCII character set is almost the same as JIS X 0201 LH, except for the graphic characters assigned to the two codes listed in Table A-2.

Table A-2:  ASCII Versus JIS X 0201 LH

Character Code    ASCII             JIS X 0201 LH
0x5c              Back Slash (\)    Yen Sign (¥)
0x7e              Tilde (~)         Overline (‾)
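
The difference in Table A-2 can be sketched as a small lookup. This is a minimal illustration only: the function name `decode_7bit` and the choice of Unicode code points (U+00A5 for the yen sign, U+203E for the overline) are assumptions made for the example, not something this chapter defines.

```python
# Codes at which JIS X 0201 LH differs from ASCII (Table A-2).
# The Unicode targets are an assumption made for this sketch.
JIS_ROMAN_OVERRIDES = {
    0x5C: "\u00a5",  # yen sign instead of backslash
    0x7E: "\u203e",  # overline instead of tilde
}


def decode_7bit(code: int, jis_roman: bool) -> str:
    """Return the character for a 7-bit code under ASCII or JIS X 0201 LH."""
    if not 0x00 <= code <= 0x7F:
        raise ValueError("expected a 7-bit code")
    if jis_roman and code in JIS_ROMAN_OVERRIDES:
        return JIS_ROMAN_OVERRIDES[code]
    # Every other code has the same meaning in both character sets.
    return chr(code)
```

With this sketch, `decode_7bit(0x5C, jis_roman=True)` yields the yen sign, while the same code read as ASCII yields a backslash.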

A set of two or more of these character sets is called a codeset. Within a codeset, however, character codes belonging to different character sets overlap, so a code alone cannot be uniquely associated with a character. For this reason, some form of code extension is required for the character sets included in the codeset.

The International Organization for Standardization (ISO) specifies ISO 2022, which defines a system of character codes common to ISO member countries. The Japanese Industrial Standard translated ISO 2022 into the Japanese version numbered JIS X 0202. The following are the main codesets that were code-extended based on JIS X 0202:

The following codesets were code-extended specific to Japan, but are not based on JIS X 0202:

Table A-3 lists the character sets available for each of the codesets. In the table, Yes means that the codeset contains characters included in each character set.

Table A-3:  Character Sets Available for Each of the Codesets

Codeset            ASCII or JIS X 0201 LH    JIS X 0201 RH    JIS X 0208    JIS X 0212
7-Bit JIS          Yes                       Yes              Yes           Yes
8-Bit JIS          Yes                       Yes              Yes           Yes
Japanese EUC       Yes                       Yes              Yes           Yes
DEC Kanji          Yes                       No               Yes           No
Super DEC Kanji    Yes                       Yes              Yes           Yes
Shift JIS          Yes                       Yes              Yes           No

For details about each JIS standard, refer to the appropriate JIS table.

In ISO/IEC 10646, the International Organization for Standardization has defined the Universal Character Set (UCS), which is intended to represent all of the world's characters in a single codeset. The Japanese Industrial Standard has translated the UCS into JIS X 0221.

Tru64 UNIX supports UTF-8 encoding based on ISO/IEC 10646.

A.1.1    7-Bit JIS Kanji Codeset

In the 7-Bit JIS Kanji codeset, all the characters included in the JIS X 0201 LH, 0201 RH, 0208, and 0212 character sets are encoded into 7-bit character codes. Throughout this document, the 7-Bit JIS Kanji codeset is called the 7-Bit JIS Codeset.

Table A-4:  7-Bit JIS Codeset Character Codes

Character Set           Character Code
JIS X 0201 LH           0xxxxxxx
JIS X 0201 RH           0xxxxxxx
JIS X 0208              0xxxxxxx 0xxxxxxx
JIS X 0212              0xxxxxxx 0xxxxxxx
C0 Control Character    000xxxxx

The codes listed in Table A-4 overlap from one character set to another, so a code by itself does not identify a character uniquely. In the 7-Bit JIS Codeset, characters are distinguished by means of the code extension specified in JIS X 0202.

The codes are processed as follows:

You also have the option of using the Kana-in sequence ("ESC ( I" by default), instead of SO (0x0e) and SI (0x0f), to represent JIS X 0201 RH Katakana characters.
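
As a rough illustration of the SO (0x0e) and SI (0x0f) shifting mentioned above, the following sketch decodes a 7-bit stream that contains only JIS X 0201 characters. It is a simplification under stated assumptions: the function name `decode_x0201_7bit` is hypothetical, the escape sequences that select the Kanji character sets are not handled, LH codes are treated as ASCII, and the mapping of RH codes 0x21 to 0x5f onto the Unicode halfwidth Katakana block (U+FF61 to U+FF9F) is a choice made for the example.

```python
SO = 0x0E  # shift out: following codes are JIS X 0201 RH (Katakana)
SI = 0x0F  # shift in: following codes are JIS X 0201 LH (Roman)


def decode_x0201_7bit(data: bytes) -> str:
    """Decode a 7-bit JIS X 0201 stream that uses SO/SI shifting."""
    out = []
    kana = False  # start in the LH (Roman) state
    for code in data:
        if code == SO:
            kana = True
        elif code == SI:
            kana = False
        elif kana and 0x21 <= code <= 0x5F:
            # RH codes 0x21..0x5f map onto U+FF61..U+FF9F in this sketch.
            out.append(chr(0xFF61 + (code - 0x21)))
        else:
            # LH codes and control codes are passed through as ASCII here.
            out.append(chr(code))
    return "".join(out)
```

For example, the stream `b"A\x0e\x31\x32\x0fB"` decodes to "A", two halfwidth Katakana characters, and "B".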

Note

The 7-Bit JIS Codeset can be used for the terminal codes, but cannot be used to specify a locale name.

For details about the 7-Bit JIS Codeset, see JIS7(5).

A.1.2    8-Bit JIS Kanji Codeset

In the 8-Bit JIS Kanji codeset, all the characters included in the JIS X 0201 LH, 0208, and 0212 character sets are encoded into 7-bit character codes, and all the characters included in the JIS X 0201 RH character set are encoded into 8-bit character codes. Throughout this document, the 8-Bit JIS Kanji codeset is called the 8-Bit JIS Codeset. Table A-5 summarizes the encoding of the 8-Bit JIS Codeset characters.

Table A-5:  8-Bit JIS Codeset Character Codes

Character Set           Character Code
JIS X 0201 LH           0xxxxxxx
JIS X 0201 RH           1xxxxxxx
JIS X 0208              0xxxxxxx 0xxxxxxx
JIS X 0212              0xxxxxxx 0xxxxxxx
C0 Control Character    000xxxxx

The codes are processed as follows:

Note

The 8-Bit JIS Codeset can be used for the terminal codes, but cannot be used to specify a locale name.

For details about the 8-Bit JIS Codeset, see JIS8(5).

A.1.3    Japanese EUC Codeset

The Extended UNIX Code (EUC) is a coding scheme that was extended by AT&T Bell Laboratories and made available worldwide. The Japanese EUC codeset is the result of applying the EUC to Japanese. The extension is based on the UNIX Internal Coding Scheme for Japanese that was defined in the UNIX System Japanese Features Proposal, proposed in 1985 by the AT&T Japanese UNIX System Consultative Committee.

The EUC has four codesets, CS0 through CS3. Table A-6 summarizes the encoding of the Japanese EUC Codeset characters.

Table A-6:  Japanese EUC Codeset Character Codes

Code Set                Corresponding Character Set    Encoding
CS0                     ASCII                          0xxxxxxx
CS1                     JIS X 0208                     1xxxxxxx 1xxxxxxx
CS2                     JIS X 0201 RH                  SS2 1xxxxxxx
CS3                     JIS X 0212                     SS3 1xxxxxxx 1xxxxxxx
C0 Control Character                                   000xxxxx
C1 Control Character                                   100xxxxx
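
The bit patterns in Table A-6 are enough to split a Japanese EUC byte stream into characters without consulting any mapping table. The following is a minimal sketch of that idea, not a complete validator. The function name `split_euc_jp` is hypothetical, the SS2 and SS3 values (0x8e and 0x8f) are the single-shift codes used by this codeset, and restricting the first byte of a CS1 character to 0xa1 through 0xfe is an assumption that goes beyond the bit patterns shown in the table.

```python
SS2 = 0x8E  # single shift 2: introduces a CS2 (JIS X 0201 RH) character
SS3 = 0x8F  # single shift 3: introduces a CS3 (JIS X 0212) character


def split_euc_jp(data: bytes):
    """Split a Japanese EUC byte string into (code set, raw bytes) pairs."""
    result = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            name, length = "CS0", 1          # ASCII and C0 controls
        elif first == SS2:
            name, length = "CS2", 2          # SS2 + 1 byte
        elif first == SS3:
            name, length = "CS3", 3          # SS3 + 2 bytes
        elif first <= 0x9F:
            name, length = "C1", 1           # remaining C1 controls
        elif 0xA1 <= first <= 0xFE:
            name, length = "CS1", 2          # JIS X 0208
        else:
            raise ValueError(f"unexpected byte 0x{first:02x} at offset {i}")
        chunk = data[i:i + length]
        # Every trailing byte of a multibyte character has its MSB set.
        if len(chunk) != length or any(b < 0x80 for b in chunk[1:]):
            raise ValueError(f"malformed character at offset {i}")
        result.append((name, chunk))
        i += length
    return result
```

Because the first byte alone determines the length of each character, the stream can be scanned in a single forward pass.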

The following fields of the undefined JIS X 0208 and 0212 fields can be assigned as the user-defined character fields by the vendor or user:

Note

JIS reserves the 78th to 84th fields of the undefined JIS X 0212 fields for future extension. Thus, it is possible for the JIS to assign characters to these fields that will collide with user-defined characters.

For details about the Japanese EUC Codeset, see eucJP(5).

A.1.4    DEC Kanji Codeset

The DEC Kanji Codeset consists of the following character sets: ASCII (or JIS X 0201 LH), DEC Kanji Character Set 1983, and the DEC Extended Kanji Character Set.

An ASCII or JIS X 0201 LH character is represented by a 1-byte character code. The MSB of the byte is always 0 (zero).

DEC Kanji Character Set 1983 consists of a total of 6,877 characters, which include non-Kanji characters and Level-1 and Level-2 Kanji characters (see Table A-7). One character is represented by 2 bytes, and the MSB of each byte is always 1. The Ku-Ten numbers of this character set are the same as JIS X 0208-1983.

Table A-7:  DEC Kanji Character Set 1983

Ku Number       Assignment                                           Number of Characters
1st to 8th      Non-Kanji characters including special characters,   524
                digits, Roman characters, Hiragana characters,
                Katakana characters, Greek characters, Russian
                characters, and ruled lines
9th to 15th     JIS-reserved fields
16th to 47th    Level-1 Kanji characters                             2,965
48th to 84th    Level-2 Kanji characters                             3,388
85th to 94th    JIS-reserved fields

In the DEC Kanji Codeset, codes of 8,836 user-defined characters can be assigned to user-defined fields in the DEC Extended Kanji Character Set. Any character belonging to this extended character set contains a 1 in the MSB in the first byte and a 0 (zero) in the MSB in the second byte. The 32nd to 94th Ku fields are reserved by Compaq. Therefore, exceptional or user-defined characters can be assigned to the 1st to the 31st Ku fields in the DEC Extended Kanji Character Set.

Table A-8 summarizes DEC Kanji Codeset character codes.

Table A-8:  DEC Kanji Codeset Character Codes

Character Set                       Character Code
ASCII or JIS X 0201 LH              0xxxxxxx
JIS X 0208                          1xxxxxxx 1xxxxxxx
DEC Extended Kanji Character Set    1xxxxxxx 0xxxxxxx
C0 Control Character                000xxxxx
C1 Control Character                100xxxxx
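
Table A-8 implies that the MSB of the second byte is what distinguishes a JIS X 0208 character from a DEC Extended Kanji character. The sketch below classifies a 2-byte code on that basis and derives a Ku-Ten position from it. The function name `classify_dec_kanji` is hypothetical, and the Ku-Ten arithmetic (low seven bits of each byte minus 0x20, giving a 94 x 94 grid) is an assumption based on the usual JIS convention and on the figure of 8,836 (94 x 94) positions quoted above.

```python
def classify_dec_kanji(b1: int, b2: int):
    """Return (character set, ku, ten) for a 2-byte DEC Kanji code."""
    if not b1 & 0x80:
        raise ValueError("first byte of a 2-byte code must have its MSB set")
    # The MSB of the second byte selects the character set (Table A-8).
    if b2 & 0x80:
        charset = "JIS X 0208"
    else:
        charset = "DEC Extended Kanji Character Set"
    # Assumed convention: low seven bits minus 0x20 give the 1-based position.
    ku = (b1 & 0x7F) - 0x20
    ten = (b2 & 0x7F) - 0x20
    if not (1 <= ku <= 94 and 1 <= ten <= 94):
        raise ValueError("bytes fall outside the 94 x 94 Ku-Ten grid")
    return charset, ku, ten
```

Under these assumptions, the bytes 0xb4 0xc1 classify as JIS X 0208, 20th Ku, 33rd Ten, while 0xa1 0x21 classify as the first position of the DEC Extended Kanji Character Set.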

For details about the DEC Kanji Codeset, see the Kanji Code Test and deckanji(5).

A.1.5    Super DEC Kanji Codeset

The Super DEC Kanji Codeset is an extended version of the DEC Kanji Codeset that enables CS2 and CS3 of the Japanese EUC to be handled.

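The Super DEC Kanji Codeset consists of the following character sets: ASCII (or JIS X 0201 LH), DEC Kanji Character Set 1983, the DEC Extended Kanji Character Set, JIS X 0201 RH, and JIS X 0212-1990.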

The ASCII (or JIS X 0201 LH) codeset, DEC Kanji Character Set 1983, and DEC Extended Kanji Character Set are encoded in the same manner as the DEC Kanji Codeset.

Any character belonging to JIS X 0201 RH is represented by the 1-byte character data that follows SS2 (0x8e). The MSB of this byte is always 1.

Any character belonging to JIS X 0212-1990 is represented by the 2-byte character data that follows SS3 (0x8f). The MSB of each of these two bytes is always 1.

Table A-9 summarizes the Super DEC Kanji Codeset character codes.

Table A-9:  Super DEC Kanji Codeset Character Codes

Character Set                       Character Code
ASCII or JIS X 0201 LH              0xxxxxxx
JIS X 0208                          1xxxxxxx 1xxxxxxx
DEC Extended Kanji Character Set    1xxxxxxx 0xxxxxxx
JIS X 0201 RH                       SS2 1xxxxxxx
JIS X 0212                          SS3 1xxxxxxx 1xxxxxxx
C0 Control Character                000xxxxx
C1 Control Character                100xxxxx

In the Super DEC Kanji Codeset, exceptional or user-defined characters can be assigned to all Ku fields of the DEC Extended Kanji Character Set and to the undefined fields of JIS X 0208 and JIS X 0212 (see Table A-10).

Table A-10:  Super DEC Kanji User-Defined Fields

Character Set                       Assignable Ku Fields
JIS X 0208                          85th to 94th
JIS X 0212                          78th to 94th
DEC Extended Kanji Character Set    1st to 94th
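
Table A-10 can be expressed as a small lookup, which also makes it easy to count how many code positions each range provides. This is an illustrative sketch: the names `USER_DEFINED_KU`, `is_assignable`, and `assignable_positions` are hypothetical, and the count assumes that every Ku field holds 94 Ten positions.

```python
# Assignable Ku fields per character set, transcribed from Table A-10.
USER_DEFINED_KU = {
    "JIS X 0208": range(85, 95),                        # 85th to 94th
    "JIS X 0212": range(78, 95),                        # 78th to 94th
    "DEC Extended Kanji Character Set": range(1, 95),   # 1st to 94th
}

TEN_PER_KU = 94  # assumption: each Ku field holds 94 Ten positions


def is_assignable(charset: str, ku: int) -> bool:
    """Return True if the Ku field is open to user-defined characters."""
    return ku in USER_DEFINED_KU[charset]


def assignable_positions(charset: str) -> int:
    """Return the number of code positions in the assignable Ku fields."""
    return len(USER_DEFINED_KU[charset]) * TEN_PER_KU
```

Under the 94-Ten assumption, the DEC Extended Kanji Character Set range works out to 94 x 94 = 8,836 positions, which matches the figure given for the DEC Kanji Codeset.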

For details about the Super DEC Kanji Codeset, see sdeckanji(5).

A.1.6    Shift JIS Codeset

The Shift JIS Codeset is widely used in the personal computer world. Table A-11 summarizes encoding of its character sets.

Table A-11:  Encoding of Shift JIS Codes

Shift JIS Character Set                     Encoding
C0 Control Characters                       0x00 to 0x1f
Space                                       0x20
JIS X 0201 LH                               0x21 to 0x7e
Del                                         0x7f
Undefined                                   0x80
First byte of JIS X 0208 Kanji character    0x81 to 0x9f
Undefined                                   0xa0
JIS X 0201 RH                               0xa1 to 0xdf
First byte of JIS X 0208 Kanji character    0xe0 to 0xef
First byte of user-defined character        0xf0 to 0xfc
Undefined                                   0xfd to 0xff

Table A-12 summarizes the correspondence between the codes and Kanji characters and the correspondence between the user-defined characters and Ku-Ten numbers.

Table A-12:  Correspondence Between Kanji and User-Defined Characters and Ku-Ten Numbers

Kanji and User-Defined Characters    Ku-Ten Numbers
0x8140 . . . 0x819e                  JIS X 0208, 1st Ku, 1st Ten to 1st Ku, 94th Ten (except 0x817f)
0x819f . . . 0x81fc                  JIS X 0208, 2nd Ku, 1st Ten to 2nd Ku, 94th Ten
0x8240 . . . 0x829e                  JIS X 0208, 3rd Ku, 1st Ten to 3rd Ku, 94th Ten (except 0x827f)
0x829f . . . 0x82fc                  JIS X 0208, 4th Ku, 1st Ten to 4th Ku, 94th Ten
.                                    .
.                                    .
.                                    .
0x9f9f . . . 0x9ffc                  JIS X 0208, 62nd Ku, 1st Ten to 62nd Ku, 94th Ten
0xe040 . . . 0xe09e                  JIS X 0208, 63rd Ku, 1st Ten to 63rd Ku, 94th Ten (except 0xe07f)
.                                    .
.                                    .
.                                    .
0xef40 . . . 0xef9e                  JIS X 0208, 93rd Ku, 1st Ten to 93rd Ku, 94th Ten (except 0xef7f)
0xef9f . . . 0xeffc                  JIS X 0208, 94th Ku, 1st Ten to 94th Ku, 94th Ten
0xf040 . . . 0xf0fc                  User-defined characters (except 0xf07f)
.                                    .
.                                    .
.                                    .
0xfc40 . . . 0xfcfc                  User-defined characters (except 0xfc7f)

In the Shift JIS Codeset, user-defined characters can be assigned to 0xf040 to 0xfcfc (unless the second byte is one of 0x00 to 0x3f, 0x7f, and 0xfd to 0xff).
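
The correspondence between 2-byte Shift JIS codes and JIS X 0208 Ku-Ten numbers follows a regular pattern: each first byte covers two consecutive Ku fields, and the second byte selects both the Ku within that pair and the Ten. The following sketch shows the commonly used arithmetic for that conversion. The function name `sjis_to_kuten` is hypothetical, and the user-defined range (first byte 0xf0 to 0xfc) is deliberately rejected because it does not correspond to JIS X 0208 positions.

```python
def sjis_to_kuten(b1: int, b2: int):
    """Convert a 2-byte Shift JIS Kanji code to a JIS X 0208 (ku, ten) pair."""
    # Each first byte covers a pair of Ku fields; find the pair index.
    if 0x81 <= b1 <= 0x9F:
        pair = b1 - 0x81            # Ku 1 to 62
    elif 0xE0 <= b1 <= 0xEF:
        pair = b1 - 0xC1            # Ku 63 to 94
    else:
        raise ValueError(f"0x{b1:02x} is not a JIS X 0208 first byte")
    # The second byte selects the odd or even Ku of the pair and the Ten.
    if 0x40 <= b2 <= 0x7E:
        return 2 * pair + 1, b2 - 0x3F      # Ten 1 to 63 of the odd Ku
    if 0x80 <= b2 <= 0x9E:
        return 2 * pair + 1, b2 - 0x40      # Ten 64 to 94 of the odd Ku
    if 0x9F <= b2 <= 0xFC:
        return 2 * pair + 2, b2 - 0x9E      # Ten 1 to 94 of the even Ku
    raise ValueError(f"0x{b2:02x} is not a valid second byte")
```

The gap at 0x7f in the second byte is why the odd Ku fields need two separate branches, while the even Ku fields occupy one contiguous range.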

For details about Shift JIS Codeset, see shiftjis(5).

A.1.7    UTF-8 Codeset

The UTF-8 codeset encodes the UCS-4 (4-octet) and UCS-2 (2-octet) data defined in ISO/IEC 10646 into a single codeset that represents the universal characters. UTF-8 is a variable-length UCS interchange format in which every ASCII character is represented by a single byte. In addition, the encoding is designed so that no octet of a multibyte sequence falls in the ASCII range.

Table A-13:  UCS-4 Encoding Ranges and UTF-8 Bit Assignments

Encoding Range (Hex)     UTF-8 Octet String (Binary)
0000 0000-0000 007F      0xxxxxxx
0000 0080-0000 07FF      110xxxxx 10xxxxxx
0000 0800-0000 FFFF      1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF      11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF      111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF      1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
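
The bit assignments in Table A-13 translate directly into a short encoder. The following sketch is illustrative only: the function name `ucs4_to_utf8` is hypothetical, and it follows the six ranges of the table rather than any particular library's behavior.

```python
def ucs4_to_utf8(code: int) -> bytes:
    """Encode one UCS-4 value as UTF-8, following the ranges in Table A-13."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("UCS-4 value out of range")
    if code <= 0x7F:
        return bytes([code])  # ASCII is encoded as a single octet
    # (upper limit of range, lead-octet prefix, number of 10xxxxxx octets)
    ranges = (
        (0x7FF, 0xC0, 1),
        (0xFFFF, 0xE0, 2),
        (0x1FFFFF, 0xF0, 3),
        (0x3FFFFFF, 0xF8, 4),
        (0x7FFFFFFF, 0xFC, 5),
    )
    for limit, lead, extra in ranges:
        if code <= limit:
            break
    octets = []
    for _ in range(extra):
        # Each continuation octet carries the next six low-order bits.
        octets.append(0x80 | (code & 0x3F))
        code >>= 6
    # The bits that remain go into the lead octet after its prefix.
    octets.append(lead | code)
    return bytes(reversed(octets))
```

Note that the 5- and 6-octet forms come from the original ISO/IEC 10646 definition shown in the table. Later definitions of UTF-8 restrict the encoding to values up to 0010 FFFF, so most current decoders reject the longer forms.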

For details about UTF-8, see Unicode(5).