In Japan, there are several Kanji codesets, which are character codesets that contain character codes representing several thousand Kanji and non-Kanji characters. Thus, you must explicitly specify in the locale name the suitable Kanji codeset for the data you want to process. This chapter describes the Tru64 UNIX Japanese character sets and the following supported codesets:
7-Bit JIS Kanji
8-Bit JIS Kanji
Japanese Extended UNIX Code (EUC)
DEC Kanji
Super DEC Kanji
Shift JIS
UTF-8
A set of correspondences between characters and codes is called a character
set.
The Japanese Industrial Standard (JIS) specifies the character sets listed
in
Table A-1.
Each character in these character sets is
represented by a 1- or 2-byte character code.
However, only the seven bits
of each byte of a character are used; a 0 (zero) is contained in the most
significant bit (MSB).
Table A-1: Classification of Character Sets Based on the JIS
Class | JIS Standard Number | Character Code |
JIS Roman Character | JIS X 0201 LH | 0xxxxxxx |
JIS Katakana Characters | JIS X 0201 RH | 0xxxxxxx |
JIS Kanji Characters | JIS X 0208 [Footnote 8] | 0xxxxxxx 0xxxxxxx |
JIS Supplemental Kanji Characters | JIS X 0212 | 0xxxxxxx 0xxxxxxx |
In addition
to the character set based on the Japanese Industrial Standard (JIS), the
ASCII character set, which contains 1-byte character codes defined by the
American National Standards Institute (ANSI), is used widely in Japan.
The
ASCII character set is almost the same as JIS X 0201 LH, except for the graphic
characters assigned to the two codes listed in
Table A-2.
Table A-2: ASCII Versus JIS X 0201 LH
Character Code | ASCII | JIS X 0201 LH |
0x5c | Back Slash (\) | Yen Sign (¥) |
0x7e | Tilde (~) | Overline (‾) |
A set of two or more of these character sets is called a codeset. However, within a codeset, character codes belonging to different character sets overlap and characters cannot be uniquely associated with codes. For this reason, some code extension is required for the character set included in the codeset.
The International Organization for Standardization (ISO) specifies ISO 2022, which defines the system of character codes common to ISO countries. The Japanese Industrial Standard translated ISO 2022 into the Japanese version numbered JIS X 0202. The following are the main codesets that were code-extended based on JIS X 0202:
7-Bit JIS Kanji
8-Bit JIS Kanji
Japanese EUC
The following codesets were code-extended specific to Japan, but are not based on JIS X 0202:
DEC Kanji
Super DEC Kanji
Shift JIS
Table A-3
lists the character sets available
for each of the codesets.
In the table, Yes means that the codeset contains
characters included in each character set.
Table A-3: Character Sets Available for Each of the Codesets
Codeset | ASCII or JIS X 0201 LH | JIS X 0201 RH | JIS X 0208 | JIS X 0212 |
7-Bit JIS | Yes | Yes | Yes | Yes |
8-Bit JIS | Yes | Yes | Yes | Yes |
Japanese EUC | Yes | Yes | Yes | Yes |
DEC Kanji | Yes | No | Yes | No |
Super DEC Kanji | Yes | Yes | Yes | Yes |
Shift JIS | Yes | Yes | Yes | No |
For details about each JIS standard, refer to the appropriate JIS table.
In ISO/IEC 10646, the International Standards Organization has defined the Universal Character Set (UCS), which represents all characters in the world by a single codeset. The Japanese Industrial Standard has translated the UCS into JIS X 0221.
Tru64 UNIX supports UTF-8 encoding based on ISO/IEC 10646.
A.1.1 7-Bit JIS Kanji Codeset
In the 7-bit JIS Kanji codeset, all the characters included in
the JIS X 0201 LH, 0201 RH, 0208, and 0212 character sets are encoded into
7-bit character codes.
Throughout this document, the 7-Bit JIS Kanji
codeset is called the 7-Bit JIS Codeset.
Table A-4: 7-Bit JIS Codeset Character Codes
Character Set | Character Code |
JIS X 0201 LH | 0xxxxxxx |
JIS X 0201 RH | 0xxxxxxx |
JIS X 0208 | 0xxxxxxx 0xxxxxxx |
JIS X 0212 | 0xxxxxxx 0xxxxxxx |
C0 Control Character | 000xxxxx |
The characters encoded as listed in Table A-4 overlap the codes of the other characters, and so, cannot be identified uniquely. For the 7-Bit JIS Codeset, the respective characters are identified by means of the code extension specified in JIS X 0202.
The codes are processed as follows:
All the codes appearing after the Kanji in sequence ("ESC $ B" by default) are processed as JIS X 0208 Kanji characters.
All the codes appearing after the Kanji out sequence ("ESC(B" by default) are processed as ASCII characters.
All the codes appearing after the Supplemental Kanji in sequence ("ESC$(D" by default) are processed as JIS X 0212 Supplemental Kanji characters.
All the codes appearing after the Kanji out sequence ("ESC(B" by default) are processed as ASCII characters.
All the codes from SO (0x0e) to SI (0x0f) are processed as JIS X 0201 RH Katakana characters.
You also have the option to use the Kana in sequence ("ESC(I" by default), instead of SO (0x0e) and SI (0x0f), to represent JIS X 0201 RH Katakana characters.
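The processing rules above amount to a small state machine. The following Python sketch shows one way to express it; it is not the Tru64 UNIX implementation. The function name segment_jis7, the ASCII initial state, and the choice to restore the previously active character set at SI are assumptions made for illustration, and only the default sequences listed above are recognized.

```python
ESC, SO, SI = 0x1B, 0x0E, 0x0F

# Default designation sequences named in the list above.
SEQUENCES = {
    b"\x1b$B": "JIS X 0208",      # Kanji in
    b"\x1b(B": "ASCII",           # Kanji out
    b"\x1b$(D": "JIS X 0212",     # Supplemental Kanji in
    b"\x1b(I": "JIS X 0201 RH",   # Kana in
}
# Number of bytes per character in each character set.
WIDTH = {"ASCII": 1, "JIS X 0201 RH": 1, "JIS X 0208": 2, "JIS X 0212": 2}


def segment_jis7(data: bytes):
    """Return a list of (character set, code bytes) pairs."""
    out = []
    charset = "ASCII"   # assumed initial state
    saved = None        # set that was active before SO (assumption)
    i = 0
    while i < len(data):
        b = data[i]
        if b == ESC:
            for seq, name in SEQUENCES.items():
                if data.startswith(seq, i):
                    charset = name
                    i += len(seq)
                    break
            else:
                raise ValueError(f"unknown escape sequence at offset {i}")
        elif b == SO:
            saved, charset = charset, "JIS X 0201 RH"
            i += 1
        elif b == SI:
            charset = saved if saved is not None else "ASCII"
            saved = None
            i += 1
        elif b < 0x20:
            # Other C0 control characters pass through unchanged.
            out.append(("C0", data[i:i + 1]))
            i += 1
        else:
            n = WIDTH[charset]
            out.append((charset, data[i:i + n]))
            i += n
    return out
```

For example, the input ESC $ B 0x24 0x22 ESC ( B yields a single JIS X 0208 entry holding the two bytes 0x24 0x22.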
Note
The 7-Bit JIS Codeset can be used for the terminal codes, but cannot be used to specify a locale name.
For details about the 7-Bit JIS Codeset, see JIS7(5).
A.1.2 8-Bit JIS Kanji Codeset
In the 8-Bit JIS Kanji codeset, all the characters included in the JIS
X 0201 LH, 0208, and 0212 character sets are encoded into 7-bit character
codes and all the characters included in the JIS X 0201 RH character set are
encoded into 8-bit character codes.
Throughout this document, the 8-Bit JIS
Kanji codeset is called the 8-Bit JIS Codeset.
Table A-5
summarizes encoding of the 8-Bit JIS Codeset characters.
Table A-5: 8-Bit JIS Codeset Character Codes
Character Set | Character Code |
JIS X 0201 LH | 0xxxxxxx |
JIS X 0201 RH | 1xxxxxxx |
JIS X 0208 | 0xxxxxxx 0xxxxxxx |
JIS X 0212 | 0xxxxxxx 0xxxxxxx |
C0 Control Character | 000xxxxx |
The codes are processed as follows:
All the codes appearing after the Kanji in sequence ("ESC$B" by default) are processed as JIS X 0208 Kanji characters.
All the codes appearing after the Kanji out sequence ("ESC(B" by default) are processed as ASCII characters.
All the codes appearing after the Supplemental Kanji in sequence ("ESC$(D" by default) are processed as JIS X 0212 Supplemental Kanji characters.
All the codes appearing after the Kanji out sequence ("ESC(B" by default) are processed as ASCII characters.
Data with a 1 in the MSB is processed as a JIS X 0201 RH Katakana character.
SI and SO, used for the 7-Bit JIS Codeset, are ignored.
Note
The 8-Bit JIS Codeset can be used for the terminal codes, but cannot be used to specify a locale name.
For details about the 8-Bit JIS Codeset, see JIS8(5).
A.1.3 Japanese EUC Codeset
The Extended UNIX Code (EUC) is a coding scheme that was extended by AT&T Bell Laboratories and made available worldwide. The Japanese EUC codeset is the result of applying the EUC to Japanese. The extension is based on the UNIX Internal Coding Scheme for Japanese that was defined in the UNIX System Japanese Features Proposal, proposed in 1985 by the AT&T Japanese UNIX System Consultative Committee.
The Japanese EUC Codeset consists of the following four code sets:
CS0 -- Any character belonging to CS0 is encoded so that the MSB is always set to 0 (zero). In any EUC worldwide, CS0 is defined as the ASCII character set.
CS1 -- Any character belonging to CS1 is encoded so that the MSB of each byte of the code is always set to 1. In the Japanese EUC Codeset, CS1 is JIS X 0208.
CS2 -- Any character belonging to CS2 is encoded so that, following the SS2 code (0x8e), the MSB of each byte of the code is always set to 1. In the Japanese EUC Codeset, CS2 is JIS X 0201 RH.
CS3 -- Any character belonging to CS3 is encoded so that, following the SS3 code (0x8f), the MSB of each byte of the code is always set to 1. In the Japanese EUC Codeset, CS3 is JIS X 0212.
Table A-6
summarizes encoding of the Japanese EUC Codeset
characters.
Table A-6: Japanese EUC Codeset Character Codes
Code Set | Corresponding Character Set | Encoding |
CS0 | ASCII | 0xxxxxxx |
CS1 | JIS X 0208 | 1xxxxxxx 1xxxxxxx |
CS2 | JIS X 0201 RH | SS2 1xxxxxxx |
CS3 | JIS X 0212 | SS3 1xxxxxxx 1xxxxxxx |
C0 Control Character | | 000xxxxx |
C1 Control Character | | 100xxxxx |
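The encoding in Table A-6 lets a reader assign each byte sequence to a code set by inspecting only the first byte. The following Python sketch illustrates this; the function name split_eucjp is hypothetical, SS2 and SS3 are taken to be 0x8e and 0x8f, the lead byte of a CS1 character is assumed to lie in 0xa1 to 0xfe, and C1 control characters other than SS2 and SS3 are rejected for brevity.

```python
SS2, SS3 = 0x8E, 0x8F


def split_eucjp(data: bytes):
    """Return a list of (code set, code bytes) pairs for Japanese EUC data.

    For CS2 and CS3 the returned bytes exclude the SS2 or SS3 prefix.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # MSB is 0: CS0 (ASCII, including C0 control characters).
            out.append(("CS0", data[i:i + 1]))
            i += 1
        elif b == SS2:
            # SS2 followed by one byte: CS2 (JIS X 0201 RH).
            out.append(("CS2", data[i + 1:i + 2]))
            i += 2
        elif b == SS3:
            # SS3 followed by two bytes: CS3 (JIS X 0212).
            out.append(("CS3", data[i + 1:i + 3]))
            i += 3
        elif 0xA1 <= b <= 0xFE:
            # Two bytes, both with the MSB set: CS1 (JIS X 0208).
            out.append(("CS1", data[i:i + 2]))
            i += 2
        else:
            raise ValueError(f"unexpected byte 0x{b:02x} at offset {i}")
    return out
```

Because the code set is determined by the first byte alone, no escape sequences or shift state are needed, unlike the 7-Bit JIS Codeset.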
The following fields of the undefined JIS X 0208 and 0212 fields can be assigned as the user-defined character fields by the vendor or user:
85th to 94th Ku fields of the undefined JIS X 0208 fields
78th to 94th Ku fields of the undefined JIS X 0212 fields
Note
JIS reserves the 78th to 84th fields of the undefined JIS X 0212 fields for future extension. Thus, it is possible for the JIS to assign characters to these fields that will collide with user-defined characters.
For details about the Japanese EUC Codeset, see eucJP(5).
A.1.4 DEC Kanji Codeset
The DEC Kanji Codeset consists of the following character sets:
ASCII or JIS X 0201 LH
DEC Kanji Character Set 1983
User-defined fields in DEC Extended Kanji Character Set
An ASCII character or JIS X 0201 LH is represented by a 1-byte character code. The MSB of each byte is always 0 (zero).
DEC Kanji Character Set 1983 consists of a total of 6,877 characters,
which include non-Kanji characters and Level-1 and Level-2 Kanji characters
(see
Table A-7).
One character is represented by 2 bytes,
and the MSB of each byte is always 1.
The Ku-Ten numbers of this character
set are the same as JIS X 0208-1983.
Table A-7: DEC Kanji Character Set 1983
Ku Number | Assignment | Characters |
1st to 8th | Non-Kanji characters including special characters, digits, Roman characters, Hiragana characters, Katakana characters, Greek characters, Russian characters, and ruled lines | 524 |
9th to 15th | JIS-reserved fields | |
16th to 47th | Level-1 Kanji characters | 2,965 |
48th to 84th | Level-2 Kanji characters | 3,388 |
85th to 94th | JIS-reserved fields | |
In the DEC Kanji Codeset, codes of 8,836 user-defined characters can be assigned to user-defined fields in the DEC Extended Kanji Character Set. Any character belonging to this extended character set contains a 1 in the MSB in the first byte and a 0 (zero) in the MSB in the second byte. The 32nd to 94th Ku fields are reserved by Compaq. Therefore, exceptional or user-defined characters can be assigned to the 1st to the 31st Ku fields in the DEC Extended Kanji Character Set.
Table A-8
summarizes DEC Kanji Codeset character
codes.
Table A-8: DEC Kanji Codeset Character Codes
Character Set | Character Code |
ASCII or JIS X 0201 LH | 0xxxxxxx |
JIS X 0208 | 1xxxxxxx 1xxxxxxx |
DEC Extended Kanji Character Set | 1xxxxxxx 0xxxxxxx |
C0 Control Character | 000xxxxx |
C1 Control Character | 100xxxxx |
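Table A-8 distinguishes the two 2-byte character sets by the MSB of the second byte. The following Python sketch applies that rule; the function name split_deckanji is hypothetical, and treating every byte from 0x80 to 0x9f as a single-byte C1 control character is an assumption drawn from the 100xxxxx pattern in the table.

```python
def split_deckanji(data: bytes):
    """Return a list of (character set, code bytes) pairs for DEC Kanji data."""
    out = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x20:
            out.append(("C0", data[i:i + 1]))
            i += 1
        elif first < 0x80:
            out.append(("ASCII or JIS X 0201 LH", data[i:i + 1]))
            i += 1
        elif first < 0xA0:
            out.append(("C1", data[i:i + 1]))
            i += 1
        else:
            if i + 1 >= len(data):
                raise ValueError("truncated 2-byte character")
            second = data[i + 1]
            # The MSB of the second byte selects the character set.
            if second & 0x80:
                kind = "JIS X 0208"
            else:
                kind = "DEC Extended Kanji"
            out.append((kind, data[i:i + 2]))
            i += 2
    return out
```

A second byte of a DEC Extended Kanji character can coincide with an ASCII code, so software that scans DEC Kanji data byte by byte must track character boundaries as this sketch does.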
For details about the DEC Kanji Codeset, see the Kanji Code Test and deckanji(5).
A.1.5 Super DEC Kanji Codeset
The Super DEC Kanji Codeset is an extended version of the DEC Kanji Codeset that enables CS2 and CS3 of the Japanese EUC to be handled.
The Super DEC Kanji Codeset consists of the following character sets:
ASCII or JIS X 0201 LH
DEC Kanji Character Set 1983
DEC Extended Kanji Character Set
JIS X 0201 RH
JIS X 0212-1990
The ASCII (or JIS X 0201 LH) character set, DEC Kanji Character Set 1983, and DEC Extended Kanji Character Set are encoded in the same manner as in the DEC Kanji Codeset.
Any character belonging to JIS X 0201 RH is represented by the 1-byte character data that follows SS2 (0x8e). The MSB of this byte is always 1.
Any character belonging to JIS X 0212-1990 is represented by the 2-byte character data that follows SS3 (0x8f). The MSB of each of these bytes is always 1.
Table A-9
summarizes the Super DEC Kanji
Codeset character codes.
Table A-9: Super DEC Kanji Codeset Character Codes
Character Set | Character Code |
ASCII or JIS X 0201 LH | 0xxxxxxx |
JIS X 0208 | 1xxxxxxx 1xxxxxxx |
DEC Extended Kanji Character Set | 1xxxxxxx 0xxxxxxx |
JIS X 0201 RH | SS2 1xxxxxxx |
JIS X 0212 | SS3 1xxxxxxx 1xxxxxxx |
C0 Control Character | 000xxxxx |
C1 Control Character | 100xxxxx |
In the Super DEC Kanji Codeset, exceptional or user-defined characters
can be assigned to all fields of the DEC Extended Kanji Character Set and
to the undefined fields of JIS X 0208 and JIS X 0212 (see
Table A-10).
Table A-10: Super DEC Kanji User-Defined Fields
Character Set | Assignable Ku Fields |
JIS X 0208 | 85th to 94th |
JIS X 0212 | 78th to 94th |
DEC Extended Kanji Character Set | 1st to 94th |
For details about the Super DEC Kanji Codeset, see sdeckanji(5).
A.1.6 Shift JIS Codeset
The Shift JIS Codeset is widely used in the personal computer world.
Table A-11
summarizes encoding of its character sets.
Table A-11: Encoding of Shift JIS Codes
Shift JIS Character Set | Encoding |
C0 Control Characters | 0x00 to 0x1f |
Space | 0x20 |
JIS X 0201 LH | 0x21 to 0x7e |
Del | 0x7f |
Undefined | 0x80 |
First byte of JIS X 0208 Kanji character | 0x81 to 0x9f |
Undefined | 0xa0 |
JIS X 0201 RH | 0xa1 to 0xdf |
First byte of JIS X 0208 kanji character | 0xe0 to 0xef |
First byte of user-defined character | 0xf0 to 0xfc |
Undefined | 0xfd to 0xff |
Table A-12
summarizes the correspondence
between the codes and Kanji characters and the correspondence between the
user-defined characters and Ku-Ten numbers.
Table A-12: Correspondence Between Kanji and User-Defined Characters and Ku-Ten Numbers
Kanji and User-Defined Characters | Ku-Ten Numbers |
0x8140 . . . 0x819e | JIS X 0208, 1st Ku, 1st Ten to 1st Ku, 94th Ten (except 0x817f) |
0x819f . . . 0x81fc | JIS X 0208, 2nd Ku, 1st Ten to 2nd Ku, 94th Ten |
0x8240 . . . 0x829e | JIS X 0208, 3rd Ku, 1st Ten to 3rd Ku, 94th Ten (except 0x827f) |
0x829f . . . 0x82fc | JIS X 0208, 4th Ku, 1st Ten to 4th Ku, 94th Ten |
. . . |
. . . |
0x9f9f . . . 0x9ffc | JIS X 0208, 62nd Ku, 1st Ten to 62nd Ku, 94th Ten |
0xe040 . . . 0xe09e | JIS X 0208, 63rd Ku, 1st Ten to 63rd Ku, 94th Ten (except 0xe07f) |
. . . |
. . . |
0xef40 . . . 0xef9e | JIS X 0208, 93rd Ku, 1st Ten to 93rd Ku, 94th Ten (except 0xef7f) |
0xef9f . . . 0xeffc | JIS X 0208, 94th Ku, 1st Ten to 94th Ku, 94th Ten |
0xf040 . . . 0xf0fc | User-defined characters (except 0xf07f) |
. . . |
. . . |
0xfc40 . . . 0xfcfc | User-defined characters (except 0xfc7f) |
In the Shift JIS Codeset, user-defined characters can be assigned to 0xf040 to 0xfcfc (unless the second byte is one of 0x00 to 0x3f, 0x7f, and 0xfd to 0xff).
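The correspondence in Table A-12 follows a regular pattern: each first byte covers two consecutive Ku, with second bytes 0x40 to 0x9e (skipping 0x7f) mapping to the odd Ku and 0x9f to 0xfc mapping to the even Ku. The following Python sketch computes the Ku-Ten number from a 2-byte Shift JIS code under that reading; the function name sjis_to_kuten is hypothetical, and user-defined codes (first byte 0xf0 to 0xfc) are deliberately rejected.

```python
def sjis_to_kuten(code: int):
    """Map a 2-byte Shift JIS code (for example 0x8140) to (Ku, Ten)."""
    hi, lo = code >> 8, code & 0xFF
    if 0x81 <= hi <= 0x9F:
        pair = hi - 0x81            # 0x81 holds the 1st and 2nd Ku
    elif 0xE0 <= hi <= 0xEF:
        pair = hi - 0xE0 + 31       # 0xe0 holds the 63rd and 64th Ku
    else:
        raise ValueError(f"0x{hi:02x} is not a JIS X 0208 first byte")
    if not 0x40 <= lo <= 0xFC or lo == 0x7F:
        raise ValueError(f"0x{lo:02x} is not a valid second byte")
    if lo <= 0x9E:
        ku = 2 * pair + 1
        # 0x7f is skipped, so codes above it are shifted down by one.
        ten = lo - 0x3F if lo < 0x7F else lo - 0x40
    else:
        ku = 2 * pair + 2
        ten = lo - 0x9E
    return ku, ten
```

For example, 0x8140 maps to the 1st Ku, 1st Ten, and 0xeffc maps to the 94th Ku, 94th Ten, matching the first and last JIS X 0208 rows of the table.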
For details about the Shift JIS Codeset, see shiftjis(5).
A.1.7 UTF-8 Codeset
The UTF-8 codeset
encodes the UCS-4 (4 octets) and UCS-2 (2 octets) data defined in ISO/IEC
10646 into a single codeset to represent the universal characters.
UTF-8 codes
are represented as variable-length codes in UCS interchange format and represent
any ASCII code by one byte.
In addition, the encoding is designed so that no byte of a multibyte
variable-length code takes the value of an ASCII code.
Table A-13: UCS-4 Encoding Ranges and UTF-8 Bit Assignments
Encoding Range (Hex) | UTF-8 Octet String (Binary) |
0000 0000-0000 007F | 0xxxxxxx |
0000 0080-0000 07FF | 110xxxxx 10xxxxxx |
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
0001 0000-001F FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
0020 0000-03FF FFFF | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
0400 0000-7FFF FFFF | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
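The bit assignments in Table A-13 can be applied mechanically: choose the row by range, fill the continuation octets with six bits each starting from the least significant bits, and place the remaining bits in the first octet. The following Python sketch does this for the full 31-bit range of the table; the function name ucs4_to_utf8 is hypothetical. Note that the current UTF-8 definition (RFC 3629) limits values to 0010 FFFF, so the 5- and 6-octet forms appear only in the original ISO/IEC 10646 definition.

```python
# (number of octets, upper limit of the range, fixed bits of the first octet)
ROWS = (
    (2, 0x7FF, 0xC0),
    (3, 0xFFFF, 0xE0),
    (4, 0x1FFFFF, 0xF0),
    (5, 0x3FFFFFF, 0xF8),
    (6, 0x7FFFFFFF, 0xFC),
)


def ucs4_to_utf8(value: int) -> bytes:
    """Encode one UCS-4 value as UTF-8 octets according to Table A-13."""
    if value < 0:
        raise ValueError("negative value")
    if value <= 0x7F:
        return bytes([value])       # ASCII is encoded as a single octet
    for count, limit, lead in ROWS:
        if value <= limit:
            tail = []
            for _ in range(count - 1):
                # Each continuation octet is 10xxxxxx with six payload bits.
                tail.append(0x80 | (value & 0x3F))
                value >>= 6
            # The remaining high-order bits go into the first octet.
            return bytes([lead | value] + tail[::-1])
    raise ValueError("value exceeds 7FFF FFFF")
```

Every octet of a multibyte result has its MSB set, which is why a byte-oriented search for an ASCII character never matches inside a multibyte sequence.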
For details about UTF-8, see Unicode(5).