This manual
describes Digital UNIX interfaces and utilities that help you develop internationalized
programs. These interfaces and utilities conform to specifications in the
X/Open UNIX standard. This standard allows for implementation-defined
behavior in certain areas. This manual identifies those software characteristics
that are vendor specific.
Language has implications for processing text for such things as character
handling and word ordering. Digital UNIX provides interfaces that allow
internationalized programs to manipulate text according to the language requirements
of individual users.
Language differences require the separation of message text from program
code. Digital UNIX provides facilities that allow message text to be separated
from the code, translated into different languages, and accessed by the program
at run time. Chapter 3 explains how an internationalized
program that uses the Worldwide Portability Interfaces (WPI) generates and
accesses messages.
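As a minimal sketch of run-time message access, the following C function assumes that the X/Open catopen(), catgets(), and catclose() routines are the interfaces in question. The function name get_message, the catalog name, and the set and message numbers are hypothetical and chosen only for illustration.

```c
#include <nl_types.h>
#include <stdio.h>

/*
 * Copy message msg_id of set set_id from the named catalog into out.
 * If the catalog or the message cannot be found, copy fallback instead.
 * The catalog name and the set and message numbers are hypothetical.
 */
void get_message(const char *catalog_name, int set_id, int msg_id,
                 const char *fallback, char *out, size_t out_size)
{
    /* NL_CAT_LOCALE asks catopen to honor the LC_MESSAGES setting. */
    nl_catd catalog = catopen(catalog_name, NL_CAT_LOCALE);

    if (catalog == (nl_catd)-1) {
        /* No catalog for this locale: use the built-in default text. */
        snprintf(out, out_size, "%s", fallback);
        return;
    }

    /* catgets returns the fallback pointer when the message is missing. */
    snprintf(out, out_size, "%s",
             catgets(catalog, set_id, msg_id, fallback));

    /* Copy before closing: the returned pointer may refer to catalog data. */
    catclose(catalog);
}
```

Passing a default string keeps the program usable even when no translated catalog is installed for the user's locale.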
An internationalized program that uses X and Motif interfaces can separate
message text from program code in the following ways:
For information about separating message text from program code for
X and Motif interfaces, refer to the following books:
An internationalized program cannot assume in advance how these formats are
set; instead, it uses system facilities to determine formats at run time. This
capability is provided through a language information database that programs
can query for the required formats of cultural data items.
To handle text recorded in different codesets, a program cannot make
assumptions about the size or bit assignment of character
encodings. In particular, the program cannot assume that any part of an area
used to store a character is available for other uses.
Locales do not solve all of the problems that localization must address.
For example, the localization process means making sure that translations
are available for software messages; that appropriate fonts and measurement
systems are supported and available for display and printing devices; and, in
some cases, that additional software is written to handle local requirements.
The mnemonic L10N is frequently
used as an abbreviation for localization.
Each locale contains collating-sequence information that informs string
comparison functions about the relative ordering of characters defined in
the associated codeset. Internationalized regular expressions also use the
collating sequence for implementing character ranges, collating symbols, and
equivalence classes.
Typically, internationalized programs read locale variables at run time
and use them to attach a particular instance of localization data to the programs'
operational environment. However, programs can also set these variables internally
when appropriate. Therefore, the binding to a particular locale need not be
the same for all parts of a program. Within one execution cycle, different
parts of the program can request different localizations.
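As a sketch of one part of a program requesting a different localization, the function below temporarily binds the LC_NUMERIC category to the "C" locale and then restores the previous binding. It uses the standard C setlocale() function; the function name parse_portable_number and the scenario (reading a data file that always uses a period as the radix character) are assumptions made for illustration.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse text that always uses '.' as the radix character, whatever the
 * user's locale is, then restore the previous LC_NUMERIC binding.
 * Returns 0 on success, -1 if the locale could not be saved or switched.
 */
int parse_portable_number(const char *text, double *value)
{
    const char *current = setlocale(LC_NUMERIC, NULL);
    char *saved;

    if (current == NULL)
        return -1;

    /* Copy the name: later setlocale calls may overwrite the string. */
    saved = malloc(strlen(current) + 1);
    if (saved == NULL)
        return -1;
    strcpy(saved, current);

    if (setlocale(LC_NUMERIC, "C") == NULL) {
        free(saved);
        return -1;
    }
    *value = strtod(text, NULL);

    /* Return this category to whatever the rest of the program uses. */
    setlocale(LC_NUMERIC, saved);
    free(saved);
    return 0;
}
```

A program would typically call setlocale(LC_ALL, "") once at startup to bind to the locale named in the environment, and use a function like this one only where a fixed format is required. Note that setlocale() affects the whole process, so this approach is not suitable for code that runs in multiple threads.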
A character string or string is a contiguous sequence of
bytes terminated by and including the null byte. A string is an array of
type char in the C programming language. The null byte is a value
with all bits set to zero (0).
A wide character is an
integral type that is large enough to hold any member of the extended execution
character set. In program terms, a wide character is an object of type wchar_t, which is defined in the header files /usr/include/stddef.h (for conformance to the X/Open UNIX standard) and /usr/include/stdlib.h (for conformance to the ANSI C standard). The file locations where
this data type is defined are determined by standards organizations; however,
the definition itself is implementation specific. For example, implementations
that support only single-byte codesets (not the case for Digital UNIX)
might define wchar_t as a byte value.
A wide-character string
is a contiguous sequence of wide characters terminated by and including the
null wide character. A wide-character string is an array of type wchar_t. The null wide character is a wchar_t value with all bits
set to zero (0).
An empty string
is a character string whose first element is the null byte. Similarly, an empty wide-character string is a wide-character string whose
first element is the null wide character.
The Portable Character Set as defined by X/Open is similar to the basic
source and basic execution character sets defined in ISO/IEC 9899:1990, except that the X/Open version also includes the dollar sign
($), commercial at sign (@), and grave accent (`) characters.
Some locales (for example, ISO 646 variants) may make substitutions
for one or more of the preceding characters. In such cases, the substituted
character has the same syntactic meaning as the character it replaces in the
Portable Character Set. An example of a character substitution is the
British pound sign (£) in place of the default number sign (#).
1.1 Language
An internationalized program makes no assumptions about the
language of character data (text) that the program is designed to handle.
The term data refers to data generated internally,
data extracted from or written to files, and message text used for communication
with the program's user.
1.2 Cultural Data
Cultural data refers to the conventions of a geographic area or
territory for such things as date, time, and currency
formats.
1.3 Character Sets
A character set is a set of alphabetic or other characters used to
construct the words and other elementary units of a native language or
computer language. A coded character set (or codeset) is a set of unambiguous
rules that establishes a character set and the one-to-one relationship between
each character of the set and its bit representation.
1.4 Localization
Localization refers to the process of implementing local
requirements within a computer system.
Some of these requirements are addressed by locales. Each locale is a set of data that supports a particular
combination of native language, cultural data, and codeset. The type of information
a locale can contain and the interfaces that use a locale are subject to standardization.
However, where locales reside on the system and how they are named can vary
from one vendor to another.
1.4.1 Collating Sequence
The ordering of characters may be implicit in underlying hardware but
can be defined for software to conform to the way language is used in a particular
territory. Many languages have more complex rules for sorting than English.
The following list shows why some English assumptions about character sorting
do not apply to other languages:
1.4.2 Character Classification
Character classification information provides details about the type
of character associated with each valid character code; that is, whether the
code defines an alphabetic, uppercase, lowercase, punctuation, control, space,
or other kind of character. Both character classification functions and internationalized
regular expressions use this information to determine character classes.
1.4.3 Case Conversion
Case conversion refers to information that identifies the possible alternative
case of each valid character code. Case conversion functions use this information
to change characters from uppercase to lowercase or from lowercase to uppercase.
Note that in some languages case does not apply to every letter, and in
others it does not apply to any characters at all.
1.4.4 Language Information
Language information (or langinfo database) refers to localization data
that describes the format and setting of cultural data that can vary from one
locale to another. This information includes the appropriate formats and
characters for date and time, currency, and numeric values.
1.4.5 Message Catalogs
A message catalog is a file or storage area that contains program
messages, command prompts, and responses to
prompts for a particular language. Motif applications also use resource files
and User Interface Language (UIL) files in addition to or in place of message
catalogs for text and other values that can vary from one locale to another. Chapter 3 describes the messaging system.
1.5 Language Announcement
Language announcement is the mechanism by which language, cultural data,
and codeset requirements are set, whether for the system as a whole, by an
application, or by individual users. Language announcement is performed by
setting a locale name in a set of reserved environment variables. On Digital
UNIX systems, system managers can set the default values of these variables
for different shell environments; refer to the System Administration book for
information about setting locale defaults for shells. Users can also set
locale variables on a per-process basis.
1.6 Terms and Definitions
This section defines terms used extensively in this guide. Less common
terms are defined when they first appear.
1.6.1 Characters and Strings
A character is a sequence of one or more bytes that represent a
single graphic symbol or control code. Do not confuse the term character with the C programming language data type char, which represents an object large enough to store any member of the
basic execution character set and which is usually mapped as an 8-bit value.
Unlike the char data type in C, a character as used in this guide
can be represented by a multibyte or
single-byte value. The expression multibyte character is synonymous with the term character; that is, both refer to character values of any length, including
single-byte values.
1.6.2 Portable Character Set
The Portable Character Set
(PCS) is supported in both compile-time
(source) and run-time (executable) environments. The PCS contains:
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~