One of the primary functions of most computer programs is to manipulate data, some or all of which may involve interaction between the program and a computer user. In commercial situations, it is important that such interactions take place in the native language of each user. Cultural data should also observe the correct customs.
When you write programs to support multilanguage operation, you must also consider the fact that languages can be represented within the computer system by one or more codesets. Because of the requirements of different languages, characters in codesets may vary in both size (8-bit, 16-bit, and so on) and binary representation.
You can satisfy the preceding requirements by writing programs that make no hard-coded assumptions about language, cultural data, or character encodings. Such programs are said to be internationalized. Data specific to each supported language, territory, and codeset combination are held separately from the program code and can be bound to the run-time environment by language-initialization functions.
Digital UNIX provides the following facilities for developing internationalized software, defining localization data, and announcing specific language requirements:
The rest of this chapter describes each of these facilities in more detail.
The discussion and examples in this chapter focus on functions provided in the Standard C Library. Refer to Chapter 4 and Chapter 5 for information about using functions in the curses, X, and Motif libraries.
The ISO codesets cover the major European languages. Several
of these codesets allow for the mixing of major
languages within a single codeset. All ISO codesets are a superset of the
ASCII codeset and therefore allow systems to support languages other than
English without invalidating existing software that is not internationalized.
The Digital UNIX operating system provides locales that use the
ISO 8859-1 (Latin 1) and ISO 8859-7 (Latin/Greek) codesets.
Subsets that support localized variants of the operating system may include
locales based on additional ISO codesets. For example, the optional language
variant subsets included with Digital UNIX to support Czech, Hungarian,
Polish, Russian, Slovak, and Slovene provide locales based on the ISO 8859-2
(Latin 2) codeset. Following is a complete list of ISO codesets, along with
the languages that they support:
ISO 8859-1: Western European languages, including Catalan
ISO 8859-2: Eastern European languages
ISO 8859-3: Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian,
Maltese, Spanish, and Turkish
ISO 8859-4: Danish, English, Estonian, Finnish, German, Greenlandic, Lappish,
Latvian, and Lithuanian
ISO 8859-5: Bulgarian, Byelorussian, English, Macedonian, Russian,
Serbo-Croatian, and Ukrainian
ISO 8859-6: Arabic
ISO 8859-7: Greek
ISO 8859-8: Hebrew
ISO 8859-9: Danish, Dutch, English, Finnish, French, German, Irish, Italian,
Norwegian, Portuguese, Spanish, Swedish, and Turkish
ISO 8859-10: Danish, English, Estonian, Finnish, German, Greenlandic,
Icelandic, Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese, and Swedish
Another ISO codeset supported by utilities on a standard Digital UNIX
operating system is ISO 6937:1983. This codeset, which accommodates both
7-bit and 8-bit characters, is used for text communication over communication
networks and interchange media, such as magnetic tape and disks.
The codesets discussed up to this point address the requirements of
languages whose characters can be stored in a single byte. Such codesets
do not meet the needs of Asian languages, whose characters can occupy multiple
bytes. Digital UNIX supplies the following codesets through subsets that
support Asian languages and countries:
These codesets are supplied when you install Asian-language
variant subsets of the Digital UNIX product. A specialized terminal driver
and associated utilities must be available on your system to support the input
and display of Asian characters at run time. These components are also supplied
when you install one of the Asian-language variant subsets.
The Unicode and ISO/IEC 10646 standards
specify the Universal Character Set (UCS), a character set that allows character
units to be processed for all languages, including Asian languages, by using
the same set of rules. Digital UNIX supports the UCS-4 encoding of this
character set. An application parses UCS-4 character encoding in 32-bit
units.
Reference pages are available for all the character sets that Digital UNIX
supports. For more information on a particular character set, refer to its
reference page.
The following sections discuss important issues that affect the way
you write source code when your program must process characters in different
codesets:
Another legacy of the historical dependence of UNIX systems on
7-bit ASCII character encoding is that some programs use the most
significant bit of a byte for their own internal purposes. This was a dubious
programming practice, although quite safe when characters in the underlying
codeset always mapped to the remaining 7 bits of the byte. In the world of
international codesets, the practice of using the most significant bit of
a byte for program purposes must be avoided.
This condition assumes that lowercase a is always represented
by a fixed octal value, which may not be true for all codesets. The following
statement represents an improvement in that it substitutes a character constant
for the octal value:
This example still presents problems, however, because the getchar() function operates on bytes. The statement would not work correctly
if the next character in the input stream were a multibyte value.
The following statement substitutes the getwchar() function for
the getchar() function. The statement works correctly with any
codeset because a is a member of the Portable Character Set and
is transformed into the same wide-character value in all locales.
The X/Open UNIX standard specifies that each member of the source character
set and each escape sequence
in character constants and string literals is converted to the same member
of the execution character set in all locales. It is therefore safe for you
to use any of the characters in the Portable Character Set as a character
constant or in string literals. Note that non-English characters are not
included in the Portable Character Set and may not translate correctly when
used as literals. Consider the following example:
The accented character [agrave] may not be represented in
the codeset's source character set, execution character set, or both; or the
binary value of the accented character may not be translatable from one set
to the other. When source files specify non-English characters in constants,
the results are undefined.
The following example shows how to construct a test for a constant that,
for whatever reason, may be a non-English character. The constant has been
defined in a message catalog with the symbolic identifier MSG_ID.
Statements in the example retrieve the value for MSG_ID from the
message catalog, which is locale specific and bound
to the program at run time.
The catgets() function returns a value as an array of bytes,
so the value is returned to the schar variable. If the accented
character is not available in the locale's codeset, the test is made against
the unaccented base character (a).
If schar does not contain a valid multibyte character, the mbtowc() call
returns -1 and the program signals an error.
Refer to Chapter 3 for more information about
message catalogs and the catgets() function. See Section 2.1.4
for information about converting multibyte characters and strings to wide-character
data that your program can process.
Multibyte encoding is typically the encoding used when
data is stored in a file or generated for external use or data interchange.
Multibyte encoding has the following disadvantages:
Because of the disadvantages of multibyte encoding, wide-character encoding,
which allocates a fixed number of bytes per character, is typically used for
internal processing by programs; in fact, internal
process code is another way of referring to data in wide-character
format.
The size of a wide character
varies from one system implementation to another. On Digital UNIX systems,
the default size for a wide character is set to 4 bytes (32 bits), a setting
that optimizes performance for the Alpha processor.
Library routines that print, scan, input, or output text have the capability
of automatically converting data from multibyte characters to wide characters
or from wide characters to multibyte characters, as appropriate for the operation.
However, applications almost always have additional statements or requirements
for which conversion to and from multibyte characters needs to be explicit.
The following example is from a program module that reads records from
a database of employee data. In this case, the programmer wants to apply a
locale-independent format to a
record retrieved from a data file and uses the mbstowcs() function to explicitly convert an employee's first
and last names from multibyte-character to wide-character encoding.
Refer to Section A.9 for a complete
list of functions that work with multibyte data directly.
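The explicit conversion described above can be sketched as a small helper. This is only an illustration, not code from the sample program: the name mb_to_wide() is hypothetical, and the sketch assumes the X/Open behavior in which mbstowcs() called with a null destination returns the number of wide characters that the conversion would produce.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: convert a multibyte string to a newly allocated
 * wide-character string, using the codeset of the current locale.
 * Returns NULL if the input contains an invalid multibyte sequence or
 * if memory cannot be allocated. The caller frees the result.
 */
wchar_t *mb_to_wide(const char *mbs)
{
    /* Assumption: with a NULL destination, mbstowcs() only counts. */
    size_t len = mbstowcs(NULL, mbs, 0);
    if (len == (size_t)-1)
        return NULL;

    wchar_t *wcs = malloc((len + 1) * sizeof(wchar_t));
    if (wcs == NULL)
        return NULL;

    /* Convert, leaving room for the terminating null wide character. */
    mbstowcs(wcs, mbs, len + 1);
    return wcs;
}
```

Sizing the buffer from the first call avoids assuming any fixed ratio between bytes and characters, which is the assumption that multibyte codesets break.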
The source variant of a codeset must observe the following additional
rules:
The C language
compiler also supports trigraph sequences when you
specify the -std1 or -std flag on the cc
command line. Trigraph sequences, which are part of the ANSI C specification,
allow users to enter the full range of basic characters in programs, even
if their keyboards do not support all characters in the source codeset. The
following trigraph sequences are currently defined, each of which is replaced
by the corresponding single character:
2.1 Using Codesets
In the past, most UNIX systems were based on the 7-bit ASCII codeset.
However, most languages other than English include
characters in addition to those contained in the ASCII codeset. The X/Open
UNIX standard does not require an operating system to supply any particular
codesets in addition to ASCII. The standard does specify requirements for the
interfaces that manipulate characters so that programs are able to handle
characters from whatever codeset is available on a given system.
2.1.1 Ensuring Data Transparency
As discussed in Section 2.1, internationalized
software must accommodate a wide variety of
character-encoding schemes. Programs cannot assume that a particular codeset
is on all systems that conform to requirements in the X/Open UNIX CAE specifications,
nor that individual characters occupy a fixed number of bits.
2.1.2 Using In-Code Literals
When writing internationalized software, using in-code literals
can cause problems. Consider, for example, the following conditional
statement:
if ((c = getchar()) == 0141)
if ((c = getchar()) == 'a')
if ((c = getwchar()) == L'a')
if ((c = getwchar()) == L'[agrave]')
.
.
.
char *schar; (1)
wchar_t wchar; (2)
.
.
.
schar = catgets(catd,NL_SETD,MSG_ID,"a"); (3)
if (mbtowc (&wchar,schar,MB_CUR_MAX) == -1) (4)
error();
if ((c = getwchar()) == wchar) (5)
.
.
.
2.1.3 Manipulating Multibyte Characters
Digital UNIX provides
all the interfaces (such as putwc(), getwc(), fputws(), and fgetws()) that are needed to support multibyte
Asian codesets. Language variant subsets of the operating system must be installed
to supply the locales and facilities that make this support operational.
On systems where multibyte locales are not available, or are available and
not bound to the program at run time, the *ws* and *wc*
functions are merely synonyms for the associated single-byte functions (such
as putc(), getc(), fputs(), and fgets()). The interfaces provided for multibyte support are therefore appropriate
for use with all locales, not just those with multibyte characters.
2.1.4 Converting Between Multibyte-Character and Wide-Character Data
Some languages, particularly Asian languages, can be encoded as either
multibyte-character or wide-character data.
/*
* The employee record is normalized with the following format, which
* is locale independent: Badge number, First Name, Surname,
* Cost Center, Date of Join in the `yy/mm/dd' format. Each field is
* separated by a TAB. The space character is allowed in the First
* Name and Surname fields.
*/
static const char *dbOutFormat = "%ld\t%S\t%S\t%S\t%02d/%02d/%02d\n";
static const char *dbInFormat = "%ld %[^\t] %[^\t] %S %02d/%02d/%02d\n";
.
.
.
sscanf(record, dbInFormat,
&emp->badge_num,
firstname,
surname,
emp->cost_center,
&emp->date_of_join.tm_year,
&emp->date_of_join.tm_mon,
&emp->date_of_join.tm_mday);
(void) mbstowcs(emp->first_name, firstname, FIRSTNAME_MAX+1);
(void) mbstowcs(emp->surname, surname, SURNAME_MAX+1);
.
.
.
2.1.5 Rules for Multibyte Characters in Source and Execution Codesets
You should be
aware that both the source and execution character set variants of the same
codeset can contain multibyte characters. The encodings do not have to be
the same, but both set variants must observe the following rules:
Trigraph Sequence | Single Character |
---|---|
??= | # |
??( | [ |
??/ | \ |
??' | ^ |
??< | { |
??) | ] |
??! | | |
??> | } |
??- | ~ |
In the past, many programs
classified characters according to whether the character's value fell between
certain numerical limits. For example, the following statement tests for all
uppercase alphabetic characters:
This statement is valid for the ASCII codeset, in which all uppercase
letters have values in the range 0x41 to 0x5a (A to
Z). However, the statement is not valid for the codeset ISO 8859-1:1987, in
which uppercase letters occupy the ranges 0x41 to 0x5a, 0xc0 to 0xd6, and 0xd8 to 0xdf. In
the EBCDIC codeset, character values are different again and, in this case,
even the uppercase English letters have a different encoding.
When you write internationalized programs, classify characters by calling
the appropriate internationalization function. For example:
Internationalization functions classify wide-character code values according
to type information in the user's locale and are independent of
the language and codeset. Refer to Section A.2 for
a complete list and description of character classification functions.
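As a minimal sketch of locale-based classification, the following helper counts uppercase letters in a wide-character string. The name count_upper() is hypothetical, and the sketch assumes that the calling program has already bound a locale with setlocale(); without that call, the default "C" locale rules apply.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: count the uppercase letters in a wide-character
 * string. iswupper() consults the LC_CTYPE category of the current
 * locale, so no numeric range of character codes is assumed.
 */
size_t count_upper(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++) {
        if (iswupper((wint_t)*ws))
            n++;
    }
    return n;
}
```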
The preceding statements are not safe to use in internationalized programs
because they:
The correct way to handle
case conversion is to call the towlower() function for conversion
to lowercase and the towupper() function for conversion to uppercase.
For example:
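A minimal sketch of locale-aware case conversion over a whole string follows. The helper name wcs_to_upper() is hypothetical; the sketch assumes the string is already in wide-character format and that the program has bound the user's locale.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: convert a wide-character string to uppercase in
 * place. towupper() uses the case mapping of the current locale and
 * returns its argument unchanged when no uppercase form exists.
 */
void wcs_to_upper(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```

Unlike the bit-mask statements, this loop leaves digits, spaces, and punctuation untouched because the mapping is looked up rather than computed.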
Certain languages, however, require
collation algorithms that must make multiple passes through the codeset. Multiple
passes may be required for the following reasons:
String comparison in an international environment thus depends on the
codeset and language. This dependency means that additional
functions are required to
compare strings according to collating sequence information in the user's
locale. These functions include:
If two strings are being
compared only for equality, you can use strcmp() or wcscmp(), which are faster in most environments than wcscoll().
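The following sketch shows one way to sort an array of strings by the collating sequence of the current locale, using strcoll() as the comparison function for qsort(). The names compare_by_collation() and sort_by_collation() are hypothetical, and the sketch assumes the caller has already bound a locale with setlocale().

```c
#include <stdlib.h>
#include <string.h>

/* qsort() callback: compare two strings by the locale's LC_COLLATE rules. */
static int compare_by_collation(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Hypothetical helper: sort an array of string pointers in place. */
void sort_by_collation(const char **names, size_t count)
{
    qsort(names, count, sizeof(names[0]), compare_by_collation);
}
```

Only the ordering step needs strcoll(); a later test for exact equality between two entries can still use strcmp().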
You cannot make assumptions about cultural data when writing internationalized
programs. Your program must operate according to the local customs of users.
The X/Open UNIX standard specifies that this requirement be met through a
database of cultural data items that a program can access at run time, plus
a set of associated interfaces. The following sections discuss this database
and the functions used to extract and process its data items.
In the
following example, the strftime() function generates a date string
as defined by the D_FMT item in the langinfo database:
The buf argument is a pointer to a string variable in which
the date string is returned. The size argument contains the maximum
size of buf. The "%x" argument specifies conversion
specifications, similar to the format strings used with the printf() and scanf() functions. The "%x" argument is
replaced in the output string by the date representation appropriate for the locale.
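The two ways of obtaining a locale-specific layout can be sketched as a pair of small wrappers. The function names are hypothetical; both assume that the caller has bound a locale with setlocale() and filled in the tm structure, for example with localtime().

```c
#include <langinfo.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: format the date with the locale's "%x" layout. */
size_t format_local_date(char *buf, size_t size, const struct tm *tm)
{
    return strftime(buf, size, "%x", tm);
}

/*
 * Hypothetical helper: format date and time with the D_T_FMT string
 * that nl_langinfo() returns for the current locale.
 */
size_t format_local_date_time(char *buf, size_t size, const struct tm *tm)
{
    return strftime(buf, size, nl_langinfo(D_T_FMT), tm);
}
```

Both wrappers return the value of strftime(), which is 0 when the result does not fit in buf, so callers should check it before printing the buffer.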
The following example shows how to use strftime() and nl_langinfo() in combination to generate a date and time string. Assume
that the same calls to the setlocale(), time(), and localtime() interfaces have been made here as shown in the preceding
example.
The only difference is that a call to nl_langinfo() has replaced
the format string argument in the call to strftime():
To convert a string to a date/time value, the reverse of the operation
performed by strftime(), you can use the strptime() function. The strptime() function supports a number of conversion
specifiers that behave in a locale-dependent manner.
The money program demonstrates how the strfmon()
function works. The source file for this sample program is available in the /usr/i18n/examples/money directory.
You can use the following features, which are contained in the structure lconv, in program-defined routines:
These considerations are discussed in the following sections.
The X/Open UNIX standard specifies a native-language message system
that contains a definition of message
text source files, the gencat command to generate message catalogs
from these source files, and a set of Standard C Library functions to retrieve
individual messages from one or more catalogs at run time.
The following example shows how an internationalized program
retrieves a message from a catalog:
A message catalog can contain one or more message sets and individual
messages are ordered within each set.
This descriptor is returned by the function that opens the catalog.
The descriptor is also passed as an argument to the function that closes the
catalog.
The locale name set for the LC_MESSAGES variable is the locale
used by the catopen() and catgets() functions in this
example. Typically, the system manager or user sets only the LANG
environment variable to a particular locale name and the same locale name
is used for LC_MESSAGES.
The NL_CAT_LOCALE argument specifies that the program will
use the locale name set for LC_MESSAGES.
The catopen() function uses the value set for the NLSPATH environment
variable to determine the location of the message catalog. The call returns
the message catalog descriptor to the catd variable.
The first argument to this call is a call to the catgets()
function, which retrieves the appropriate text for the message with the HELLO_MSG identifier. This message is contained in the message set identified
by the SETN constant. Note that the catgets() function
allows one message translation to be held within the program source. This
is the translation that will be used in the event that the program cannot
get the message from the message catalog.
Refer to Chapter 3 for information about creating
and using message catalogs.
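The fallback behavior of catgets() can be made explicit in a small wrapper, sketched here under stated assumptions. The name message_or_default() is hypothetical, and the check against (nl_catd)-1 reflects the value that catopen() returns on failure.

```c
#include <nl_types.h>

/*
 * Hypothetical helper: return the catalog text for a message, or the
 * in-code default when the catalog could not be opened or does not
 * contain the message.
 */
const char *message_or_default(nl_catd catd, int set_id, int msg_id,
                               const char *fallback)
{
    /* catopen() returns (nl_catd)-1 when no catalog was found. */
    if (catd == (nl_catd)-1)
        return fallback;

    /* catgets() itself returns the default when the lookup fails. */
    return catgets(catd, set_id, msg_id, fallback);
}
```

The returned pointer may refer to storage owned by the catalog, so a program that needs the text after calling catclose() should copy it first.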
Consider the following example:
To overcome the problems imposed by fixed ordering of message elements, the format
specifiers for the printf() routine have been extended so that
format conversion applies to the nth argument in an argument list
rather than to the next unused argument. To apply the format conversion extension,
replace the % conversion character with the sequence %digit$, where digit specifies the position
of the argument in the argument list. The following example illustrates how
the programmer applies this feature to the format string "%s is owned
by %s\n":
The construction of the string "%1$s is owned by %2$s", which is the
default value for the WRONG_OWNER_MSG entry in the program's message
file, can then be changed by the translator to the non-English equivalent
of:
Consider the following example:
The format string in this statement is
governed by the assumption that all users use a United States English format
(mm/dd/yy) to input dates. In an internationalized program, you use extended
format specifiers to support requirements that language may impose on the
order of string elements. For example:
The default "%1$d/%2$d/%3$d" value for the DATE_STRING message
is still appropriate only for countries where users use the format mm/dd/yy to enter
dates. However, for countries in which the order or formatting would be different,
the translator can change the entry in the program's message file. For example:
The category argument can be one of the following:
The locale_name argument is one of the following values:
To make sure that the variable is large enough to accommodate locale
names on different systems, you should set its maximum size to the constant BUFSIZ, which is defined by the system vendor in /usr/include/stdio.h.
The CAT_NAME constant is defined in the program's own header
file.
The NL_SETD constant specifies the default message set number
in a message catalog and is defined in /usr/include/nl_types.h.
The identifier LOCALE_PROMPT_MSG specifies the prompt string translation
in the default message set.
Sometimes
a program needs to vary the locale only for a particular category of data.
For example, consider a program that processes different country-specific
files that contain monetary values. Before processing data in each file, the
program might reinitialize a program variable to a new locale name and then
use that variable value to reset only the LC_MONETARY category
of the locale.
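Resetting a single category can be sketched as follows. The helper name use_monetary_locale() is hypothetical; the sketch relies only on the documented behavior that setlocale() returns a null pointer, and leaves the current setting unchanged, when the requested locale is not available.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: bind only the LC_MONETARY category to the named
 * locale. Returns 1 on success, or 0 if the system does not supply the
 * locale, in which case the previous setting stays in effect.
 */
int use_monetary_locale(const char *locale_name)
{
    return setlocale(LC_MONETARY, locale_name) != NULL;
}
```

Because the other categories are untouched, message text and character classification continue to follow the user's own locale while monetary formatting follows the file being processed.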
2.1.6 Classifying Characters
Another feature of program
operation that depends on the codeset is character classification; that is,
determining whether a particular character code refers to an uppercase alphabetic,
lowercase alphabetic, digit, punctuation, control, or space character.
if (c >= 'A' && c <= 'Z')
if (iswupper (c))
2.1.7 Converting Characters
You can do case conversion
of ASCII characters with statements like the following ones, which convert
the character in a_var first to lowercase and then to uppercase:
a_var |= 0x20;
.
.
.
a_var &= 0xdf;
a_var = towlower(a_var);
These functions use information specified
in the user's locale and are independent of the codeset where characters are
defined. The functions return the argument unchanged if input is invalid.
Refer to Section A.3 for more detailed discussion
of case conversion functions.
.
.
.
a_var = towupper(a_var);
2.1.8 Comparing Strings
UNIX systems have always provided functions
for comparing character strings. The following statement, for example, compares
the strings s1 and s2, returning an integer greater
than, equal to, or less than zero, depending on whether the value of s1 is greater than, equal to, or less than the value of s2
in the machine-collating sequence:
.
.
.
int cmp_val;
char *s1;
char *s2;
.
.
.
cmp_val = strcmp(s1, s2);
.
.
.
2.2 Handling Cultural Data
Cultural
data refers to items of information that can vary between languages or territories.
For example:
2.2.1 The langinfo Database
The language
information database, named langinfo, contains items that represent
the cultural details of each locale supported on the system. The langinfo database on Digital UNIX systems contains the following information
for each locale, as required by the X/Open UNIX standard:
2.2.2 Querying the langinfo Database
You can extract
cultural data items from the langinfo database by calling the nl_langinfo() function. This function takes an item argument
that is one of several constants defined in the header file /usr/include/langinfo.h. The function returns a pointer to the string with the associated value
for item. The following example shows a call to nl_langinfo() that extracts the string for formatting date and time information.
This
value is associated with the constant D_T_FMT.
nl_langinfo(D_T_FMT);
2.2.3 Generating and Interpreting Date and Time Strings That Observe Local Customs
Programs often generate date and time strings. Internationalized programs
generate strings that observe the local customs of the user. You can
meet this requirement by calling the strftime() function, which
makes indirect use of the langinfo database.
.
.
.
setlocale(LC_ALL, ""); (1)
.
.
.
clock = time((time_t*)NULL); (2)
tm = localtime(&clock); (3)
.
.
.
strftime(buf, size, "%x", tm); (4)
puts(buf); (5)
.
.
.
.
.
.
strftime(buf, size, nl_langinfo(D_T_FMT), tm);
puts(buf);
.
.
.
2.2.4 Formatting Monetary Values
The strfmon()
function formats monetary values according to information in the locale that
is bound to the program at run time. For example:
strfmon(buf, size, "%n", value); (1)
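A minimal, self-contained sketch of the same call follows. The wrapper name format_money() is hypothetical, and the sketch assumes that the program has bound the user's locale before the call so that the "%n" conversion picks up the national currency format.

```c
#include <monetary.h>
#include <sys/types.h>

/*
 * Hypothetical helper: format a monetary value according to the
 * LC_MONETARY category of the current locale. Returns the number of
 * bytes placed in buf, or -1 if the result does not fit.
 */
ssize_t format_money(char *buf, size_t size, double value)
{
    return strfmon(buf, size, "%n", value);
}
```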
2.2.5 Formatting Numeric Values in Program-Specific Ways
You may want to perform
your own conversions of numeric quantities, monetary or otherwise, by using
specific formatting details in the user's locale. The localeconv()
function, which has no arguments, returns all the number formatting details
defined in the locale to a structure declared in your program. For example:
struct lconv *app_conv;
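As an illustration of reading one field from the structure, the following sketch returns the radix character of the current locale. The helper name current_radix() is hypothetical; decimal_point is a standard member of struct lconv.

```c
#include <locale.h>

/*
 * Hypothetical helper: return the radix (decimal point) string that the
 * current locale defines for non-monetary numbers. The string belongs
 * to the library and may be overwritten by later localeconv() or
 * setlocale() calls, so copy it if it must be kept.
 */
const char *current_radix(void)
{
    struct lconv *app_conv = localeconv();

    return app_conv->decimal_point;
}
```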
2.2.6 Using the langinfo Database for Other Tasks
Functions in addition to the ones discussed so far use the langinfo database to determine settings for specific items of cultural data.
For example, the scanf(), printf(), and wcstod() functions determine the appropriate radix character from information
in the langinfo database.
2.3 Handling Text Presentation and Input
The language of the program user affects:
2.3.1 Creating and Using Messages
Programs need to communicate with users in their own language.
This requirement places
some constraints on the way program messages are defined and accessed. More
specifically, messages are defined in a file that is independent of the program
source code and are not compiled into object files. Because messages are in
a separate file, they can be translated into different languages and stored
in a form that is linked to the program at run time. Programs can then retrieve
message text translations that are appropriate for the user's language.
#include <stdio.h> (1)
#include <locale.h> (2)
#include <nl_types.h> (3)
#include "prog_msg.h" (4)
main()
{
nl_catd catd; (5)
setlocale(LC_ALL, ""); (6)
catd = catopen("prog.cat", NL_CAT_LOCALE); (7)
puts(catgets(catd, SETN, HELLO_MSG, "Hello, world!")); (8)
catclose(catd); (9)
}
.
.
.
2.3.2 Formatting Output Text
Successful translation of messages into different languages depends not only
on making messages independent of the program source code but also on careful
construction of message strings within the program.
printf(catgets(catd, set_id, WRONG_OWNER_MSG,
"%s is owned by %s\n"),
folder_name, user_name);
The preceding statement uses a message catalog but assumes a particular language
construction (a noun followed by a verb in passive voice followed by a noun).
Passive-verb constructions are not part of all languages; therefore,
message translation might mean printing user_name before folder_name. In other words, the translator might need to change the
construction of the message so that the user sees the translated equivalent
of "John_Smith owns JULY_REVENUE" rather than "JULY_REVENUE
is owned by John_Smith."
printf(catgets(catd, set_id, WRONG_OWNER_MSG,
"%1$s is owned by %2$s\n"),
folder_name, user_name);
WRONG_OWNER_MSG "%2$s owns %1$s\n"
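The effect of the reordered entry can be checked with a short sketch that formats into a buffer instead of printing. The helper name format_owner_message() is hypothetical; in a real program the fmt argument would come from catgets() rather than from the caller.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the ownership message from a format string
 * that uses numbered conversions. The argument order is fixed in the
 * code; the format string alone decides the order in the output.
 */
int format_owner_message(char *buf, size_t size, const char *fmt,
                         const char *folder_name, const char *user_name)
{
    return snprintf(buf, size, fmt, folder_name, user_name);
}
```

Called with "%1$s is owned by %2$s\n" the folder name comes first; called with "%2$s owns %1$s\n" the user name comes first, with no change to the program code. A format string should use numbered conversions for all of its arguments or for none of them.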
2.3.3 Scanning Input Text
The string construction issues that are discussed for output text in Section 2.3.2
also apply to input text. For example, conventions for the order in which
users specify the elements of a date differ from country to country, as do
the characters that delimit parts of monetary or other numeric strings.
Therefore, the scanf() family of functions also supports extended format
conversion specifiers to allow for variation in the way that users enter
elements of a string.
.
.
.
int day;
int month;
int year;
.
.
.
scanf("%d/%d/%d", &month, &day, &year);
.
.
.
.
.
.
scanf(catgets(catd, NL_SETD, DATE_STRING,
"%1$d/%2$d/%3$d"), &month, &day, &year);
.
.
.
DATE_STRING "%2$d/%1$d/%3$d"
DATE_STRING "%2$d.%1$d.%3$d"
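The translated entries above can be exercised with a short sketch that scans from a string rather than from standard input. The helper name parse_date() is hypothetical; in a real program the fmt argument would be the value that catgets() returns for DATE_STRING.

```c
#include <stdio.h>

/*
 * Hypothetical helper: scan a date with a format string that uses
 * numbered conversions. The pointers are always passed in the order
 * month, day, year; the format decides which input field fills each.
 * Returns 1 if all three fields were read, 0 otherwise.
 */
int parse_date(const char *input, const char *fmt,
               int *month, int *day, int *year)
{
    return sscanf(input, fmt, month, day, year) == 3;
}
```

With the format "%2$d.%1$d.%3$d", the input 25.12.97 stores 25 in day and 12 in month, which matches a day-first convention without any change to the calling code.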
2.4 Binding a Locale to the Run-Time Environment
For an internationalized
program to operate correctly, it must bind to localized data that is appropriate
for the user at run time. The setlocale() function performs this
task. You can call setlocale() to:
The call takes two arguments: category and locale_name.
2.4.1 Binding to the Locale Set for the System or User
Typically,
the system manager or user sets the LANG environment variable to
the name of a locale; setting the LANG variable automatically sets
all portions, or categories, of the locale to the same locale name. On occasion,
system managers or individual users may set different locale categories to
different locale names. Usually, internationalized programs contain the following
call, which initializes all locale categories in the program to settings already
in effect for the user:
setlocale(LC_ALL, "");
2.4.2 Changing Locales During Program Execution
Some internationalized
programs may need to prompt the user for a locale name or change locales during
program execution. The following example shows how to call setlocale() when you want to explicitly initialize or reinitialize all locale categories
to the same locale name:
.
.
.
nl_catd catd; (1)
char buf[BUFSIZ]; (2)
.
.
.
setlocale(LC_ALL, ""); (3)
catd = catopen(CAT_NAME, 0); (4)
.
.
.
printf(catgets(catd, NL_SETD, LOCALE_PROMPT_MSG,
"Enter locale name: ")); (5)
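if (fgets(buf, sizeof(buf), stdin) != NULL) { (6)
/* fgets() keeps the newline; strcspn() is declared in <string.h> */
buf[strcspn(buf, "\n")] = '\0';
setlocale(LC_ALL, buf); (7)
}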
.
.
.