One of the primary functions of most computer programs is to manipulate data, some or all of which may involve interaction between the program and a computer user. In commercial situations, it is important that such interactions take place in the native language of each user. Cultural data, such as dates and monetary amounts, should also be presented according to local customs.
When you write programs to support multilanguage operation, you must also consider the fact that languages can be represented within the computer system by one or more codesets. Because of the requirements of different languages, characters in codesets may vary in both size (8-bit, 16-bit, and so on) and binary representation.
You can satisfy the preceding requirements by writing programs that make no hard-coded assumptions about language, cultural data, or character encodings. Such programs are said to be internationalized. Data specific to each supported language, territory, and codeset combination are held separately from the program code and can be bound to the run-time environment by language-initialization functions.
Digital UNIX provides the following facilities for developing internationalized software, defining localization data, and announcing specific language requirements:
The rest of this chapter describes each of these facilities in more detail.
The discussion and examples in this chapter focus on functions provided in the Standard C Library. Refer to Chapter 4 and Chapter 5 for information about using functions in the curses, X, and Motif libraries.
The ISO codesets cover the major European languages. Several of these codesets allow for the mixing of major languages within a single codeset. Each ISO codeset is a superset of the ASCII codeset and therefore allows systems to support languages other than English without invalidating existing software that is not internationalized. The Digital UNIX operating system provides locales that use the ISO 8859-1 (Latin 1) and ISO 8859-7 (Latin/Greek) codesets.
Subsets that support localized variants of the operating system may include locales based on additional ISO codesets. For example, the optional language variant subsets included with Digital UNIX to support Czech, Hungarian, Polish, Russian, Slovak, and Slovene provide locales based on the ISO 8859-2 (Latin 2) codeset. Following is a complete list of ISO codesets, along with the languages that they support:
Codeset | Languages |
---|---|
ISO 8859-1 | Western European languages, including Catalan |
ISO 8859-2 | Eastern European languages |
ISO 8859-3 | Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, and Turkish |
ISO 8859-4 | Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, and Lithuanian |
ISO 8859-5 | Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croatian, and Ukrainian |
ISO 8859-6 | Arabic |
ISO 8859-7 | Greek |
ISO 8859-8 | Hebrew |
ISO 8859-9 | Danish, Dutch, English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish |
ISO 8859-10 | Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic, Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese, and Swedish |
Another ISO codeset supported by utilities on a standard Digital UNIX operating system is ISO 6937:1983. This codeset, which accommodates both 7-bit and 8-bit characters, is used for text communication over communication networks and interchange media, such as magnetic tape and disks.
The codesets discussed up to this point address the requirements of languages whose characters can be stored in a single byte. Such codesets do not meet the needs of Asian languages, whose characters can occupy multiple bytes. Digital UNIX supplies the following codesets through subsets that support Asian languages and countries:
These codesets are supplied when you install Asian-language variant subsets of the Digital UNIX product. A specialized terminal driver and associated utilities must be available on your system to support the input and display of Asian characters at run time. These components are also supplied when you install one of the Asian-language variant subsets.
The Unicode and ISO/IEC 10646 standards specify the Universal Character Set (UCS), a character set that allows character units to be processed for all languages, including Asian languages, by using the same set of rules. Digital UNIX supports the UCS-4 encoding of this character set. An application parses UCS-4 character encoding in 32-bit units.
Another legacy of the historical dependence of UNIX systems on 7-bit ASCII character encoding is that some programs use the most significant bit of a byte for their own internal purposes. This was a dubious programming practice, although quite safe when characters in the underlying codeset always mapped to the remaining 7 bits of the byte. In the world of international codesets, the practice of using the most significant bit of a byte for program purposes must be avoided.
if ((c = getchar()) == '\141')
This condition assumes that lowercase a is always represented by a fixed octal value, which may not be true for all codesets. The following statement represents an improvement in that it substitutes a character constant for the octal value:
if ((c = getchar()) == 'a')
This example still presents problems, however, because the getchar() function operates on bytes. The statement would not work correctly if the next character in the input stream were a multibyte value. The following statement substitutes the getwchar() function for the getchar() function. The statement works correctly with any codeset because a is a member of the Portable Character Set and is transformed into the same wide-character value in all locales.
if ((c = getwchar()) == L'a')
The X/Open UNIX standard specifies that each member of the source character set and each escape sequence in character constants and string literals is converted to the same member of the execution character set in all locales. It is therefore safe for you to use any of the characters in the Portable Character Set as a character constant or in string literals. Note that non-English characters are not included in the Portable Character Set and may not translate correctly when used as literals. Consider the following example:
if ((c = getwchar()) == L'à')
The following example shows how to construct a test for a constant that for whatever reason may be a non-English character. The constant has been defined in a message catalog with the symbolic identifier MSG_ID. Statements in the example retrieve the value for MSG_ID from the message catalog, which is locale specific and bound to the program at run time.
.
.
.
char *schar;      (1)
wchar_t wchar;    (2)
.
.
.
schar = catgets(catd,NL_SETD,MSG_ID,"a");      (3)
if (mbtowc (&wchar,schar,MB_CUR_MAX) == -1)    (4)
    error();
if ((c = getwchar()) == wchar)                 (5)
.
.
.
The catgets() function returns a value as an array of bytes so the value is returned to the schar variable. If the accented character is not available in the locale's codeset, the test is made against the unaccented base character (a).
If schar does not contain a valid multibyte character, the mbtowc() function returns -1 and the program signals an error.
Refer to Chapter 3 for more information about message catalogs and the catgets() function. See Section 2.1.4 for information about converting multibyte characters and strings to wide-character data that your program can process.
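The conversion-and-compare step can be isolated in a small routine. The following is a minimal sketch, not code from the original example: the function name matches_mb_char() is hypothetical, and the multibyte string is passed in directly rather than retrieved with catgets().

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: converts the first multibyte character of mbchar
 * to a wide character and compares it with the wide character c.
 * Returns 1 on a match, 0 on a mismatch, and -1 if mbchar does not begin
 * with a valid multibyte character in the current locale.
 */
int matches_mb_char(wint_t c, const char *mbchar)
{
    wchar_t wc;

    /* Reset the conversion state; this matters only for
       state-dependent encodings. */
    (void) mbtowc(NULL, NULL, 0);

    if (mbtowc(&wc, mbchar, MB_CUR_MAX) == -1)
        return -1;

    return c == (wint_t) wc;
}
```

A caller would typically pass the result of getwchar() as the first argument and the string returned by catgets() as the second.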
Multibyte encoding is typically the encoding used when data is stored in a file or generated for external use or data interchange. Multibyte encoding has the following disadvantages:
Because of the disadvantages of multibyte encoding, wide-character encoding, which allocates a fixed number of bytes per character, is typically used for internal processing by programs; in fact, internal process code is another way of referring to data in wide-character format. The size of a wide character varies from one system implementation to another. On Digital UNIX systems, the default size for a wide character is set to 4 bytes (32 bits), a setting that optimizes performance for the Alpha processor.
The following example is from a program module that reads records from a database of employee data. In this case, the programmer wants to apply a locale-independent format to a record retrieved from a data file and uses the mbstowcs() function to explicitly convert an employee's first and last names from multibyte-character to wide-character encoding.
/*
 * The employee record is normalized with the following format, which
 * is locale independent: Badge number, First Name, Surname,
 * Cost Center, Date of Join in the `yy/mm/dd' format. Each field is
 * separated by a TAB. The space character is allowed in the First
 * Name and Surname fields.
 */
static const char *dbOutFormat = "%ld\t%S\t%S\t%S\t%02d/%02d/%02d\n";
static const char *dbInFormat = "%ld %[^\t] %[^\t] %S %02d/%02d/%02d\n";
.
.
.
sscanf(record, dbInFormat,
       &emp->badge_num,
       firstname,
       surname,
       emp->cost_center,
       &emp->date_of_join.tm_year,
       &emp->date_of_join.tm_mon,
       &emp->date_of_join.tm_mday);
(void) mbstowcs(emp->first_name, firstname, FIRSTNAME_MAX+1);
(void) mbstowcs(emp->surname, surname, SURNAME_MAX+1);
.
.
.
Refer to Section A.9 for a complete list of functions that work with multibyte data directly.
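The preceding example discards the return value of mbstowcs(). The following sketch shows one way to check that value and guarantee termination of the result. The name name_to_wide() and the NAME_MAX_CHARS limit are assumptions made for illustration and do not come from the employee program.

```c
#include <stdlib.h>
#include <wchar.h>

#define NAME_MAX_CHARS 32   /* hypothetical limit, for illustration only */

/*
 * Converts the multibyte string src into dst, which must have room for
 * NAME_MAX_CHARS + 1 wide characters. Returns the number of wide
 * characters stored (excluding the terminator), or (size_t)-1 if src
 * contains an invalid multibyte sequence for the current locale.
 */
size_t name_to_wide(wchar_t *dst, const char *src)
{
    size_t n = mbstowcs(dst, src, NAME_MAX_CHARS);

    if (n == (size_t) -1)
        return n;

    /* mbstowcs() writes no terminator when it fills all requested
       slots, so terminate explicitly. */
    dst[n] = L'\0';
    return n;
}
```

Checking for (size_t)-1 lets the caller reject records whose name fields are not valid in the locale's codeset instead of processing an undefined result.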
The source variant of a codeset must observe the following additional rules:
The C language compiler also supports trigraph sequences when you specify the -std1 or -std flag on the cc command line. Trigraph sequences, which are part of the ANSI C specification, allow users to enter the full range of basic characters in programs, even if their keyboards do not support all characters in the source codeset. The following trigraph sequences are currently defined, each of which is replaced by the corresponding single character:
Trigraph Sequence | Single Character |
---|---|
??= | # |
??( | [ |
??/ | \ |
??' | ^ |
??< | { |
??) | ] |
??! | \| |
??> | } |
??- | ~ |
In the past, many programs classified characters according to whether the character's value fell between certain numerical limits. For example, the following statement tests for all uppercase alphabetic characters:
if (c >= 'A' && c <= 'Z')
if (iswupper (c))
Internationalization functions classify wide-character code values according to type information in the user's locale and are independent of the language and codeset. Refer to Section A.2 for a complete list and description of character classification functions.
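As a small illustration of locale-based classification, the following sketch counts the uppercase characters in a wide-character string by calling iswupper() for each character. The function name count_upper() is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: returns the number of characters in s that the
 * current locale classifies as uppercase.
 */
size_t count_upper(const wchar_t *s)
{
    size_t n = 0;

    for (; *s != L'\0'; s++) {
        if (iswupper((wint_t) *s))
            n++;
    }
    return n;
}
```

Because the test relies on the locale's character-class data rather than on a numeric range, the same routine applies to letters outside the A to Z range when the locale defines them as uppercase.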
a_var |= 0x20;
.
.
.
a_var &= 0xdf;
The preceding statements are not safe to use in internationalized programs because they:
The correct way to handle case conversion is to call the towlower() function for conversion to lowercase and the towupper() function for conversion to uppercase. For example:
a_var = towlower(a_var);
.
.
.
a_var = towupper(a_var);
These functions use information specified in the user's locale and are independent of the codeset where characters are defined. The functions return the argument unchanged if input is invalid. Refer to Section A.3 for more detailed discussion of case conversion functions.
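The same call can be applied to every character in a string. The following sketch, with the hypothetical name wide_to_upper(), converts a wide-character string to uppercase in place.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: converts each character of s to uppercase
 * according to the current locale. Characters with no uppercase
 * equivalent are left unchanged by towupper().
 */
void wide_to_upper(wchar_t *s)
{
    for (; *s != L'\0'; s++)
        *s = (wchar_t) towupper((wint_t) *s);
}
```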
.
.
.
int cmp_val; char *s1; char *s2;
.
.
.
cmp_val = strcmp(s1, s2);
.
.
.
Certain languages, however, require collation algorithms that must make multiple passes through the codeset. Multiple passes may be required for the following reasons:
String comparison in an international environment thus depends on the codeset and language. This dependency means that additional functions are required to compare strings according to collating sequence information in the user's locale. These functions include:
If two strings are being compared only for equality, you can use strcmp() or wcscmp(), which are faster in most environments than wcscoll().
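To show where a locale-aware comparison fits in practice, the following sketch sorts an array of wide-character strings with qsort() and wcscoll(). The names compare_names() and sort_names() are hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Comparison callback for qsort(): orders two wide-character strings
   by the collating sequence of the current locale. */
static int compare_names(const void *a, const void *b)
{
    const wchar_t *const *sa = a;
    const wchar_t *const *sb = b;

    return wcscoll(*sa, *sb);
}

/* Hypothetical helper: sorts count string pointers in place. */
void sort_names(const wchar_t **names, size_t count)
{
    qsort(names, count, sizeof names[0], compare_names);
}
```

The resulting order depends on the LC_COLLATE setting in effect when sort_names() is called. In the default C locale, the order matches that produced by wcscmp().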
nl_langinfo(D_T_FMT);
In the following example, the strftime() function generates a date string as defined by the D_FMT item in the langinfo database:
.
.
.
setlocale(LC_ALL, "");    (1)
.
.
.
clock = time((time_t*)NULL);    (2)
tm = localtime(&clock);         (3)
.
.
.
strftime(buf, size, "%x", tm);    (4)
puts(buf);                        (5)
.
.
.
The buf argument is a pointer to a string variable in which the date string is returned. The size argument contains the maximum size of buf. The "%x" argument specifies conversion specifications, similar to the format strings used with the printf() and scanf() functions. The "%x" argument is replaced in the output string by the date representation appropriate for the locale.
The following example shows how to use strftime() and nl_langinfo() in combination to generate a date and time string. Assume that the same calls to the setlocale(), time(), and localtime() interfaces have been made here as shown in the preceding example. The only difference is that a call to nl_langinfo() has replaced the format string argument in the call to strftime():
.
.
.
strftime(buf, size, nl_langinfo(D_T_FMT), tm);
puts(buf);
.
.
.
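The combination of strftime() and nl_langinfo() can be wrapped in a routine that also reports failure. This is a minimal sketch: the name format_date_time() is hypothetical, and the caller is assumed to have already called setlocale() and filled in the tm structure.

```c
#include <langinfo.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: formats tm into buf by using the date and time
 * format (D_T_FMT) of the current locale. Returns the number of bytes
 * written, or 0 if the result does not fit in size bytes.
 */
size_t format_date_time(char *buf, size_t size, const struct tm *tm)
{
    return strftime(buf, size, nl_langinfo(D_T_FMT), tm);
}
```

A return value of 0 means that the contents of buf should not be used, so a caller would check the result before passing buf to puts().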
strfmon(buf, size, "%n", value); (1)
The money program demonstrates how the strfmon() function works. The source file for this sample program is available in the /usr/i18n/examples/money directory.
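For readers without access to that directory, the following sketch shows the general shape of a strfmon() call. It is not the money program. The wrapper name format_amount() is hypothetical, and the output depends entirely on the LC_MONETARY data of the locale in effect.

```c
#include <monetary.h>
#include <sys/types.h>

/*
 * Hypothetical helper: formats value into buf by using the national
 * currency format ("%n") of the current locale. Returns the number of
 * bytes written, or -1 if the result does not fit in size bytes.
 */
ssize_t format_amount(char *buf, size_t size, double value)
{
    return strfmon(buf, size, "%n", value);
}
```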
struct lconv *app_conv;
You can use the following features, which are contained in the structure lconv, in program-defined routines:
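As a minimal sketch of a program-defined routine that reads the lconv structure, the following function builds a decimal number by using the locale's decimal_point member. The function name format_decimal() and its parameters are assumptions made for illustration.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes whole and hundredths into buf, separated
 * by the radix character of the current locale. Returns the value
 * returned by snprintf().
 */
int format_decimal(char *buf, size_t size, long whole, int hundredths)
{
    /* localeconv() returns a pointer to the formatting conventions of
       the current locale; the structure must not be modified. */
    struct lconv *app_conv = localeconv();

    return snprintf(buf, size, "%ld%s%02d",
                    whole, app_conv->decimal_point, hundredths);
}
```

In the default C locale, decimal_point is a period, so the values 12 and 5 produce 12.05. A locale that uses a comma as its radix character would produce 12,05 from the same call.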
These considerations are discussed in the following sections.
The X/Open UNIX standard specifies a native-language message system that contains a definition of message text source files, the gencat command to generate message catalogs from these source files, and a set of Standard C Library functions to retrieve individual messages from one or more catalogs at run time.
The following example shows how an internationalized program retrieves a message from a catalog:
#include <stdio.h>                                            (1)
#include <locale.h>                                           (2)
#include <nl_types.h>                                         (3)
#include "prog_msg.h"                                         (4)

int main(void)
{
    nl_catd catd;                                             (5)

    setlocale(LC_ALL, "");                                    (6)
    catd = catopen("prog.cat", NL_CAT_LOCALE);                (7)
    puts(catgets(catd, SETN, HELLO_MSG, "Hello, world!"));    (8)
    catclose(catd);                                           (9)
    return 0;
}
.
.
.
A message catalog can contain one or more message sets and individual messages are ordered within each set.
This descriptor is returned by the function that opens the catalog. The descriptor is also passed as an argument to the function that closes the catalog.
The locale name set for the LC_MESSAGES variable is the locale used by the catopen() and catgets() functions in this example. Typically, the system manager or user sets only the LANG environment variable to a particular locale name and the same locale name is used for LC_MESSAGES.
The NL_CAT_LOCALE argument specifies that the program will use the locale name set for LC_MESSAGES. The catopen() function uses the value set for the NLSPATH environment variable to determine the location of the message catalog. The call returns the message catalog descriptor to the catd variable.
The first argument to this call is a call to the catgets() function, which retrieves the appropriate text for the message with the HELLO_MSG identifier. This message is contained in the message set identified by the SETN constant. Note that the catgets() function allows one message translation to be held within the program source. This is the translation that will be used in the event that the program cannot get the message from the message catalog.
Refer to Chapter 3 for information about creating and using message catalogs.
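The fallback behavior of catgets() can be seen in a routine that copies a message out of a catalog and tolerates a missing catalog. This is a sketch under stated assumptions: the name fetch_message() is hypothetical, and the catalog name, set number, and message number are supplied by the caller.

```c
#include <nl_types.h>
#include <stdio.h>

/*
 * Hypothetical helper: copies message msg_id from set set_id of the
 * named catalog into buf. If the catalog cannot be opened or the
 * message is not found, the fallback text is copied instead.
 * Returns the value returned by snprintf().
 */
int fetch_message(const char *catalog, int set_id, int msg_id,
                  const char *fallback, char *buf, size_t size)
{
    nl_catd catd = catopen(catalog, NL_CAT_LOCALE);
    int n;

    /* catgets() returns its last argument when the message cannot be
       retrieved, so no separate error branch is needed here. */
    n = snprintf(buf, size, "%s", catgets(catd, set_id, msg_id, fallback));

    /* Copy first, then close: the returned pointer may refer to
       storage that belongs to the open catalog. */
    if (catd != (nl_catd) -1)
        catclose(catd);

    return n;
}
```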
Consider the following example:
printf(catgets(catd, set_id, WRONG_OWNER_MSG, "%s is owned by %s\n"), folder_name, user_name);
The preceding statement uses a message catalog but assumes a particular language construction (a noun followed by a verb in passive voice followed by a noun). Passive-verb constructions are not part of all languages; therefore, message translation might mean printing user_name before folder_name. In other words, the translator might need to change the construction of the message so that the user sees the translated equivalent of "John_Smith owns JULY_REVENUE" rather than "JULY_REVENUE is owned by John_Smith."
To overcome the problems imposed by fixed ordering of message elements, the format specifiers for the printf() routine have been extended so that format conversion applies to the nth argument in an argument list rather than to the next unused argument. To apply the format conversion extension, replace the % conversion character with the sequence %digit$, where digit specifies the position of the argument in the argument list. The following example illustrates how the programmer applies this feature to the format string "%s is owned by %s\n":
printf(catgets(catd, set_id, WRONG_OWNER_MSG, "%1$s is owned by %2$s\n"), folder_name, user_name);
The construction of the string "%1$s is owned by %2$s", which is the default value for the WRONG_OWNER_MSG entry in the program's message file, can then be changed by the translator to the non-English equivalent of:
WRONG_OWNER_MSG "%2$s owns %1$s\n"
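The effect of the reordered format can be checked without a message catalog. In the following sketch, the format string is passed in directly where a real program would obtain it from catgets(). The name format_owner_msg() is hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: formats the ownership message into buf. The
 * argument order is fixed in the program; the order in the output is
 * decided by the positional specifiers in fmt.
 */
int format_owner_msg(char *buf, size_t size, const char *fmt,
                     const char *folder_name, const char *user_name)
{
    return snprintf(buf, size, fmt, folder_name, user_name);
}
```

With the format "%1$s is owned by %2$s", the folder name appears first. With "%2$s owns %1$s", the user name appears first, although the call itself is unchanged.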
Consider the following example:
.
.
.
int day; int month; int year;
.
.
.
scanf("%d/%d/%d", &month, &day, &year);
.
.
.
The format string in this statement is governed by the assumption that all users use a United States English format (mm/dd/yy) to input dates. In an internationalized program, you use extended format specifiers to support requirements that language may impose on the order of string elements. For example:
.
.
.
scanf(catgets(catd, NL_SETD, DATE_STRING, "%1$d/%2$d/%3$d"), &month, &day, &year);
.
.
.
The default "%1$d/%2$d/%3$d" value for the DATE_STRING message is still appropriate only for countries where users use the format mm/dd/yy to enter dates. However, for countries in which the order or formatting would be different, the translator can change the entry in the program's message file. For example:
DATE_STRING "%2$d/%1$d/%3$d"
DATE_STRING "%2$d.%1$d.%3$d"
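The following sketch shows how a translated format changes which variable receives each input field. It uses sscanf() on a string so that the behavior is easy to observe, and it takes the format as a parameter in place of the catgets() call. The name parse_date() is hypothetical.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parses input according to fmt, which is expected
 * to contain three positional %n$d specifiers. Argument 1 is month,
 * argument 2 is day, and argument 3 is year. Returns 1 if all three
 * fields were assigned and 0 otherwise.
 */
int parse_date(const char *input, const char *fmt,
               int *month, int *day, int *year)
{
    return sscanf(input, fmt, month, day, year) == 3;
}
```

With the format "%2$d.%1$d.%3$d", the input 25.12.1999 stores 25 in day and 12 in month, because the first field is directed to the second argument.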
The category argument can be one of the following:
The locale_name argument is one of the following values:
setlocale(LC_ALL, "");
.
.
.
nl_catd catd;        (1)
char buf[BUFSIZ];    (2)
.
.
.
setlocale(LC_ALL, "");          (3)
catd = catopen(CAT_NAME, 0);    (4)
.
.
.
printf(catgets(catd, NL_SETD, LOCALE_PROMPT_MSG, "Enter locale name: "));    (5)
gets(buf);                                                                   (6)
setlocale(LC_ALL, buf);                                                      (7)
.
.
.
To make sure that the variable is large enough to accommodate locale names on different systems, you should set its maximum size to the constant BUFSIZ, which is defined by the system vendor in /usr/include/stdio.h.
The CAT_NAME constant is defined in the program's own header file.
The NL_SETD constant specifies the default message set number in a message catalog and is defined in /usr/include/nl_types.h. The identifier LOCALE_PROMPT_MSG specifies the prompt string translation in the default message set.
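The gets() function cannot limit the number of bytes that it stores, so input longer than the buffer overruns it. The following sketch substitutes fgets() for gets(). This is a different technique from the one in the example, and the function name read_locale_name() is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: reads one line from in into buf, storing at
 * most size - 1 bytes, and removes the trailing newline so that the
 * result can be passed to setlocale(). Returns buf, or NULL on end of
 * file or read error.
 */
char *read_locale_name(FILE *in, char *buf, size_t size)
{
    if (fgets(buf, (int) size, in) == NULL)
        return NULL;

    /* fgets() keeps the newline; setlocale() would treat it as part
       of the locale name. */
    buf[strcspn(buf, "\n")] = '\0';
    return buf;
}
```

A caller would pass stdin as the first argument and check both this return value and the return value of setlocale(), which is NULL when the requested locale is not available.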
Sometimes a program needs to vary the locale only for a particular category of data. For example, consider a program that processes different country-specific files that contain monetary values. Before processing data in each file, the program might reinitialize a program variable to a new locale name and then use that variable value to reset only the LC_MONETARY category of the locale.
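Resetting a single category can be sketched as follows. The function name use_monetary_locale() is hypothetical, and the locale names that a program passes to it depend on which locales are installed on the system.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: switches only the LC_MONETARY category to
 * locale_name and leaves all other categories unchanged. Returns 1 on
 * success and 0 if the locale is not available, in which case the
 * previous setting stays in effect.
 */
int use_monetary_locale(const char *locale_name)
{
    return setlocale(LC_MONETARY, locale_name) != NULL;
}
```

A program that processes several country-specific files would call this routine before reading each file and skip or report any file whose locale cannot be set.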