2 Developing Internationalized Software

This chapter explains how language, codeset, and cultural differences change the way you implement basic coding operations. After reading this chapter, you will be ready to examine an application that applies the suggested program development techniques. Such an application is provided online in the /usr/examples/i18n/xpg4demo directory. Refer to the README document in that directory for an introduction to the application and for instructions on compiling and running it with different locales. Parts of the xpg4demo application are used as examples in this and other chapters.

One of the primary functions of most computer programs is to manipulate data, some or all of which may involve interaction between the program and a computer user. In commercial situations, it is important that such interactions take place in the native language of each user. The presentation of cultural data, such as dates and currency values, should also follow the customs of each user.

When you write programs to support multilanguage operation, you must consider the fact that languages can be represented within the computer system by one or more codesets. Because of the requirements of different languages, characters in codesets may vary in both size (8-bit, 16-bit, and so on) and binary representation.

You can satisfy the preceding requirements by writing programs that make no hard-coded assumptions about language, cultural data, or character encodings. Such programs are said to be internationalized. Data specific to each supported language, territory, and codeset combination are held separately from the program code and can be bound to the run-time environment by language-initialization functions.

Tru64 UNIX provides the following facilities for developing internationalized software, defining localization data, and announcing specific language requirements:

Library functions that handle extended character codes and that provide language- and codeset-independent character classification, case conversion, number format conversion, and string collation

Library functions that let programs dynamically determine cultural and language-specific data

A message system that allows program messages to be held apart from the program code, translated into different languages, and retrieved by a program at run time

An initialization function that binds a program at run time to the linguistic and cultural requirements of each user

The rest of this chapter describes each of these facilities in more detail.

The discussion and examples in this chapter focus on functions provided in the Standard C Library. Refer to Chapter 4 and Chapter 5 for information about using functions in the curses, X, and Motif libraries.

2.1 Using Codesets

In the past, most UNIX systems were based on the 7-bit ASCII codeset. However, most non-English languages include characters in addition to those contained in the ASCII codeset.

The X/Open UNIX standard does not require an operating system to supply any particular codesets in addition to ASCII. The standard does specify requirements for the interfaces that manipulate characters so that programs are able to handle characters from whatever codeset is available on a given system.

The first group of codesets from the International Organization for Standardization (ISO) covered only the major European languages. In this group, several codesets allow for the mixing of major languages within a single codeset. Each of these codesets is a superset of the ASCII codeset, and therefore systems can support non-English languages without invalidating existing software that is not internationalized. A Tru64 UNIX operating system always includes a locale for the United States that uses the ISO 8859-1 (ISO Latin-1) codeset.

Subsets that support localized variants of the operating system may include locales based on additional ISO codesets. For example, the optional language variant subsets included with Tru64 UNIX to support Czech, Hungarian, Polish, Russian, Slovak, and Slovene provide locales based on the ISO 8859-2 (Latin-2) codeset. Following is a complete list of ISO codesets with the languages that they support:

ISO 8859-1, Latin-1
Western European languages, including Catalan, Danish, Dutch, English, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish

ISO 8859-2, Latin-2
Eastern European languages, including Albanian, Czech, English, German, Hungarian, Polish, Romanian, Serbo-Croatian, Slovak, and Slovene

ISO 8859-3, Latin-3
Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, and Turkish

ISO 8859-4, Latin-4
Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, and Lithuanian

ISO 8859-5, Latin/Cyrillic
Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croatian, and Ukrainian

ISO 8859-6, Latin/Arabic
Arabic

ISO 8859-7, Latin/Greek
Greek

ISO 8859-8, Latin/Hebrew
Hebrew

ISO 8859-9, Latin-5
Danish, Dutch, English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish

ISO 8859-10, Latin-6
Danish, English, Estonian, Faroese, Finnish, German, Greenlandic, Icelandic, Sami (Lappish), Latvian, Lithuanian, Norwegian, and Swedish

ISO 8859-15, Latin-9
Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, and Swedish

Another ISO codeset supported by utilities on a standard operating system is ISO 6937:1983. This codeset, which accommodates both 7-bit and 8-bit characters, is used for text communication over communication networks and interchange media, such as magnetic tape and disks.

The codesets discussed up to this point address the requirements of languages whose characters can be stored in a single byte. Such codesets do not meet the needs of Asian languages, whose characters can occupy multiple bytes. The operating system software supplies the following codesets through subsets that support Asian languages and countries:

eucJP (Japanese Extended UNIX Code)

SJIS (Shift JIS)

deckanji (DEC Kanji)

sdeckanji (Super DEC Kanji)

deckorean (DEC Korean)

eucKR (Korean Extended UNIX Code)

TACTIS (Thai API Consortium/Thai Industrial Standard)

dechanzi (DEC Hanzi)

dechanyu (DEC Hanyu)

eucTW (Taiwanese Extended UNIX Code)

big5 (BIG-5)

These codesets are supplied when you install Asian-language variant subsets of the operating system software. Also supplied are a specialized terminal driver and associated utilities that must be available on your system to support the input and display of Asian characters at run time.

Codesets developed for PC systems are commonly called code pages. There are PC code pages that correspond to most of the language-specific codesets developed for UNIX systems. The operating system supports PC codesets mostly through converters that can change file data from one type of encoding format to another. The operating system also supplies a limited number of locales for which characters are defined in PC code page format. For detailed information about code page support, see code_page(5).

The Unicode and ISO/IEC 10646 standards specify the Universal Character Set (UCS), which allows character units to be processed for all languages, including Asian languages, by using the same set of rules. The operating system supports the UCS-4 (32-bit) encoding of this character set in locales that also provide local cultural data, such as collating sequences and date and monetary formats. These locales are derived from similar locales that use UNIX codesets. Therefore, only the characters appropriate for the set of languages supported by the underlying UNIX locale are defined as valid characters in the UCS-4 version.

Two other encoding formats are defined by the Unicode and ISO/IEC 10646 standards:

UCS-2, the 16-bit implementation of the UCS

UTF-8, a UCS transformation format for handling file data containing characters coded in more than one byte

The operating system supports these encoding formats through both locales and codeset converters. Locales whose name extensions include .UTF-8 handle file data in UTF-8 format as well as supporting UCS-4 process code. Among these locales are special variants (*.UTF-8@euro locales) that also support the euro monetary character. There is also one locale, universal.UTF-8, that an application can use along with the fold_string_w() function to process the full range of characters defined by the Unicode and ISO/IEC 10646 standards. This particular locale differs from most others because it does not provide access to local cultural conventions. See Unicode(5) for detailed information about support for the UCS-2, UCS-4, and UTF-8 encoding formats. See euro(5) for more information about the euro monetary character.

Reference pages are available for all the codesets that the operating system supports. For more information on a specific codeset, refer to its reference page. For information on how codesets are supported for a particular local language, refer to the reference page for that language. Reference pages for languages, particularly Asian languages, may note additional codesets that are not supported in a locale but for which there is a codeset converter.

The following sections discuss important issues that affect the way you write source code when your program must process characters in different codesets:

Ensuring data transparency

Using in-code literals

Manipulating multibyte characters

Converting between multibyte-character and wide-character data

Rules for multibyte characters

Classifying characters

Converting characters (case)

Comparing strings

2.1.1 Ensuring Data Transparency

As discussed in Section 2.1, internationalized software must accommodate a wide variety of character-encoding schemes. Programs cannot assume that a particular codeset is on all systems that conform to requirements in the X/Open UNIX CAE specifications, nor that individual characters occupy a fixed number of bits.

Another legacy of the historical dependence of UNIX systems on 7-bit ASCII character encoding is that some programs use the most significant bit of a byte for their own internal purposes. This was a dubious programming practice, although quite safe when characters in the underlying codeset always mapped to the remaining 7 bits of the byte. In the world of international codesets, the practice of using the most significant bit of a byte for program purposes must be avoided.
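The following minimal sketch illustrates why the practice is unsafe. The function names strip_high_bit() and copy_transparent() are hypothetical and are not part of any system library; the sketch assumes single-byte data in an 8-bit codeset such as ISO 8859-1.

```c
#include <stddef.h>

/* Legacy practice: clears the most significant bit of a byte on the
 * assumption that every character is 7-bit ASCII. */
unsigned char strip_high_bit(unsigned char byte)
{
    return (unsigned char)(byte & 0x7f);
}

/* Transparent handling: copies every byte unchanged, so all 8 bits
 * of each byte reach the destination intact. */
void copy_transparent(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

In ISO 8859-1, the byte 0xe9 encodes the character é. Passing that byte to strip_high_bit() yields 0x69, which is the letter i, so the text is silently corrupted. A program that handles data transparently leaves all 8 bits of every byte untouched.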

2.1.2 Using In-Code Literals

When writing internationalized software, using in-code literals can cause problems. Consider, for example, the following conditional statement:

if ((c = getchar()) == 0141)

This condition assumes that lowercase a is always represented by a fixed octal value, which may not be true for all codesets. The following statement represents an improvement in that it substitutes a character constant for the octal value:

if ((c = getchar()) == 'a')

This example still presents problems, however, because the getchar() function operates on bytes. The statement would not work correctly if the next character in the input stream spanned multiple bytes. The following statement substitutes the getwchar() function for the getchar() function. The statement works correctly with any codeset because a is a member of the Portable Character Set (PCS) and is transformed into the same wide-character value in all locales.

if ((c = getwchar()) == L'a')

The X/Open UNIX standard specifies that each member of the source character set and each escape sequence in character constants and string literals is converted to the same member of the execution character set in all locales. It is therefore safe for you to use any of the characters in the PCS as a character constant or in string literals. Note that non-English characters are not included in the PCS and may not translate correctly when used as literals. Consider the following example:

if ((c = getwchar()) == L'à')

The accented character à may not be represented in the codeset's source character set, execution character set, or both; or the binary value of the accented character may not be translatable from one set to the other. When source files specify non-English characters in constants, the results are undefined.

The following example shows how to construct a test for a constant that for whatever reason may be a non-English character. The constant has been defined in a message catalog with the symbolic identifier MSG_ID. Statements in the example retrieve the value for MSG_ID from the message catalog, which is locale specific and bound to the program at run time.


.
.
.
char *schar;      [1]
wchar_t wchar;    [2]

.
.
.
schar = catgets(catd,NL_SETD,MSG_ID,"a");  [3]
if (mbtowc (&wchar,schar,MB_CUR_MAX) == -1)  [4]
        error();
if ((c = getwchar()) == wchar)  [5]

.
.
.

[1] Declares schar as a pointer to char.

[2] Declares the variable wchar to be of type wchar_t.

[3] Calls the catgets() function to retrieve the value of MSG_ID from the message catalog for the user's locale.
The catgets() function returns the value as an array of bytes, so the value is assigned to the schar variable. If the accented character is not available in the locale's codeset, the test is made against the unaccented base character (a).

[4] Tests to make sure the value contained in schar represents a valid multibyte character; if it does, converts it to a wide-character value and stores the result in the variable wchar.
If schar does not contain a valid multibyte character, signals an error.

[5] Codes the conditional statement to use the value contained in wchar as the constant.

Refer to Chapter 3 for more information about message catalogs and the catgets() function. See Section 2.1.4 for information about converting multibyte characters and strings to wide-character data that your program can process.

2.1.3 Manipulating Characters That Span Multiple Bytes

Tru64 UNIX provides all the interfaces (such as putwc(), getwc(), fputws(), and fgetws()) that are needed to support codesets with characters that span multiple bytes. Language variant subsets of the operating system must be installed to supply the locales and facilities that make this support operational. On systems where such locales are not available, or are available and not bound to the program at run time, the *ws* and *wc* functions are merely synonyms for the associated single-byte functions (such as putc(), getc(), fputs(), and fgets()).
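As a minimal sketch of how these interfaces are used, the following function counts the characters, rather than the bytes, that remain on an input stream. The function name count_characters() is hypothetical. The sketch assumes that the stream has not already been used with byte-oriented functions such as getc(), because byte and wide-character operations should not be mixed on the same open stream.

```c
#include <stdio.h>
#include <wchar.h>

/* Counts the characters (not bytes) that remain on a stream by reading
 * one wide character at a time until getwc() returns WEOF.
 * Returns -1 if the error indicator for the stream is set afterward. */
long count_characters(FILE *stream)
{
    long count = 0;

    while (getwc(stream) != WEOF)
        count++;

    return ferror(stream) ? -1 : count;
}
```

When a locale with a multibyte codeset is bound to the program, each call to getwc() consumes as many bytes as the next character occupies, so the count reflects characters regardless of their encoded size.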

2.1.4 Converting Between Multibyte-Character and Wide-Character Data

On an internationalized system, data can be encoded as either multibyte-character or wide-character data.

Multibyte encoding is typically the encoding used when data is stored in a file or generated for external use or data interchange. Multibyte encoding has the following disadvantages:

Multibyte characters are not represented by a fixed number of bytes per character, even in the same codeset, so the size of a character in a multibyte data record can vary from one character to the next.

The parsing rules for retrieving character codes from a multibyte data record are locale dependent.

Because of these disadvantages, wide-character encoding, which allocates a fixed number of bytes per character, is typically used for internal processing by programs; in fact, internal process code is another way of referring to data in wide-character format. The size of a wide character varies from one system implementation to another. On Tru64 UNIX systems, the size for a wide character is set to 4 bytes (32 bits), a setting that optimizes performance for the Alpha processor.

Library routines that print, scan, input, or output text can automatically convert data from multibyte characters to wide characters or from wide characters to multibyte characters, as appropriate for the operation. However, applications almost always have additional statements or requirements for which conversion to and from multibyte characters needs to be explicit.

The following example is from a program module that reads records from a database of employee data. In this case, the programmer wants to process the data in fixed-width units, so the code uses the mbstowcs() function to explicitly convert an employee's first and last names from multibyte-character to wide-character encoding.

/*
 * The employee record is normalized with the following format, which
 * is locale independent:  Badge number, First Name, Surname,
 * Cost Center, Date of Join in the `yy/mm/dd' format. Each field is
 * separated by a TAB. The space character is allowed in the First
 * Name and Surname fields.
 */
static const char *dbOutFormat = "%ld\t%S\t%S\t%S\t%02d/%02d/%02d\n";
static const char *dbInFormat = "%ld %[^\t] %[^\t] %S %02d/%02d/%02d\n";

.
.
.
sscanf(record, dbInFormat,
                   &emp->badge_num,
                   firstname,
                   surname,
                   emp->cost_center,
                   &emp->date_of_join.tm_year,
                   &emp->date_of_join.tm_mon,
                   &emp->date_of_join.tm_mday);
            (void) mbstowcs(emp->first_name, firstname, FIRSTNAME_MAX+1);
            (void) mbstowcs(emp->surname, surname, SURNAME_MAX+1);

.
.
.

Refer to Section A.9 for a complete list of functions that work directly with multibyte data.
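The preceding excerpt discards the return value of mbstowcs(). The following sketch shows one way to size the destination buffer and detect conversion errors. The function name mb_to_wide() is hypothetical. The sketch also assumes that calling mbstowcs() with a null destination pointer returns the number of wide characters required, which is behavior defined by X/Open rather than by the ISO C standard.

```c
#include <stdlib.h>
#include <wchar.h>

/* Converts a multibyte string to a newly allocated wide-character string.
 * Returns NULL if src contains an invalid multibyte sequence or if memory
 * cannot be allocated. The caller frees the result. */
wchar_t *mb_to_wide(const char *src)
{
    /* Assumption: a null destination makes mbstowcs() count only. */
    size_t len = mbstowcs(NULL, src, 0);
    wchar_t *dst;

    if (len == (size_t)-1)
        return NULL;                 /* invalid multibyte sequence */

    dst = malloc((len + 1) * sizeof *dst);
    if (dst == NULL)
        return NULL;

    /* Room for len characters plus the terminating null wide character. */
    if (mbstowcs(dst, src, len + 1) == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

Because the parsing rules for multibyte data are locale dependent, the result depends on the locale that is bound to the program when the function is called.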

2.1.5 Rules for Multibyte Characters in Source and Execution Codesets

Both the source and execution character set variants of the same codeset can contain multibyte characters. The encodings do not have to be the same, but both set variants observe certain rules in codesets that meet X/Open requirements. PC code pages and UCS-based codesets may adhere to some or most of these rules, but the codesets native to any UNIX system that conforms to X/Open standards must adhere to all of them.

The characters defined in the Portable Character Set must be present in both sets.

The existence, meaning, and encoding of any additional members are locale specific.

A character may have a state-dependent encoding. A string of characters may contain a shift-state character that affects the system's interpretation of the following bytes until another shift-state character is encountered.

While in the initial shift state, all characters from the basic character set retain their usual interpretation and do not alter the shift state.

The interpretation for subsequent bytes in the sequence is a function of the current shift state.

A byte with all bits set to zero is interpreted as a null character, independent of the shift state.

A byte with all bits zero must not occur in the second or subsequent bytes of a multibyte character.

The source variant of a codeset must observe the following additional rules:

A comment, string literal, character constant, or header name must begin and end in the initial shift state.

A comment, string literal, character constant, or header name must consist of a sequence of valid multibyte characters.

The C language compiler also supports trigraph sequences when you specify the -std1 or -std flag on the cc command line. Trigraph sequences, which are part of the ANSI C specification, allow users to enter the full range of basic characters in programs, even if their keyboards do not support all characters in the source codeset. The following trigraph sequences are currently defined, each of which is replaced by the corresponding single character:

Trigraph Sequence	Single Character
`??=`	`#`
`??(`	`[`
`??/`	`\`
`??'`	`^`
`??<`	`{`
`??)`	`]`
`??!`	`|`
`??>`	`}`
`??-`	`~`

2.1.6 Classifying Characters

Another feature of program operation that depends on the locale is character classification; that is, determining whether a particular character code refers to an uppercase alphabetic, lowercase alphabetic, digit, punctuation, control, or space character.

In the past, many programs classified characters according to whether the character's value fell between certain numerical limits. For example, the following statement tests for all uppercase alphabetic characters:

if (c >= 'A' && c <= 'Z')

This statement is valid for the ASCII codeset, in which all uppercase letters have values in the range 0x41 to 0x5a (A to Z). However, the statement is not valid for the ISO 8859-1 codeset, in which uppercase letters occupy the ranges 0x41 to 0x5a, 0xc0 to 0xd6, and 0xd8 to 0xde. In the EBCDIC codeset, character values are different again and, in this case, even the uppercase English letters have a different encoding.

When you write internationalized programs, classify characters by calling the appropriate internationalization function. For example:

if (iswupper (c))

Internationalization functions classify wide-character code values according to ctype information in the user's locale. Refer to Section A.2 for a complete list and description of character classification functions.
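As a brief illustration, the following sketch counts the uppercase characters in a wide-character string by calling iswupper() instead of testing numeric ranges. The function name count_uppercase() is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Counts the characters in text that the current locale classifies
 * as uppercase letters. */
size_t count_uppercase(const wchar_t *text)
{
    size_t count = 0;

    for (; *text != L'\0'; text++)
        if (iswupper((wint_t)*text))
            count++;

    return count;
}
```

The function contains no assumptions about character values, so the same code gives locale-appropriate results for accented uppercase letters once the program has called setlocale().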

2.1.7 Converting Characters

You can do case conversion of ASCII characters with statements such as the following ones, which convert the character in a_var first to lowercase and then to uppercase:

a_var |= 0x20;

.
.
.
a_var &= 0xdf;

The preceding statements are not safe to use in internationalized programs because they:

Assume ASCII-coded character values

Can convert invalid values

The correct way to handle case conversion is to call the towlower() function for conversion to lowercase and the towupper() function for conversion to uppercase. For example:

a_var = towlower(a_var);

.
.
.
a_var = towupper(a_var);

These functions use information specified in the user's locale and are independent of the codeset where characters are defined. The functions return the argument unchanged if input is invalid. Refer to Section A.3 for more detailed discussion of case conversion functions.
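The following sketch applies towupper() to every character of a wide-character string. The function name wide_to_upper() is hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/* Converts each character of text to uppercase in place, according to
 * the current locale. Characters that have no uppercase equivalent are
 * left unchanged. Returns text for convenience. */
wchar_t *wide_to_upper(wchar_t *text)
{
    wchar_t *p;

    for (p = text; *p != L'\0'; p++)
        *p = (wchar_t)towupper((wint_t)*p);

    return text;
}
```

Unlike the bit-masking statements shown earlier, this loop does not alter digits, punctuation, or other characters that have no case.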

2.1.8 Comparing Strings

UNIX systems have always provided functions for comparing character strings. The following statement, for example, compares the strings s1 and s2, returning an integer greater than, equal to, or less than zero, depending on whether the value of s1 is greater than, equal to, or less than the value of s2 in the machine-collating sequence:


.
.
.
int cmp_val;
char *s1;
char *s2;

.
.
.
cmp_val = strcmp(s1, s2);

.
.
.

Many languages, however, require more complex collation algorithms than a simple numerical sort. For example, multiple passes may be required for the following reasons:

Ordering accented characters within a particular character class for a language (for example, a, á, à, and so on)

Collating certain multiple character sequences as a single character (for example, the Welsh character ch, which collates after c and before d)

Collating certain single characters as a 2-character sequence (for example, the German character sharp s, which collates as ss)

Ignoring certain characters during collation (for example, hyphens in dictionary words)

String comparison in an international environment thus depends on the codeset and language. This dependency means that additional functions are required to compare strings according to collating sequence information in the user's locale. These functions include:

strcoll(), which uses collation information defined in the user's locale rather than performing a simple numeric comparison as does the strcmp() function

wcscoll(), which performs the same operation as strcoll(), except that it operates on wide characters

wcsxfrm(), which transforms a wide-character string by using collating sequence information in the user's locale so that the resulting string can be compared using the wcscmp() function

If two strings are being compared only for equality, you can use strcmp() or wcscmp(), which are faster in most environments than wcscoll().
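A common use of strcoll() is as the comparison step of a sort. The following sketch sorts an array of strings with qsort(), using strcoll() so that the order follows the collating sequence of the current locale. The names compare_by_locale() and sort_by_locale() are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparison function: each argument points to an element of
 * the array, that is, to a pointer to a string. */
static int compare_by_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sorts count strings into the collating order of the current locale. */
void sort_by_locale(const char **strings, size_t count)
{
    qsort(strings, count, sizeof *strings, compare_by_locale);
}
```

If the program has not called setlocale(), the C locale is in effect and strcoll() orders strings the same way that strcmp() does.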

2.2 Handling Cultural Data

Cultural data refers to items of information that can vary between languages or territories.

For example:

In the United Kingdom and the United States, a period represents the radix character and a comma represents the thousands separator in decimal numbers. In Germany, the same two characters are used in decimal numbers with exactly the opposite meaning.

In the United States, the date October 7, 1986 is represented as 10/7/1986, whereas in the United Kingdom, the same date is represented as 7/10/1986. This example indicates that cultural data items can vary when the same language is spoken.

Date delimiters, as well as the order of year, month, and day, can vary among countries. In Germany, for example, the date October 7, 1986 is represented as 7.10.1986 rather than as 7/10/1986.

Currency symbols can vary both in terms of the characters used and where they are placed in a currency value; that is, currency symbols can precede, follow, or be embedded in the value.

You cannot make assumptions about cultural data when writing internationalized programs. Your program must operate according to the local customs of users. The X/Open UNIX standard specifies that this requirement be met through a database of cultural data items that a program can access at run time, plus a set of associated interfaces. The following sections discuss this database and the functions used to extract and process its data items.

2.2.1 The langinfo Database

The language information database, named langinfo, contains items that represent the cultural details of each locale supported on the system. The langinfo database contains the following information for each locale, as required by the X/Open UNIX standard:

Codeset name

Date and time formats

Names of the days of the week

Names of the months of the year

Abbreviations for names of days

Abbreviations for names of months

Radix character (the character that separates whole and fractional quantities)

Thousands separator character

Affirmative and negative responses for yes/no queries

Currency symbol and its position within a currency value

Emperor/Era name and year (for Japanese locales)

2.2.2 Querying the langinfo Database

You can extract cultural data items from the langinfo database by calling the nl_langinfo() function. This function takes an item argument that is one of several constants defined in the /usr/include/langinfo.h header file. The function returns a pointer to the string with the value for item in the current locale. The following example shows a call to nl_langinfo() that extracts the string for formatting date and time information. This value is associated with the constant D_T_FMT.

nl_langinfo(D_T_FMT);
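The string that nl_langinfo() returns may be overwritten by a later call to the function, so a program that needs to keep a value should copy it. The following sketch shows one way to do so. The function name copy_langinfo() is hypothetical.

```c
#include <langinfo.h>
#include <stdio.h>

/* Copies the langinfo value for item in the current locale into buf.
 * Returns the length of the value, which is greater than or equal to
 * size if the value was truncated. */
int copy_langinfo(nl_item item, char *buf, size_t size)
{
    return snprintf(buf, size, "%s", nl_langinfo(item));
}
```

For example, a call such as copy_langinfo(RADIXCHAR, buf, sizeof buf) stores the radix character for the current locale in buf.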

2.2.3 Generating and Interpreting Date and Time Strings That Observe Local Customs

Programs often generate date and time strings. Internationalized programs generate strings that observe the local customs of the user. You can meet this requirement by calling the strftime() or wcsftime() function. Both functions indirectly use the langinfo database. The difference is that wcsftime() converts date and time to wide-character format.

In the following example, the strftime() function generates a date string as defined by the D_FMT item in the langinfo database:


.
.
.
setlocale(LC_ALL, "");  [1]

.
.
.
clock = time((time_t*)NULL);  [2]
tm = localtime(&clock);  [3]

.
.
.
strftime(buf, size, "%x", tm);  [4]
puts(buf);  [5]

.
.
.

[1] Binds the program at run time to the locale set for the system or individual user.

[2] Calls the time() function to store the current time, measured relative to Coordinated Universal Time, in the clock variable.

[3] Calls the localtime() function to convert the value contained in clock to a value that can be stored in a tm structure, whose members represent values for year, month, day, hour, minute, and so forth.

[4] Calls strftime() to generate a date string formatted as defined in the user's locale from the value contained in the tm structure.
The buf argument is a pointer to a string variable in which the date string is returned. The size argument contains the maximum size of buf. The "%x" argument is a conversion specification, similar to those in the format strings used with the printf() and scanf() functions. The "%x" specification is replaced in the output string by the date representation appropriate for the locale.

[5] Calls the puts() function to copy the string contained in buf to the standard output stream (stdout) and to append a newline character.

The following example shows how to use strftime() and nl_langinfo() in combination to generate a date and time string. Assume that the same calls to the setlocale(), time(), and localtime() interfaces have been made here as shown in the preceding example. The only difference is that a call to nl_langinfo() has replaced the format string argument in the call to strftime():


.
.
.
strftime(buf, size, nl_langinfo(D_T_FMT), tm);
puts(buf);

.
.
.

To convert a string to a date/time value, the reverse of the operation performed by strftime(), you can use the strptime() function. The strptime() function supports a number of conversion specifiers that behave in a locale-dependent manner.
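The wcsftime() function mentioned earlier in this section can be used in the same way as strftime() when the program works with wide-character data. The following sketch is a minimal illustration. The function name format_wide_date() is hypothetical, and the sketch assumes the ISO C form of wcsftime(), in which the format argument is a wide-character string.

```c
#include <time.h>
#include <wchar.h>

/* Formats the date in tm as a wide-character string, using the date
 * representation of the current locale. The count argument is the
 * capacity of buf in wide characters. Returns the number of wide
 * characters stored, or 0 if the result does not fit. */
size_t format_wide_date(wchar_t *buf, size_t count, const struct tm *tm)
{
    return wcsftime(buf, count, L"%x", tm);
}
```

A return value of 0 indicates that the buffer was too small, in which case the contents of buf should not be used.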

2.2.4 Formatting Monetary Values

The strfmon() function formats monetary values according to information in the locale that is bound to the program at run time. For example:

strfmon(buf, size, "%n", value);

This statement formats the double-precision floating-point value contained in the value variable. The "%n" argument is the format specification that is replaced by the format defined in the run-time locale. The results are returned to the buf array, whose maximum length is contained in the size variable.

The money program demonstrates how the strfmon() function works. The source file for this sample program is available in the /usr/i18n/examples/money directory.
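Independently of the money sample program, the following sketch wraps the call in a small function so that the return value can be checked. The function name format_money() is hypothetical.

```c
#include <monetary.h>
#include <sys/types.h>

/* Formats value as a monetary amount in the national format of the
 * current locale. Returns the number of bytes stored in buf, or -1 if
 * the result does not fit in size bytes. */
ssize_t format_money(char *buf, size_t size, double value)
{
    return strfmon(buf, size, "%n", value);
}
```

The output depends entirely on the locale bound to the program. The currency symbol, its position, the radix character, and the grouping of digits all come from the locale rather than from the program.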

2.2.5 Formatting Numeric Values in Program-Specific Ways

You may want to perform your own conversions of numeric quantities, monetary or otherwise, by using specific formatting details in the user's locale. The localeconv() function, which has no arguments, returns a pointer to a structure that contains all the number formatting details defined in the locale. Your program declares a pointer to receive this value. For example:

struct lconv *app_conv;

You can use the following features, which are contained in the lconv structure, in program-defined routines:

Radix character

Thousands separator character

Digit grouping size

International currency symbol

Local currency symbol

Radix character for monetary values

Thousands separator for monetary values

Digit grouping size for monetary values

Positive sign

Negative sign

Number of fractional digits to be displayed

Parenthesis symbols for negative monetary values
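The following sketch shows a program-defined routine that uses one of these features, the radix character, which is stored in the decimal_point member of the lconv structure. The function name format_hundredths() is hypothetical, and the routine handles only unsigned quantities expressed in hundredths.

```c
#include <locale.h>
#include <stdio.h>

/* Formats a quantity given in hundredths (for example, 1205 for twelve
 * and five hundredths) by using the radix character of the current
 * locale. Returns the length of the formatted string. */
int format_hundredths(char *buf, size_t size, unsigned long hundredths)
{
    const struct lconv *conv = localeconv();

    return snprintf(buf, size, "%lu%s%02lu",
                    hundredths / 100, conv->decimal_point,
                    hundredths % 100);
}
```

The structure returned by localeconv() can be overwritten by later calls to localeconv() or setlocale(), so the routine reads the value it needs immediately rather than saving the pointer.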

2.2.6 Using the langinfo Database for Other Tasks

Functions in addition to the ones discussed so far use the langinfo database to determine settings for specific items of cultural data. For example, the wscanf(), wprintf(), and wcstod() functions determine the appropriate radix character from information in the langinfo database.

2.3 Handling Text Presentation and Input

The language of the program user affects:

The way program messages are defined and accessed

How the program presents output text

How the program processes input text

These considerations are discussed in the following sections.

2.3.1 Creating and Using Messages

Programs need to communicate with users in their own language. This requirement places some constraints on the way program messages are defined and accessed. More specifically, messages are defined in a file that is independent of the program source code and are not compiled into object files. Because messages are in a separate file, they can be translated into different languages and stored in a form that is linked to the program at run time. Programs can then retrieve message text translations that are appropriate for the user's language.

The X/Open UNIX standard specifies:

A messaging system that contains a definition of message text source files

The gencat command to generate message catalogs from these source files

A set of library functions to retrieve individual messages from one or more catalogs at run time

The following example shows how an internationalized program retrieves a message from a catalog:

#include <stdio.h>     [1]
#include <locale.h>    [2]
#include <nl_types.h>  [3]
#include "prog_msg.h"      [4]

int main(void)
{
      nl_catd catd;  [5]
      setlocale(LC_ALL, "");  [6]
      catd = catopen("prog.cat", NL_CAT_LOCALE);  [7]
      puts(catgets(catd, SETN, HELLO_MSG, "Hello, world!")); [8]
      catclose(catd);  [9]
      return 0;
}

.
.
.

[1] Includes the /usr/include/stdio.h header file, which declares standard I/O functions such as puts().

[2] Includes the /usr/include/locale.h header file, which declares the setlocale() function and associated constants and variables.

[3] Includes the /usr/include/nl_types.h header file, which declares the catopen(), catgets(), and catclose() functions.

[4] Includes the program-specific prog_msg.h header file, which sets constants to identify the message set (SETN) and specific messages (HELLO_MSG being one) that are used by this program module.
A message catalog can contain one or more message sets, and individual messages are ordered within each set.

[5] Declares a message catalog descriptor catd to be of type nl_catd.
This descriptor is returned by the function that opens the catalog. The descriptor is also passed as an argument to the function that closes the catalog.

[6] Calls the setlocale() function to bind the program's locale categories to settings for the user's locale environment variables.
The locale name set for the LC_MESSAGES category is the locale used by the catopen() and catgets() functions in this example. Typically, the system manager or user sets only the LANG or LC_ALL environment variable to a particular locale name, and this operation implicitly sets the LC_MESSAGES variable as well.

[7] Calls the catopen() function to open the prog.cat message catalog for use by this program.
The NL_CAT_LOCALE argument specifies that the program will use the locale name set for LC_MESSAGES. The catopen() function uses the value set for the NLSPATH environment variable to determine the location of the message catalog. The call returns the message catalog descriptor to the catd variable.

[8] Calls the puts() function to display the message.
The argument to this call is a call to the catgets() function, which retrieves the appropriate text for the message with the HELLO_MSG identifier. This message is contained in the message set identified by the SETN constant. The final argument to catgets() is the default text to be used if the messaging call cannot retrieve the translated text from the catalog. Default text is usually in English.

[9] Calls the catclose() function to close the message catalog whose descriptor is contained in the catd variable.
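The preceding example does not test the value returned by catopen(). As a hedged sketch, a program might wrap the call in a helper like the following, which reports whether a catalog was found. The open_catalog() name is hypothetical. A failed catopen() call returns (nl_catd)-1.

```c
#include <nl_types.h>

/* Hypothetical helper: open cat_name and store the descriptor in *catd.
 * Returns 1 if the catalog was opened, or 0 if the program must rely
 * on the default text that it passes to catgets(). */
static int open_catalog(const char *cat_name, nl_catd *catd)
{
    *catd = catopen(cat_name, NL_CAT_LOCALE);
    return *catd != (nl_catd)-1;
}
```

A program can use the return value to log a warning or to skip the later catclose() call. Calls to catgets() with the failed descriptor are expected to return the default text.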

Refer to Chapter 3 for information about creating and using message catalogs.

2.3.2 Formatting Output Text

Successful translation of messages into different languages depends not only on making messages independent of the program source code but also on careful construction of message strings within the program.

Consider the following example:

printf(catgets(catd, set_id, WRONG_OWNER_MSG,
               "%s is owned by %s\n"),
               folder_name, user_name);

The preceding statement uses a message catalog but assumes a particular language construction (a noun followed by a verb in passive voice followed by a noun). Passive-verb constructions are not part of all languages; therefore, message translation might mean printing user_name before folder_name. In other words, the translator might need to change the construction of the message so that the user sees the translated equivalent of "John_Smith owns JULY_REVENUE" rather than "JULY_REVENUE is owned by John_Smith."

To overcome the problems imposed by fixed ordering of message elements, the format specifiers for the printf() routine have been extended so that format conversion applies to the nth argument in an argument list rather than to the next unused argument. To apply the format conversion extension, replace the % conversion character with the sequence %digit$, where digit specifies the position of the argument in the argument list. The following example illustrates how the programmer applies this feature to the format string "%s is owned by %s\n":

printf(catgets(catd, set_id, WRONG_OWNER_MSG,
               "%1$s is owned by %2$s\n"),
               folder_name, user_name);

The string "%1$s is owned by %2$s\n" is the default value for the WRONG_OWNER_MSG entry in the program's message file. The translator can then change the construction of this string to the non-English equivalent of:

WRONG_OWNER_MSG        "%2$s owns %1$s\n"
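The effect of the translated entry can be sketched without a catalog by passing each format string directly. The format_owner_msg() helper is hypothetical. In a real program, the fmt argument would be the value returned by catgets().

```c
#include <stdio.h>

/* Hypothetical helper: build the ownership message in out.
 * The argument order is fixed; only the format string decides
 * which argument appears first in the result. */
static int format_owner_msg(char *out, size_t size, const char *fmt,
                            const char *folder_name, const char *user_name)
{
    return snprintf(out, size, fmt, folder_name, user_name);
}
```

With folder_name set to "JULY_REVENUE" and user_name set to "John_Smith", the format "%1$s is owned by %2$s\n" yields "JULY_REVENUE is owned by John_Smith" and the format "%2$s owns %1$s\n" yields "John_Smith owns JULY_REVENUE". The program source does not change.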

2.3.3 Scanning Input Text

The string construction issues that are discussed for output text in Section 2.3.2 also apply to input text. For example, countries differ in the order in which users specify the elements of a date and in the characters that users enter to delimit parts of monetary or other numeric strings. Therefore, the scanf() family of functions also supports extended format conversion specifiers to allow for variation in the way that users enter elements of a string.

Consider the following example:


.
.
.
int day;
int month;
int year;

.
.
.
scanf("%d/%d/%d", &month, &day, &year);

.
.
.

The format string in this statement is governed by the assumption that all users use a United States format (mm/dd/yyyy) to input dates. In an internationalized program, you use extended format specifiers to support requirements that language may impose on the order of string elements. For example:


.
.
.
scanf(catgets(catd, NL_SETD, DATE_STRING,
              "%1$d/%2$d/%3$d"), &month, &day, &year);

.
.
.

The default "%1$d/%2$d/%3$d" value for the DATE_STRING message is still appropriate only for countries where users use the format mm/dd/yyyy to enter dates. However, for countries in which the order or formatting would be different, the translator can change the entry in the program's message file. For example:

British English (dd/mm/yyyy):
```
DATE_STRING        "%2$d/%1$d/%3$d"
```

German (dd.mm.yyyy):
```
DATE_STRING        "%2$d.%1$d.%3$d"
```
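The following sketch exercises these entries with sscanf() so that the input can be supplied as a string. The parse_date() helper is hypothetical. In a real program, the fmt argument would come from catgets().

```c
#include <stdio.h>

/* Hypothetical helper: parse a date with a locale-specific format.
 * The pointer arguments always appear in month, day, year order;
 * the n$ positions in fmt select which input field fills each one.
 * Returns the number of fields converted (3 on success). */
static int parse_date(const char *text, const char *fmt,
                      int *month, int *day, int *year)
{
    return sscanf(text, fmt, month, day, year);
}
```

For example, parsing "31.12.1999" with the German entry "%2$d.%1$d.%3$d" stores 12 in month, 31 in day, and 1999 in year.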

2.4 Binding a Locale to the Run-Time Environment

For an internationalized program to operate correctly, it must bind to localized data that is appropriate for the user at run time. The setlocale() function performs this task. You can call setlocale() to:

Bind to locale settings that are already in effect for the user's process

Bind to locale settings controlled by the program

Query current locale settings without changing them

The call takes two arguments: category and locale_name.

The category argument specifies whether you want to query, change, or use all or a specific section of a locale. Values for category and what they represent are as follows:

LC_ALL, all sections of a locale

LC_CTYPE, the locale section that classifies characters

LC_COLLATE, the locale section that specifies character collation order

LC_MESSAGES, the locale section that specifies yes/no responses and program messages

LC_MONETARY, the locale section that specifies special characters used in monetary values

LC_NUMERIC, the locale section that specifies the characters used for decimal point and thousands separator

LC_TIME, the locale section that specifies names and abbreviations for days of the week and months of the year, and other strings and formatting conventions that govern expressions of date and time

The locale_name argument is one of the following values:

An empty string ("") to bind the program at run time to the locale name set for category by the system manager or user

A locale name to change the locale that may already be set for category

NULL to determine the locale name currently set for category
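The following sketch illustrates the empty-string and NULL forms of the locale_name argument. The helper names bind_user_locale() and current_locale() are hypothetical.

```c
#include <locale.h>

/* Bind all categories to the user's environment settings. If the
 * environment names a locale that is not available, setlocale()
 * returns NULL, and this sketch falls back to the "C" locale. */
static const char *bind_user_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}

/* Query the locale name for one category without changing it. */
static const char *current_locale(int category)
{
    return setlocale(category, NULL);
}
```

The string that setlocale() returns can be overwritten by a later setlocale() call, so copy it if the program needs the name afterward.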

2.4.1 Binding to the Locale Set for the System or User

Typically, the system manager or user sets the LANG or LC_ALL environment variable to the name of a locale. Either setting gives every locale category the same locale name: LC_ALL overrides any individual category variables, and LANG supplies the name for categories whose variables are not set. On occasion (if they do not use LC_ALL), system managers or individual users may set locale category variables to different locale names. Usually, internationalized programs contain the following call, which initializes all locale categories in the program to environment variable settings already in effect for the user:

setlocale(LC_ALL, "");

2.4.2 Changing Locales During Program Execution

Some internationalized programs may need to prompt the user for a locale name or change locales during program execution. The following example shows how to call setlocale() when you want to explicitly initialize or reinitialize all locale categories to the same locale name:


.
.
.
nl_catd catd;  [1]
char buf[BUFSIZ];  [2]

.
.
.
setlocale(LC_ALL, "");  [3]
catd = catopen(CAT_NAME, NL_CAT_LOCALE);  [4]

.
.
.
printf(catgets(catd, NL_SETD, LOCALE_PROMPT_MSG,
               "Enter locale name: "));      [5]
if (fgets(buf, sizeof(buf), stdin) != NULL) {  [6]
        buf[strcspn(buf, "\n")] = '\0';
        setlocale(LC_ALL, buf);  [7]
}

.
.
.

[1] Declares a catalog descriptor catd as type nl_catd.

[2] Declares the buf variable into which the locale name will later be stored.
To make sure that the variable is large enough to accommodate locale names on different systems, you should set its maximum size to the BUFSIZ constant, which is defined by the system vendor in /usr/include/stdio.h.

[3] Calls setlocale() to initialize the program's locale settings to those in effect for the user who runs the program.

[4] Calls catopen() to open the message catalog that contains the program's messages; returns the catalog's descriptor to the catd variable.
The CAT_NAME constant is defined in the program's own header file.

[5] Prompts the user for a new locale name.
The NL_SETD constant specifies the default message set number in a message catalog and is defined in /usr/include/nl_types.h. The LOCALE_PROMPT_MSG identifier specifies the prompt string translation in the default message set.

[6] Calls the fgets() function to read the locale name typed by the user into the buf variable without exceeding the size of buf, and then uses the strcspn() function, which is declared in /usr/include/string.h, to remove the trailing newline character.

[7] Calls setlocale() with buf as the locale_name argument to reinitialize all portions of the locale.

Sometimes a program needs to vary the locale only for a particular category of data. For example, consider a program that processes different country-specific files that contain monetary values. Before processing data in each file, the program might reinitialize a program variable to a new locale name and then use that variable value to reset only the LC_MONETARY category of the locale.
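A minimal sketch of that approach follows. The use_monetary_locale() helper is hypothetical, and the locale names that a program passes to it depend on which locales are installed on the system.

```c
#include <locale.h>

/* Hypothetical helper: reset only the LC_MONETARY category.
 * Returns 0 on success, or -1 if locale_name is not available,
 * in which case the current setting is left unchanged. All other
 * categories keep their current settings in either case. */
static int use_monetary_locale(const char *locale_name)
{
    return setlocale(LC_MONETARY, locale_name) != NULL ? 0 : -1;
}
```

The program might call this helper once for each file, passing the locale name associated with that file, and then call localeconv() to obtain the monetary formatting details for the data that follows.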