2 Developing Internationalized Software

This chapter explains how language, codeset, and cultural differences change the way you implement basic coding operations. After reading this chapter, you will be ready to examine a complete application that applies the program development techniques suggested here. Such an application is provided online in the directory /usr/examples/i18n/xpg4demo. Refer to the README document in that directory for an introduction to the application and how you can compile and run it with different locales. Parts of the xpg4demo application are used as examples in this and other chapters.

One of the primary functions of most computer programs is to manipulate data, some or all of which may involve interaction between the program and a computer user. In commercial situations, it is important that such interactions take place in the native language of each user. Cultural data should also observe the correct customs.

When you write programs to support multilanguage operation, you must also consider the fact that languages can be represented within the computer system by one or more codesets. Because of the requirements of different languages, characters in codesets may vary in both size (8-bit, 16-bit, and so on) and binary representation.

You can satisfy the preceding requirements by writing programs that make no hard-coded assumptions about language, cultural data, or character encodings. Such programs are said to be internationalized. Data specific to each supported language, territory, and codeset combination are held separately from the program code and can be bound to the run-time environment by language-initialization functions.

Digital UNIX provides the following facilities for developing internationalized software, defining localization data, and announcing specific language requirements:

Library functions that handle extended character codes and that provide language- and codeset-independent character classification, case conversion, number format conversion, and string collation
Library functions that let programs determine cultural and language-specific data dynamically
A message system that allows program messages to be held apart from the program code, translated into different languages, and retrieved by a program at run time
An initialization function that binds a program at run time to the specified language requirements of each user

The rest of this chapter describes each of these facilities in more detail.

The discussion and examples in this chapter focus on functions provided in the Standard C Library. Refer to Chapter 4 and Chapter 5 for information about using functions in the curses, X, and Motif libraries.

2.1 Using Codesets

In the past, most UNIX systems were based on the 7-bit ASCII codeset. However, most languages other than English include characters in addition to those contained in the ASCII codeset. The X/Open UNIX standard does not require an operating system to supply any particular codesets in addition to ASCII. The standard does, however, specify requirements for the interfaces that manipulate characters, so that programs can handle characters from whatever codeset is available on a given system.

The ISO codesets cover the major European languages. Several of these codesets allow for the mixing of major languages within a single codeset. All ISO codesets are a superset of the ASCII codeset and therefore allow systems to support languages other than English without invalidating existing software that is not internationalized. The Digital UNIX operating system provides locales that use the ISO 8859-1 (Latin 1) and ISO 8859-7 (Latin/Greek) codesets.

Subsets that support localized variants of the operating system may include locales based on additional ISO codesets. For example, the optional language variant subsets included with Digital UNIX to support Czech, Hungarian, Polish, Russian, Slovak, and Slovene provide locales based on the ISO 8859-2 (Latin 2) codeset. Following is a complete list of ISO codesets, along with the languages that they support:

ISO 8859-1, Latin 1

Western European languages, including Catalan

ISO 8859-2, Latin 2

Eastern European languages

ISO 8859-3, Latin 3

Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, and Turkish

ISO 8859-4, Latin 4

Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, and Lithuanian

ISO 8859-5, Latin/Cyrillic

Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croatian, and Ukrainian

ISO 8859-6, Latin/Arabic

Arabic

ISO 8859-7, Latin/Greek

Greek

ISO 8859-8, Latin/Hebrew

Hebrew

ISO 8859-9, Latin 5

Danish, Dutch, English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish

ISO 8859-10, Latin 6

Danish, English, Estonian, Finnish, German, Greenlandic, Icelandic, Sami (Lappish), Latvian, Lithuanian, Norwegian, Faroese, and Swedish

Another ISO codeset supported by utilities on a standard Digital UNIX operating system is ISO 6937:1983. This codeset, which accommodates both 7-bit and 8-bit characters, is used for text communication over communication networks and interchange media, such as magnetic tape and disks.

The codesets discussed up to this point address the requirements of languages whose characters can be stored in a single byte. Such codesets do not meet the needs of Asian languages, whose characters can occupy multiple bytes. Digital UNIX supplies the following codesets through subsets that support Asian languages and countries:

eucJP (Japanese Extended UNIX Code)

SJIS (Shift JIS)

deckanji (DEC Kanji)

sdeckanji (Super DEC Kanji)

deckorean (DEC Korean)

eucKR (Korean Extended UNIX Code)

TACTIS (Thai API Consortium/Thai Industrial Standard)

dechanzi (DEC Hanzi)

dechanyu (DEC Hanyu)

eucTW (Taiwanese Extended UNIX Code)

big5 (BIG-5)

These codesets are supplied when you install Asian-language variant subsets of the Digital UNIX product. A specialized terminal driver and associated utilities must be available on your system to support the input and display of Asian characters at run time. These components are also supplied when you install one of the Asian-language variant subsets.

The Unicode and ISO/IEC 10646 standards specify the Universal Character Set (UCS), a character set that allows character units to be processed for all languages, including Asian languages, by using the same set of rules. Digital UNIX supports the UCS-4 encoding of this character set. An application parses UCS-4 character encoding in 32-bit units.

Reference pages are available for all the character sets that Digital UNIX supports. For more information on a particular character set, refer to its reference page.

The following sections discuss important issues that affect the way you write source code when your program must process characters in different codesets:

Ensuring data transparency

Using in-code literals

Manipulating multibyte characters

Converting between multibyte-character and wide-character data

Rules for multibyte characters

Classifying characters

Converting characters (case)

Comparing strings

2.1.1 Ensuring Data Transparency

As discussed in Section 2.1, internationalized software must accommodate a wide variety of character-encoding schemes. Programs cannot assume that a particular codeset is on all systems that conform to requirements in the X/Open UNIX CAE specifications, nor that individual characters occupy a fixed number of bits.

Another legacy of the historical dependence of UNIX systems on 7-bit ASCII character encoding is that some programs use the most significant bit of a byte for their own internal purposes. This was a dubious programming practice, although quite safe when characters in the underlying codeset always mapped to the remaining 7 bits of the byte. In the world of international codesets, the practice of using the most significant bit of a byte for program purposes must be avoided.
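
The following is a minimal sketch of one way to avoid borrowing the most significant bit. It assumes the program needs one private flag per byte. The marked_byte structure and both function names are hypothetical and are not part of any system library. The flag is kept in its own field, so all 8 bits of the data byte pass through unchanged.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record: the program's flag lives in its own field,
 * so all 8 bits of the data byte remain available to the codeset. */
struct marked_byte {
    unsigned char value;  /* data byte, stored untouched */
    bool marked;          /* program-private flag */
};

/* Store a byte and its flag without altering any bit of the byte. */
struct marked_byte mark_byte(unsigned char value, bool marked)
{
    struct marked_byte m;

    m.value = value;
    m.marked = marked;
    return m;
}

/* Count the bytes in buf that have the most significant bit set.
 * Such bytes are legitimate data in 8-bit and multibyte codesets. */
size_t count_high_bit_bytes(const unsigned char *buf, size_t len)
{
    size_t i;
    size_t count = 0;

    for (i = 0; i < len; i++) {
        if (buf[i] & 0x80)
            count++;
    }
    return count;
}
```

A nonzero result from count_high_bit_bytes() on real input shows that masking with 0x7f would corrupt the data.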

2.1.2 Using In-Code Literals

When writing internationalized software, using in-code literals can cause problems. Consider, for example, the following conditional statement:

if ((c = getchar()) == 0141)

This condition assumes that lowercase a is always represented by a fixed octal value, which may not be true for all codesets. The following statement represents an improvement in that it substitutes a character constant for the octal value:

if ((c = getchar()) == 'a')

This example still presents problems, however, because the getchar() function operates on bytes. The statement would not work correctly if the next character in the input stream were a multibyte value. The following statement substitutes the getwchar() function for the getchar() function. The statement works correctly with any codeset because a is a member of the Portable Character Set and is transformed into the same wide-character value in all locales.

if ((c = getwchar()) == L'a')

The X/Open UNIX standard specifies that each member of the source character set and each escape sequence in character constants and string literals is converted to the same member of the execution character set in all locales. It is therefore safe for you to use any of the characters in the Portable Character Set as a character constant or in string literals. Note that non-English characters are not included in the Portable Character Set and may not translate correctly when used as literals. Consider the following example:

if ((c = getwchar()) == L'à')

The accented character à may not be represented in the codeset's source character set, execution character set, or both; or the binary value of the accented character may not be translatable from one set to the other. When source files specify non-English characters in constants, the results are undefined.

The following example shows how to construct a test for a constant that for whatever reason may be a non-English character. The constant has been defined in a message catalog with the symbolic identifier MSG_ID. Statements in the example retrieve the value for MSG_ID from the message catalog, which is locale specific and bound to the program at run time.


.
.
.

char *schar;      (1)
wchar_t wchar;    (2)

.
.
.

schar = catgets(catd,NL_SETD,MSG_ID,"a");  (3)
if (mbtowc (&wchar,schar,MB_CUR_MAX) == -1)  (4)
        error();
if ((c = getwchar()) == wchar)  (5)

.
.
.

(1) Declares schar as a pointer to char.
(2) Declares the variable wchar to be of type wchar_t.
(3) Calls the catgets() function to retrieve the value of MSG_ID from the message catalog for the user's locale.
The catgets() function returns the value as an array of bytes, so the result is assigned to the schar variable. If the accented character is not available in the locale's codeset, the test is made against the unaccented base character (a).
(4) Because the locale's codeset may contain multibyte characters, calls mbtowc() to test whether the value in schar is a valid multibyte character; if it is, mbtowc() converts it to a wide-character value and stores the result in the variable wchar.
If schar does not contain a valid multibyte character, signals an error.
(5) Codes the conditional statement to include the value contained in wchar as the constant.

Refer to Chapter 3 for more information about message catalogs and the catgets() function. See Section 2.1.4 for information about converting multibyte characters and strings to wide-character data that your program can process.

2.1.3 Manipulating Multibyte Characters

Digital UNIX provides all the interfaces (such as putwc(), getwc(), fputws(), and fgetws()) that are needed to support multibyte Asian codesets. Language variant subsets of the operating system must be installed to supply the locales and facilities that make this support operational. On systems where multibyte locales are not available, or are available and not bound to the program at run time, the *ws* and *wc* functions are merely synonyms for the associated single-byte functions (such as putc(), getc(), fputs(), and fgets()). The interfaces provided for multibyte support are therefore appropriate for use with all locales, not just those with multibyte characters.
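
As an illustration of the same idea, the following sketch counts characters rather than bytes on a stream. The function name count_stream_chars is hypothetical. The sketch uses fgetwc(), the function form of getwc(), and assumes that the stream has not already been used with byte-oriented calls such as getc(), because byte and wide-character operations should not be mixed on the same stream.

```c
#include <stdio.h>
#include <wchar.h>

/* Count the characters (not bytes) remaining on a stream.
 * fgetwc() decodes each character according to the LC_CTYPE category
 * of the current locale. Returns -1 on a read or conversion error. */
long count_stream_chars(FILE *fp)
{
    long count = 0;

    while (fgetwc(fp) != WEOF)
        count++;
    return ferror(fp) ? -1 : count;
}
```

In a single-byte locale the result equals the byte count; in a multibyte locale it can be smaller.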

2.1.4 Converting Between Multibyte-Character and Wide-Character Data

Some languages, particularly Asian languages, can be encoded as either multibyte-character or wide-character data.

Multibyte encoding is typically the encoding used when data is stored in a file or generated for external use or data interchange. Multibyte encoding has the following disadvantages:

Multibyte characters are not represented by a fixed number of bytes per character, even in the same codeset, so the size of a character in a multibyte data record can vary from one character to the next.

The parsing rules for retrieving character codes from a multibyte data record are locale dependent.

Because of the disadvantages of multibyte encoding, wide-character encoding, which allocates a fixed number of bytes per character, is typically used for internal processing by programs; in fact, internal process code is another way of referring to data in wide-character format. The size of a wide character varies from one system implementation to another. On Digital UNIX systems, the default size for a wide character is set to 4 bytes (32 bits), a setting that optimizes performance for the Alpha processor.
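
The variable character size of multibyte data means that a byte count is not a character count. The following sketch, with the hypothetical name count_mb_chars, shows one way to walk a multibyte string by asking the current locale how many bytes each character occupies. It uses the restartable mbrlen() function so that shift state is tracked in a local mbstate_t object.

```c
#include <string.h>
#include <wchar.h>

/* Count the characters in a multibyte string by asking the locale how
 * many bytes each character occupies. Returns -1 if the string contains
 * a byte sequence that is invalid or incomplete in the current locale. */
long count_mb_chars(const char *s)
{
    mbstate_t state;
    size_t remaining = strlen(s);
    long count = 0;

    memset(&state, 0, sizeof state);   /* start in the initial shift state */
    while (remaining > 0) {
        size_t n = mbrlen(s, remaining, &state);

        if (n == (size_t)-1 || n == (size_t)-2)
            return -1;                 /* invalid or truncated character */
        if (n == 0)
            break;                     /* null character reached */
        s += n;
        remaining -= n;
        count++;
    }
    return count;
}
```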

Library routines that print, scan, input, or output text have the capability of automatically converting data from multibyte characters to wide characters or from wide characters to multibyte characters, as appropriate for the operation. However, applications almost always have additional statements or requirements for which conversion to and from multibyte characters needs to be explicit.

The following example is from a program module that reads records from a database of employee data. In this case, the programmer wants to apply a locale-independent format to a record retrieved from a data file and uses the mbstowcs() function to explicitly convert an employee's first and last names from multibyte-character to wide-character encoding.

/*
 * The employee record is normalized with the following format, which
 * is locale independent:  Badge number, First Name, Surname,
 * Cost Center, Date of Join in the `yy/mm/dd' format. Each field is
 * separated by a TAB. The space character is allowed in the First
 * Name and Surname fields.
 */
static const char *dbOutFormat = "%ld\t%S\t%S\t%S\t%02d/%02d/%02d\n";
static const char *dbInFormat = "%ld %[^\t] %[^\t] %S %02d/%02d/%02d\n";

.
.
.

            sscanf(record, dbInFormat,
                   &emp->badge_num,
                   firstname,
                   surname,
                   emp->cost_center,
                   &emp->date_of_join.tm_year,
                   &emp->date_of_join.tm_mon,
                   &emp->date_of_join.tm_mday);
            (void) mbstowcs(emp->first_name, firstname, FIRSTNAME_MAX+1);
            (void) mbstowcs(emp->surname, surname, SURNAME_MAX+1);

.
.
.

Refer to Section A.9 for a complete list of functions that work with multibyte data directly.
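
The preceding example discards the value returned by mbstowcs(). The following sketch shows one way to check it. The wrapper name mb_to_wide and its return convention are assumptions made for illustration; only mbstowcs() itself is a library function.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string to wide characters, checking the result
 * instead of discarding it. dst must have room for dst_len wide
 * characters, including the terminating null. Returns the number of
 * characters converted, or -1 if src is invalid or does not fit.
 * On failure, dst is set to the empty string. */
long mb_to_wide(wchar_t *dst, size_t dst_len, const char *src)
{
    size_t n;

    if (dst_len == 0)
        return -1;
    n = mbstowcs(dst, src, dst_len);
    if (n == (size_t)-1 || n == dst_len) {
        /* (size_t)-1: invalid multibyte sequence in src.
         * n == dst_len: no room was left for the terminating null. */
        dst[0] = L'\0';
        return -1;
    }
    return (long)n;
}
```

A caller such as the record-reading code above could reject a record when the function returns -1 rather than store a truncated or unterminated name.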

2.1.5 Rules for Multibyte Characters in Source and Execution Codesets

You should be aware that both the source and execution character set variants of the same codeset can contain multibyte characters. The encodings do not have to be the same, but both set variants must observe the following rules:

The characters defined in the Portable Character Set must be present in both sets.

The existence, meaning, and encoding of any additional members are locale specific.

A character may have a state-dependent encoding. A string of characters may contain a shift-state character that affects the system's interpretation of the following bytes until another shift-state character is encountered.

While in the initial shift state, all characters from the basic character set retain their usual interpretation and do not alter the shift state.

The interpretation for subsequent bytes in the sequence is a function of the current shift state.

A byte with all bits set to zero is interpreted as a null character, independent of the shift state.

A byte with all bits zero must not occur in the second or subsequent bytes of a multibyte character.

The source variant of a codeset must observe the following additional rules:

A comment, string literal, character constant, or header name must begin and end in the initial shift state.

A comment, string literal, character constant, or header name must consist of a sequence of valid multibyte characters.

The C language compiler also supports trigraph sequences when you specify the -std1 or -std flag on the cc command line. Trigraph sequences, which are part of the ANSI C specification, allow users to enter the full range of basic characters in programs, even if their keyboards do not support all characters in the source codeset. The following trigraph sequences are currently defined, each of which is replaced by the corresponding single character:

Trigraph Sequence	Single Character
`??=`	`#`
`??(`	`[`
`??/`	`\`
`??'`	`^`
`??<`	`{`
`??)`	`]`
`??!`	`|`
`??>`	`}`
`??-`	`~`

2.1.6 Classifying Characters

Another feature of program operation that depends on the codeset is character classification; that is, determining whether a particular character code refers to an uppercase alphabetic, lowercase alphabetic, digit, punctuation, control, or space character.

In the past, many programs classified characters according to whether the character's value fell between certain numerical limits. For example, the following statement tests for all uppercase alphabetic characters:

if (c >= 'A' && c <= 'Z')

This statement is valid for the ASCII codeset, in which all uppercase letters have values in the range 0x41 to 0x5a (A to Z). However, the statement is not valid for the codeset ISO 8859-1:1987, in which uppercase letters occupy the ranges 0x41 to 0x5a, 0xc0 to 0xd6, and 0xd8 to 0xde. In the EBCDIC codeset, character values are different again and, in this case, even the uppercase English letters have a different encoding.

When you write internationalized programs, classify characters by calling the appropriate internationalization function. For example:

if (iswupper (c))

Internationalization functions classify wide-character code values according to type information in the user's locale and are independent of the language and codeset. Refer to Section A.2 for a complete list and description of character classification functions.
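
As a short sketch of this approach, the following function counts uppercase letters in a wide-character string without referring to any code range. The function name count_upper is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Count the uppercase letters in a wide-character string, using the
 * classification data of the current locale rather than code ranges. */
size_t count_upper(const wchar_t *s)
{
    size_t count = 0;

    for (; *s != L'\0'; s++) {
        if (iswupper((wint_t)*s))
            count++;
    }
    return count;
}
```

Which characters beyond A to Z are counted depends on the locale bound to the program at run time.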

2.1.7 Converting Characters

You can do case conversion of ASCII characters with statements like the following ones, which convert the character in a_var first to lowercase and then to uppercase:

a_var |= 0x20;

.
.
.

a_var &= 0xdf;

The preceding statements are not safe to use in internationalized programs because they:

Assume ASCII-coded character values
Do not check for valid input

The correct way to handle case conversion is to call the towlower() function for conversion to lowercase and the towupper() function for conversion to uppercase. For example:

a_var = towlower(a_var);

.
.
.

a_var = towupper(a_var);

These functions use information specified in the user's locale and are independent of the codeset where characters are defined. The functions return the argument unchanged if input is invalid. Refer to Section A.3 for more detailed discussion of case conversion functions.
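
The following sketch applies towupper() to a whole string. The function name upcase_wide is hypothetical. Characters that have no uppercase form in the current locale, such as digits and punctuation, are left unchanged.

```c
#include <wchar.h>
#include <wctype.h>

/* Convert a wide-character string to uppercase in place, using the
 * case mappings of the current locale. Returns its argument. */
wchar_t *upcase_wide(wchar_t *s)
{
    wchar_t *p;

    for (p = s; *p != L'\0'; p++)
        *p = (wchar_t)towupper((wint_t)*p);
    return s;
}
```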

2.1.8 Comparing Strings

UNIX systems have always provided functions for comparing character strings. The following statement, for example, compares the strings s1 and s2, returning an integer greater than, equal to, or less than zero, depending on whether the value of s1 is greater than, equal to, or less than the value of s2 in the machine-collating sequence:


.
.
.

int cmp_val;
char *s1;
char *s2;

.
.
.

cmp_val = strcmp(s1, s2);

.
.
.

Certain languages, however, require collation algorithms that must make multiple passes through the codeset. Multiple passes may be required for the following reasons:

Ordering accented characters within a particular character class for a language (for example, a, á, à, and so on)

Collating certain multiple character sequences as a single character (for example, the Welsh character ch, which collates after c and before d)

Collating certain single characters as a 2-character sequence (for example, the German character sharp s, which collates as ss)

Ignoring certain characters during collation (for example, hyphens in dictionary words)

String comparison in an international environment thus depends on the codeset and language. This dependency means that additional functions are required to compare strings according to collating sequence information in the user's locale. These functions include:

wcscoll(), which performs the same operation as strcmp(), except that it operates on wide characters and uses locale-specific collating information

wcsxfrm(), which transforms a wide-character string by using collating sequence information in the user's locale so that the resulting string can be compared using the wcscmp() function

If two strings are being compared only for equality, you can use strcmp() or wcscmp(), which are faster in most environments than wcscoll().
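
The following sketch shows one way to sort an array of wide-character strings in locale order by passing a wcscoll()-based callback to qsort(). The names compare_by_locale and sort_by_locale are hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* qsort() comparison callback: order two wide strings by the collating
 * sequence of the current locale (LC_COLLATE category). */
static int compare_by_locale(const void *a, const void *b)
{
    const wchar_t *const *sa = a;
    const wchar_t *const *sb = b;

    return wcscoll(*sa, *sb);
}

/* Sort an array of wide-string pointers into locale collation order. */
void sort_by_locale(const wchar_t **words, size_t count)
{
    qsort(words, count, sizeof words[0], compare_by_locale);
}
```

When the same strings are compared many times, it may be cheaper to transform each string once with wcsxfrm() and compare the transformed strings with wcscmp().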

2.2 Handling Cultural Data

Cultural data refers to items of information that can vary between languages or territories. For example:

In the United Kingdom and the United States, a period represents the radix character and a comma represents the thousands separator in decimal numbers. In Germany, the same two characters are used in decimal numbers with exactly the opposite meaning.

In the United States, the date October 7, 1986 is represented as 10/7/86, whereas in the United Kingdom, the same date is represented as 7/10/86. This example indicates that cultural data items can vary when the same language is spoken.

Date delimiters, as well as the order of year, month, and day, can vary among countries. In Germany, for example, the date October 7, 1986 is represented as 7.10.86 rather than as 7/10/86.

Currency symbols can vary both in terms of the characters used and where they are placed in a currency value; that is, currency symbols can precede, follow, or be embedded in the value.

You cannot make assumptions about cultural data when writing internationalized programs. Your program must operate according to the local customs of users. The X/Open UNIX standard specifies that this requirement be met through a database of cultural data items that a program can access at run time, plus a set of associated interfaces. The following sections discuss this database and the functions used to extract and process its data items.

2.2.1 The langinfo Database

The language information database, named langinfo, contains items that represent the cultural details of each locale supported on the system. The langinfo database on Digital UNIX systems contains the following information for each locale, as required by the X/Open UNIX standard:

Codeset name
Date and time formats
Names of the days of the week
Names of the months of the year
Abbreviations for names of days
Abbreviations for names of months
Radix character
Thousands separator character
Affirmative and negative responses for yes/no queries
Currency symbol and its position within a currency value
Emperor/Era name and year (for Japanese locales)

2.2.2 Querying the langinfo Database

You can extract cultural data items from the langinfo database by calling the nl_langinfo() function. This function takes an item argument that is one of several constants defined in the header file /usr/include/langinfo.h. The function returns a pointer to the string with the associated value for item. The following example shows a call to nl_langinfo() that extracts the string for formatting date and time information. This value is associated with the constant D_T_FMT.

nl_langinfo(D_T_FMT);
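
The string that nl_langinfo() returns belongs to the library and may be overwritten by a later call to nl_langinfo() or setlocale(). The following sketch, with the hypothetical name copy_langinfo, copies the value into storage owned by the caller. RADIXCHAR and D_T_FMT are item constants from langinfo.h.

```c
#include <langinfo.h>
#include <stdio.h>

/* Copy a langinfo item into caller-owned storage, because the string
 * returned by nl_langinfo() may be overwritten by later calls.
 * Returns 0 on success, -1 if the value did not fit in buf. */
int copy_langinfo(nl_item item, char *buf, size_t size)
{
    int n = snprintf(buf, size, "%s", nl_langinfo(item));

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, copy_langinfo(RADIXCHAR, buf, sizeof buf) stores the radix character of the current locale in buf.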

2.2.3 Generating and Interpreting Date and Time Strings That Observe Local Customs

Programs often generate date and time strings. Internationalized programs generate strings that observe the local customs of the user. You can meet this requirement by calling the strftime() function, which makes indirect use of the langinfo database.

In the following example, the strftime() function generates a date string as defined by the D_FMT item in the langinfo database:


.
.
.

setlocale(LC_ALL, "");  (1)

.
.
.

clock = time((time_t*)NULL);  (2)
tm = localtime(&clock);  (3)

.
.
.

strftime(buf, size, "%x", tm);  (4)
puts(buf);  (5)

.
.
.

(1) Binds the program at run time to the locale set for the system or individual user.
(2) Calls the time() subroutine to return the time value, relative to Coordinated Universal Time, to the clock variable.
(3) Calls the localtime() function to convert the value contained in clock to a value that can be stored in a tm structure, whose members represent values for year, month, day, hour, minute, and so forth.
(4) Calls strftime() to generate a date string formatted as defined in the user's locale from the value contained in the tm structure.
The buf argument is a pointer to a string variable in which the date string is returned. The size argument contains the maximum size of buf. The "%x" argument specifies conversion specifications, similar to the format strings used with the printf() and scanf() functions. The "%x" argument is replaced in the output string by the representation appropriate for the locale.
(5) Calls the puts() function to copy the string contained in buf to the standard output stream (stdout) and to append a newline character.

The following example shows how to use strftime() and nl_langinfo() in combination to generate a date and time string. Assume that the same calls to the setlocale(), time(), and localtime() interfaces have been made here as shown in the preceding example. The only difference is that a call to nl_langinfo() has replaced the format string argument in the call to strftime():


.
.
.

strftime(buf, size, nl_langinfo(D_T_FMT), tm);
puts(buf);

.
.
.

To convert a string to a date/time value, the reverse of the operation performed by strftime(), you can use the strptime() function. The strptime() function supports a number of conversion specifiers that behave in a locale-dependent manner.
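
The strftime() calls shown earlier in this section ignore the return value. Because the length of a localized date string varies from locale to locale, it is worth checking. The following sketch uses the hypothetical wrapper name format_local_date and assumes that the caller has already called setlocale(LC_ALL, "") at startup; without that call the program runs in the "C" locale.

```c
#include <stddef.h>
#include <time.h>

/* Format the date in *tm using the locale's preferred date
 * representation (the "%x" conversion). Returns the number of bytes
 * written, not counting the terminating null, or 0 if the result did
 * not fit in buf, in which case the contents of buf are unspecified. */
size_t format_local_date(char *buf, size_t size, const struct tm *tm)
{
    return strftime(buf, size, "%x", tm);
}
```

A caller can treat a return value of 0 as a signal to retry with a larger buffer rather than print whatever buf contains.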

2.2.4 Formatting Monetary Values

The strfmon() function formats monetary values according to information in the locale that is bound to the program at run time. For example:

strfmon(buf, size, "%n", value);

This statement formats the double-precision floating-point value contained in the variable value. The argument "%n" is the format specification that is replaced by the format defined in the run-time locale. The results are returned to the array buf, whose maximum length is contained in the variable size.

The money program demonstrates how the strfmon() function works. The source file for this sample program is available in the /usr/i18n/examples/money directory.
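
Independently of that sample program, the following sketch shows one way to wrap the call and check its result. The function name format_money is hypothetical. The strfmon() function returns -1 when it fails, for example when the formatted value does not fit in the buffer.

```c
#include <monetary.h>
#include <stddef.h>
#include <sys/types.h>

/* Format amount as a local-currency string using the "%n" conversion.
 * Returns 0 on success, or -1 if strfmon() failed, for example
 * because buf was too small. */
int format_money(char *buf, size_t size, double amount)
{
    ssize_t n = strfmon(buf, size, "%n", amount);

    return (n < 0) ? -1 : 0;
}
```

The currency symbol, its position, and the separators in the result all come from the LC_MONETARY category of the run-time locale.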

2.2.5 Formatting Numeric Values in Program-Specific Ways

You may want to perform your own conversions of numeric quantities, monetary or otherwise, by using specific formatting details in the user's locale. The localeconv() function, which has no arguments, returns a pointer to a structure that contains all the number formatting details defined in the locale. Your program declares a pointer to receive this value. For example:

struct lconv *app_conv;

You can use the following features, which are contained in the structure lconv, in program-defined routines:

Radix character
Thousands separator character
Digit grouping size
International currency symbol
Local currency symbol
Radix character for monetary values
Thousands separator for monetary values
Digit grouping size for monetary values
Positive sign
Negative sign
Number of fractional digits to be displayed
Parenthesis symbols for negative monetary values
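
The structure that localeconv() points to may be overwritten by a later call to localeconv() or setlocale(). The following sketch copies two of its members into storage owned by the program. The number_style structure and the get_number_style name are hypothetical; decimal_point and thousands_sep are members of struct lconv.

```c
#include <locale.h>
#include <stdio.h>

/* Caller-owned copy of two lconv members (hypothetical type). */
struct number_style {
    char decimal_point[8];
    char thousands_sep[8];
};

/* Copy the radix character and thousands separator of the current
 * locale into *out. Returns 0 on success, -1 if either string is too
 * long for the fixed-size fields. */
int get_number_style(struct number_style *out)
{
    const struct lconv *lc = localeconv();
    int a = snprintf(out->decimal_point, sizeof out->decimal_point,
                     "%s", lc->decimal_point);
    int b = snprintf(out->thousands_sep, sizeof out->thousands_sep,
                     "%s", lc->thousands_sep);

    if (a < 0 || (size_t)a >= sizeof out->decimal_point)
        return -1;
    if (b < 0 || (size_t)b >= sizeof out->thousands_sep)
        return -1;
    return 0;
}
```

An empty thousands_sep string means that the locale does not group digits.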

2.2.6 Using the langinfo Database for Other Tasks

Functions in addition to the ones discussed so far use the langinfo database to determine settings for specific items of cultural data. For example, the scanf(), printf(), and wcstod() functions determine the appropriate radix character from information in the langinfo database.

2.3 Handling Text Presentation and Input

The language of the program user affects:

The way program messages are defined and accessed
How the program presents output text
How the program processes input text

These considerations are discussed in the following sections.

2.3.1 Creating and Using Messages

Programs need to communicate with users in their own language. This requirement places some constraints on the way program messages are defined and accessed. More specifically, messages are defined in a file that is independent of the program source code and are not compiled into object files. Because messages are in a separate file, they can be translated into different languages and stored in a form that is linked to the program at run time. Programs can then retrieve message text translations that are appropriate for the user's language.

The X/Open UNIX standard specifies a native-language message system that contains a definition of message text source files, the gencat command to generate message catalogs from these source files, and a set of Standard C Library functions to retrieve individual messages from one or more catalogs at run time.

The following example shows how an internationalized program retrieves a message from a catalog:

#include <stdio.h>     (1)
#include <locale.h>    (2)
#include <nl_types.h>  (3)
#include "prog_msg.h"      (4)

int main(void)
{
      nl_catd catd;  (5)

      setlocale(LC_ALL, "");  (6)
      catd = catopen("prog.cat", NL_CAT_LOCALE);  (7)
      puts(catgets(catd, SETN, HELLO_MSG, "Hello, world!")); (8)
      catclose(catd);  (9)
      return 0;
}

.
.
.

(1) Includes the header file /usr/include/stdio.h, which declares the puts() function and other standard I/O interfaces.

(2) Includes the header file /usr/include/locale.h, which declares the setlocale() function and associated constants and variables.

(3) Includes the header file /usr/include/nl_types.h, which declares the catopen(), catgets(), and catclose() functions.

(4) Includes the program-specific header file, prog_msg.h, which sets constants to identify the message set (SETN) and specific messages (HELLO_MSG being one) that are used by this program module.

A message catalog can contain one or more message sets and individual messages are ordered within each set.

(5) Declares a message catalog descriptor catd to be of type nl_catd.

This descriptor is returned by the function that opens the catalog. The descriptor is also passed as an argument to the function that closes the catalog.

(6) Calls the setlocale() function to bind the program to settings for the user's locale environment variables.
The locale name set for the LC_MESSAGES variable is the locale used by the catopen() and catgets() functions in this example. Typically, the system manager or user sets only the LANG environment variable to a particular locale name and the same locale name is used for LC_MESSAGES.

(7) Calls the catopen() function to open the message catalog named prog.cat for use by this program.
The NL_CAT_LOCALE argument specifies that the program will use the locale name set for LC_MESSAGES. The catopen() function uses the value set for the NLSPATH environment variable to determine the location of the message catalog. The call returns the message catalog descriptor to the catd variable.

(8) Calls the puts() function to display the message.

The first argument to this call is a call to the catgets() function, which retrieves the appropriate text for the message with the HELLO_MSG identifier. This message is contained in the message set identified by the SETN constant. Note that the catgets() function allows one message translation to be held within the program source. This is the translation that will be used in the event that the program cannot get the message from the message catalog.

(9) Calls the catclose() function to close the message catalog whose descriptor is contained in the catd variable.

Refer to Chapter 3 for information about creating and using message catalogs.
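
The fallback behavior described for catgets() can be sketched in isolation. The following is a minimal illustration, not code from xpg4demo: the helper name get_message() and its parameters are hypothetical, and the sketch assumes an X/Open-conforming nl_types.h. When catopen() fails, the invalid descriptor is still passed to catgets(), which then returns the default text held in the program source.

```c
#include <nl_types.h>
#include <stdio.h>

/*
 * Hypothetical helper (not part of xpg4demo): copy one message into out.
 * Returns 1 if the catalog was opened, 0 if it could not be opened and
 * the default text from the program source was used instead.
 */
int get_message(const char *catalog_name, int set_id, int msg_id,
                const char *default_text, char *out, size_t out_size)
{
    nl_catd catd = catopen(catalog_name, NL_CAT_LOCALE);
    int opened = (catd != (nl_catd)-1);

    /* If the message cannot be retrieved, catgets() returns default_text. */
    const char *text = catgets(catd, set_id, msg_id, default_text);

    /* Copy before closing: the pointer may refer to catalog storage. */
    snprintf(out, out_size, "%s", text);

    if (opened)
        catclose(catd);
    return opened;
}
```

Copying the text before calling catclose() is a deliberate choice in this sketch, because the string returned by catgets() should not be relied on after the catalog is closed.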

2.3.2 Formatting Output Text

Successful translation of messages into different languages depends not only on making messages independent of the program source code but also on careful construction of message strings within the program.

Consider the following example:

printf(catgets(catd, set_id, WRONG_OWNER_MSG,
               "%s is owned by %s\n"),
               folder_name, user_name);

The preceding statement uses a message catalog but assumes a particular language construction (a noun followed by a verb in passive voice followed by a noun). Passive-verb constructions are not part of all languages; therefore, message translation might mean printing user_name before folder_name. In other words, the translator might need to change the construction of the message so that the user sees the translated equivalent of "John_Smith owns JULY_REVENUE" rather than "JULY_REVENUE is owned by John_Smith."

To overcome the problems imposed by fixed ordering of message elements, the format specifiers for the printf() routine have been extended so that format conversion applies to the nth argument in an argument list rather than to the next unused argument. To apply the format conversion extension, replace the % conversion character with the sequence %digit$, where digit specifies the position of the argument in the argument list. The following example illustrates how the programmer applies this feature to the format string "%s is owned by %s\n":

printf(catgets(catd, set_id, WRONG_OWNER_MSG,
               "%1$s is owned by %2$s\n"),
               folder_name, user_name);

The construction of the string "%1$s is owned by %2$s\n", which is the default value for the WRONG_OWNER_MSG entry in the program's message file, can then be changed by the translator to the non-English equivalent of:

WRONG_OWNER_MSG        "%2$s owns %1$s\n"
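
The effect of the translator's change can be tried without a message catalog. The sketch below assumes a printf() implementation that supports the %digit$ extension (an X/Open feature, not ISO C). The helper name format_owner() is hypothetical, and the format argument stands in for the string that catgets() would return.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the ownership message from a format string
 * that would normally be obtained from catgets(). The argument order in
 * the call never changes; only the format string decides the output order.
 */
int format_owner(char *out, size_t out_size, const char *format,
                 const char *folder_name, const char *user_name)
{
    return snprintf(out, out_size, format, folder_name, user_name);
}
```

Calling format_owner() with "%1$s is owned by %2$s\n" produces "JULY_REVENUE is owned by John_Smith", while the translated form "%2$s owns %1$s\n" produces "John_Smith owns JULY_REVENUE" from the same two arguments.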

2.3.3 Scanning Input Text

The string construction issues that are discussed for output text in Section 2.3.2 also apply to input text. For example, different countries have different conventions for the order in which users specify the elements of a date and for the characters that delimit parts of monetary or other numeric strings. Therefore, the scanf() family of functions also supports extended format conversion specifiers to allow for variation in the way that users enter elements of a string.

Consider the following example:


.
.
.

int day;
int month;
int year;

.
.
.

scanf("%d/%d/%d", &month, &day, &year);

.
.
.

The format string in this statement is governed by the assumption that all users use a United States English format (mm/dd/yy) to input dates. In an internationalized program, you use extended format specifiers to support requirements that language may impose on the order of string elements. For example:


.
.
.

scanf(catgets(catd, NL_SETD, DATE_STRING,
              "%1$d/%2$d/%3$d"), &month, &day, &year);

.
.
.

The default "%1$d/%2$d/%3$d" value for the DATE_STRING message is still appropriate only for countries where users use the format mm/dd/yy to enter dates. However, for countries in which the order or formatting would be different, the translator can change the entry in the program's message file. For example:

British English (dd/mm/yy):

DATE_STRING        "%2$d/%1$d/%3$d"

German (dd.mm.yy):

DATE_STRING        "%2$d.%1$d.%3$d"
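
The following sketch shows how the same argument list can serve all three date formats. It assumes a scanf() implementation that supports the %digit$ extension. The helper name parse_date() is hypothetical, and the format argument stands in for the DATE_STRING value that catgets() would return.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse a date with a locale-specific format string.
 * The pointers are always passed in the order month, day, year; the
 * positional specifiers in the format decide which input field fills
 * which variable. Returns the number of fields converted (3 on success).
 */
int parse_date(const char *input, const char *format,
               int *month, int *day, int *year)
{
    return sscanf(input, format, month, day, year);
}
```

With the German format "%2$d.%1$d.%3$d", the input 25.12.97 stores 25 in day and 12 in month, which is the same result that the default format gives for the input 12/25/97.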

2.4 Binding a Locale to the Run-Time Environment

For an internationalized program to operate correctly, it must bind to localized data that is appropriate for the user at run time. The setlocale() function performs this task. You can call setlocale() to:

Bind to locale settings that are already in effect for the user's process

Bind to locale settings controlled by the program

Query current locale settings without changing them

The call takes two arguments: category and locale_name.

The category argument can be one of the following:

LC_ALL to use, change, or query all portions of the user's locale
LC_CTYPE to use, change, or query the portion of the user's locale that classifies characters
LC_COLLATE to use, change, or query the portion of the user's locale that specifies character collation order
LC_MESSAGES to use, change, or query the portion of the user's locale that specifies yes/no responses and program messages
LC_MONETARY to use, change, or query the portion of the user's locale that specifies special characters used in monetary values
LC_NUMERIC to use, change, or query the portion of the user's locale that specifies the characters used for decimal point and thousands separator
LC_TIME to use, change, or query the portion of the user's locale that specifies names and abbreviations for days of the week and months of the year, and other strings and formatting conventions that govern expressions of date and time

The locale_name argument is one of the following values:

An empty string ("") to bind the program at run time to the locale name set for category by the system manager or user

A locale name to change the locale that may already be set for category

NULL to find out the locale name currently set for category
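
The query form of the call can be sketched as follows. The helper name current_locale_name() is hypothetical. Passing NULL as the locale_name argument returns the name currently set for the category and leaves the setting unchanged.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: report the locale name currently set for a
 * category without changing it. The returned string may be overwritten
 * by a later setlocale() call, so copy it if it must be kept.
 */
const char *current_locale_name(int category)
{
    const char *name = setlocale(category, NULL);
    return name != NULL ? name : "(unknown)";
}
```

For example, after setlocale(LC_ALL, "C") has been called, current_locale_name(LC_NUMERIC) is expected to return "C".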

2.4.1 Binding to the Locale Set for the System or User

Typically, the system manager or user sets the LANG environment variable to the name of a locale; setting the LANG variable automatically sets all portions, or categories, of the locale to the same locale name. On occasion, system managers or individual users may set different locale categories to different locale names. Usually, internationalized programs contain the following call, which initializes all locale categories in the program to settings already in effect for the user:

setlocale(LC_ALL, "");
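
The setlocale() function returns NULL when it cannot honor a request, for example when an environment variable names a locale that is not installed. A program might therefore check the result of this call. The following is one possible sketch; the helper name bind_user_locale() and the choice to fall back to the "C" locale are assumptions made for illustration, not a requirement of the interface.

```c
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: bind to the locale settings in the user's
 * environment. If they cannot be honored, fall back to the "C" locale
 * so that the program still runs with well-defined settings.
 * Returns the locale name in effect.
 */
const char *bind_user_locale(void)
{
    const char *name = setlocale(LC_ALL, "");

    if (name == NULL)
        name = setlocale(LC_ALL, "C");
    return name;
}
```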

2.4.2 Changing Locales During Program Execution

Some internationalized programs may need to prompt the user for a locale name or change locales during program execution. The following example shows how to call setlocale() when you want to explicitly initialize or reinitialize all locale categories to the same locale name:


.
.
.

nl_catd catd;  (1)
char buf[BUFSIZ];  (2)

.
.
.

setlocale(LC_ALL, "");  (3)
catd = catopen(CAT_NAME, 0);  (4)

.
.
.

printf(catgets(catd, NL_SETD, LOCALE_PROMPT_MSG,
               "Enter locale name: "));      (5)
gets(buf);  (6)
setlocale(LC_ALL, buf);  (7)

.
.
.

(1) Declares a catalog descriptor catd as type nl_catd.
(2) Declares the buf variable into which the locale name will later be stored. To make sure that the variable is large enough to accommodate locale names on different systems, you should set its maximum size to the constant BUFSIZ, which is defined by the system vendor in /usr/include/stdio.h.
(3) Calls setlocale() to initialize the program's locale settings to those in effect for the user who runs the program.
(4) Calls catopen() to open the message catalog that contains the program's messages; returns the catalog's descriptor to the catd variable. The CAT_NAME constant is defined in the program's own header file.
(5) Prompts the user for a new locale name. The NL_SETD constant specifies the default message set number in a message catalog and is defined in /usr/include/nl_types.h. The identifier LOCALE_PROMPT_MSG specifies the prompt string translation in the default message set.
(6) Calls the gets() function to read the locale name typed by the user into the buf variable.
(7) Calls setlocale() with buf as the locale_name argument to reinitialize all portions of the locale.
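
The gets() function does not limit the number of characters that it stores, so a long input line can overflow buf. A bounded variant of the read-and-switch steps might look like the following sketch. The helper names trim_newline(), read_locale_name(), and change_locale() are hypothetical. The sketch uses fgets(), which keeps the trailing newline, so the newline is removed before the name is passed to setlocale().

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: cut the line at its first newline, if any. */
void trim_newline(char *line)
{
    line[strcspn(line, "\n")] = '\0';
}

/*
 * Hypothetical helper: read one line of at most buf_size - 1 characters
 * from in. Returns 0 on success, -1 on end of file or read error.
 */
int read_locale_name(FILE *in, char *buf, size_t buf_size)
{
    if (fgets(buf, (int)buf_size, in) == NULL)
        return -1;
    trim_newline(buf);
    return 0;
}

/*
 * Hypothetical helper: reinitialize all locale categories.
 * Returns 0 if the locale was changed, -1 if the name was rejected.
 */
int change_locale(const char *locale_name)
{
    return setlocale(LC_ALL, locale_name) != NULL ? 0 : -1;
}
```

Note that if the user enters an empty line, the name passed to setlocale() is an empty string, which rebinds the program to the settings in the user's environment rather than selecting a new locale.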

Sometimes a program needs to vary the locale only for a particular category of data. For example, consider a program that processes different country-specific files that contain monetary values. Before processing data in each file, the program might reinitialize a program variable to a new locale name and then use that variable value to reset only the LC_MONETARY category of the locale.
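
A minimal sketch of this single-category approach follows. The helper name currency_symbol_for() is hypothetical, and the 256-byte buffer for the saved locale name is an assumed size. The sketch resets only LC_MONETARY, reads one monetary value through localeconv(), and then restores the previous setting so that the other categories and the caller's monetary setting are unaffected.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Hypothetical helper: switch only LC_MONETARY to locale_name, copy that
 * locale's international currency symbol into out, then restore the
 * previous LC_MONETARY setting. Returns 0 on success, -1 if the locale
 * name was rejected (in which case nothing is changed).
 */
int currency_symbol_for(const char *locale_name, char *out, size_t out_size)
{
    char saved[256];    /* assumed large enough for a locale name */
    const char *current = setlocale(LC_MONETARY, NULL);

    /* Copy the name: a later setlocale() call may overwrite the string. */
    snprintf(saved, sizeof saved, "%s", current != NULL ? current : "C");

    if (setlocale(LC_MONETARY, locale_name) == NULL)
        return -1;

    snprintf(out, out_size, "%s", localeconv()->int_curr_symbol);

    setlocale(LC_MONETARY, saved);
    return 0;
}
```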