This chapter explains how language, codeset, and cultural differences change the way you implement basic coding operations. After reading this chapter, you will be ready to examine an application that applies the suggested program development techniques. Such an application is provided online in the /usr/examples/i18n/xpg4demo directory. Refer to the README document in that directory for an introduction to the application and how you can compile and run it with different locales. Parts of the xpg4demo application are used as examples in this and other chapters.
One of the primary functions of most computer programs is to manipulate data, some or all of which may involve interaction between the program and a computer user. In commercial situations, it is important that such interactions take place in the native language of each user. Cultural data should also observe the correct customs.
When you write programs to support multilanguage operation, you must consider the fact that languages can be represented within the computer system by one or more codesets. Because of the requirements of different languages, characters in codesets may vary in both size (8-bit, 16-bit, and so on) and binary representation.
You can satisfy the preceding requirements by writing programs that make no hard-coded assumptions about language, cultural data, or character encodings. Such programs are said to be internationalized. Data specific to each supported language, territory, and codeset combination are held separately from the program code and can be bound to the run-time environment by language-initialization functions.
Tru64 UNIX provides the following facilities for developing internationalized software, defining localization data, and announcing specific language requirements:
Library functions that handle extended character codes and that provide language- and codeset-independent character classification, case conversion, number format conversion, and string collation
Library functions that let programs dynamically determine cultural and language-specific data
A message system that allows program messages to be held apart from the program code, translated into different languages, and retrieved by a program at run time
An initialization function that binds a program at run time to the linguistic and cultural requirements of each user
The rest of this chapter describes each of these facilities in more detail.
The discussion and examples in this chapter focus on functions provided in the Standard C Library. Refer to Chapter 4 and Chapter 5 for information about using functions in the curses, X, and Motif libraries.
2.1 Using Codesets
In the past, most UNIX systems were based on the 7-bit ASCII codeset. However, most non-English languages include characters in addition to those contained in the ASCII codeset.
The X/Open UNIX standard does not require an operating system to supply any particular codesets in addition to ASCII. The standard does specify requirements for the interfaces that manipulate characters so that programs are able to handle characters from whatever codeset is available on a given system.
The first group of International Organization for Standardization (ISO) codesets covered only the major European languages. In this group, several codesets allow for the mixing of major languages within a single codeset. Each of these codesets is a superset of the ASCII codeset, and therefore systems can support non-English languages without invalidating existing software that is not internationalized. A Tru64 UNIX operating system always includes a locale for the United States that uses the ISO 8859-1 (ISO Latin-1) codeset.
Subsets that support localized variants of the operating system may include locales based on additional ISO codesets. For example, the optional language variant subsets included with Tru64 UNIX to support Czech, Hungarian, Polish, Russian, Slovak, and Slovene provide locales based on the ISO 8859-2 (Latin-2) codeset. Following is a complete list of ISO codesets with the languages that they support:
ISO 8859-1, Latin-1
Western European languages, including Catalan, Danish, Dutch, English, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish
ISO 8859-2, Latin-2
Eastern European languages, including Albanian, Czech, English, German, Hungarian, Polish, Rumanian, Serbo-Croatian, Slovak, and Slovene
ISO 8859-3, Latin-3
Afrikaans, Catalan, Dutch, English, Esperanto, German, Italian, Maltese, Spanish, and Turkish
ISO 8859-4, Latin-4
Danish, English, Estonian, Finnish, German, Greenlandic, Lappish, Latvian, and Lithuanian
ISO 8859-5, Latin/Cyrillic
Bulgarian, Byelorussian, English, Macedonian, Russian, Serbo-Croatian, and Ukrainian
ISO 8859-6, Latin/Arabic
Arabic
ISO 8859-7, Latin/Greek
Greek
ISO 8859-8, Latin/Hebrew
Hebrew
ISO 8859-9, Latin-5
Danish, Dutch, English, Finnish, French, German, Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish
ISO 8859-10, Latin-6
Danish, English, Estonian, Faroese, Finnish, German, Greenlandic, Icelandic, Sami (Lappish), Latvian, Lithuanian, Norwegian, and Swedish
ISO 8859-15, Latin-9
Albanian, Basque, Breton, Catalan, Danish, Dutch, English, Estonian, Faroese, Finnish, French, Frisian, Galician, German, Greenlandic, Icelandic, Irish Gaelic, Italian, Latin, Luxemburgish, Norwegian, Portuguese, Rhaeto-Romanic, Scottish Gaelic, and Swedish
Another ISO codeset supported by utilities on a standard operating system is ISO 6937:1983. This codeset, which accommodates both 7-bit and 8-bit characters, is used for text communication over communication networks and interchange media, such as magnetic tape and disks.
The codesets discussed up to this point address the requirements of languages whose characters can be stored in a single byte. Such codesets do not meet the needs of Asian languages, whose characters can occupy multiple bytes. The operating system software supplies the following codesets through subsets that support Asian languages and countries:
eucJP (Japanese Extended UNIX Code)
SJIS (Shift JIS)
deckanji (DEC Kanji)
sdeckanji (Super DEC Kanji)
deckorean (DEC Korean)
eucKR (Korean Extended UNIX Code)
TACTIS (Thai API Consortium/Thai Industrial Standard)
dechanzi (DEC Hanzi)
dechanyu (DEC Hanyu)
eucTW (Taiwanese Extended UNIX Code)
big5 (BIG-5)
These codesets are supplied when you install Asian-language variant subsets of the operating system software. Also supplied are a specialized terminal driver and associated utilities that must be available on your system to support the input and display of Asian characters at run time.
Codesets developed for PC systems are commonly called code pages. There are PC code pages that correspond to most of the language-specific codesets developed for UNIX systems. The operating system supports PC codesets mostly through converters that can change file data from one type of encoding format to another. The operating system also supplies a limited number of locales for which characters are defined in PC code page format. For detailed information about code page support, see code_page(5).
The Unicode and ISO/IEC 10646 standards specify the Universal Character Set (UCS), which allows character units to be processed for all languages, including Asian languages, by using the same set of rules. The operating system supports the UCS-4 (32-bit) encoding of this character set in locales that also provide local cultural data, such as collating sequences and date and monetary formats. These locales are derived from similar locales that use UNIX codesets. Therefore, only the characters appropriate for the set of languages supported by the underlying UNIX locale are defined as valid characters in the UCS-4 version.
Two other encoding formats are defined by the Unicode and ISO/IEC 10646 standards:
UCS-2, the 16-bit implementation of the UCS
UTF-8, a UCS transformation format for handling file data containing characters coded in more than one byte
The operating system supports these encoding formats through both locales and codeset converters. Locales whose name extensions include .UTF-8 handle file data in UTF-8 format as well as supporting UCS-4 process code. Among these locales are special variants (*.UTF-8@euro locales) that also support the euro monetary character. There is also one locale, universal.UTF-8, that an application can use along with the fold_string_w() function to process the full range of characters defined by the Unicode and ISO/IEC 10646 standards. This particular locale differs from most others because it does not provide access to local cultural conventions. See Unicode(5) for detailed information about support for the UCS-2, UCS-4, and UTF-8 encoding formats. See euro(5) for more information about the euro monetary character.
Reference pages are available for all the codesets that the operating system supports. For more information on a specific codeset, refer to its reference page. For information on how codesets are supported for a particular local language, refer to the reference page for that language. Reference pages for languages, particularly Asian languages, may note additional codesets that are not supported in a locale but for which there is a codeset converter.
The following sections discuss important issues that affect the way you write source code when your program must process characters in different codesets:
Ensuring data transparency
Using in-code literals
Manipulating multibyte characters
Converting between multibyte-character and wide-character data
Rules for multibyte characters
Classifying characters
Converting characters (case)
Comparing strings
2.1.1 Ensuring Data Transparency
As discussed in Section 2.1, internationalized software must accommodate a wide variety of character-encoding schemes. Programs cannot assume that a particular codeset is on all systems that conform to requirements in the X/Open UNIX CAE specifications, nor that individual characters occupy a fixed number of bits.
Another legacy of the historical dependence of UNIX systems on 7-bit ASCII character encoding is that some programs use the most significant bit of a byte for their own internal purposes. This was a dubious programming practice, although quite safe when characters in the underlying codeset always mapped to the remaining 7 bits of the byte. In the world of international codesets, the practice of using the most significant bit of a byte for program purposes must be avoided.
2.1.2 Using In-Code Literals
When writing internationalized software, using in-code literals can cause problems. Consider, for example, the following conditional statement:
if ((c = getchar()) == '\141')
This condition assumes that lowercase a is always represented by a fixed octal value, which may not be true for all codesets. The following statement represents an improvement in that it substitutes a character constant for the octal value:
if ((c = getchar()) == 'a')
This example still presents problems, however, because the getchar() function operates on bytes. The statement would not work correctly if the next character in the input stream spanned multiple bytes. The following statement substitutes the getwchar() function for the getchar() function. The statement works correctly with any codeset because a is a member of the Portable Character Set (PCS) and is transformed into the same wide-character value in all locales.
if ((c = getwchar()) == L'a')
The X/Open UNIX standard specifies that each member of the source character set and each escape sequence in character constants and string literals is converted to the same member of the execution character set in all locales. It is therefore safe for you to use any of the characters in the PCS as a character constant or in string literals. Note that non-English characters are not included in the PCS and may not translate correctly when used as literals. Consider the following example:
if ((c = getwchar()) == L'à')
The accented character à may not be represented in the codeset's source character set, execution character set, or both; or the binary value of the accented character may not be translatable from one set to the other. When source files specify non-English characters in constants, the results are undefined.
The following example shows how to construct a test for a constant that, for whatever reason, may be a non-English character. The constant has been defined in a message catalog with the symbolic identifier MSG_ID. Statements in the example retrieve the value for MSG_ID from the message catalog, which is locale specific and bound to the program at run time.
.
.
.
char *schar;                                        [1]
wchar_t wchar;                                      [2]
.
.
.
schar = catgets(catd, NL_SETD, MSG_ID, "a");        [3]
if (mbtowc(&wchar, schar, MB_CUR_MAX) == -1)        [4]
    error();
if ((c = getwchar()) == wchar)                      [5]
.
.
.
[1] Declares schar as a pointer to char.
[2] Declares the variable wchar to be of type wchar_t.
[3] Calls the catgets() function to retrieve the value of MSG_ID from the message catalog for the user's locale. The catgets() function returns the value as an array of bytes, so the result is assigned to the schar variable. If the accented character is not available in the locale's codeset, the test is made against the unaccented base character (a).
[4] Tests to make sure the value contained in schar represents a valid multibyte character; if so, converts it to a wide-character value and stores the result in the variable wchar. If schar does not contain a valid multibyte character, signals an error.
[5] Codes the conditional statement to include the value contained in wchar as the constant.
Refer to Chapter 3 for more information about message catalogs and the catgets() function. See Section 2.1.4 for information about converting multibyte characters and strings to wide-character data that your program can process.
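The mbtowc() check used in the example above can be isolated in a small helper. The following is a minimal sketch, not part of the xpg4demo application; the function name first_wide_char() is hypothetical, and the caller is assumed to have already called setlocale() so that the conversion follows the user's codeset.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert the first multibyte character of `text` to a wide character.
 * Returns 0 on success, or -1 if `text` is empty or does not begin
 * with a valid multibyte character in the current locale.
 * (Hypothetical helper, for illustration only.)
 */
int first_wide_char(const char *text, wchar_t *out)
{
    int len;

    /* Reset any shift state left over from an earlier conversion. */
    (void) mbtowc(NULL, NULL, 0);

    /* Examine at most MB_CUR_MAX bytes, the longest character possible. */
    len = mbtowc(out, text, MB_CUR_MAX);
    return (len > 0) ? 0 : -1;
}
```

A program could pass the string returned by catgets() to this helper and compare the resulting wide character with the value returned by getwchar().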
2.1.3 Manipulating Characters That Span Multiple Bytes
Tru64 UNIX provides all the interfaces (such as putwc(), getwc(), fputws(), and fgetws()) that are needed to support codesets with characters that span multiple bytes. Language variant subsets of the operating system must be installed to supply the locales and facilities that make this support operational. On systems where such locales are not available, or are available but not bound to the program at run time, the *ws* and *wc* functions are merely synonyms for the associated single-byte functions (such as putc(), getc(), fputs(), and fgets()).
2.1.4 Converting Between Multibyte-Character and Wide-Character Data
On an internationalized system, data can be encoded as either multibyte-character or wide-character data.
Multibyte encoding is typically the encoding used when data is stored in a file or generated for external use or data interchange. Multibyte encoding has the following disadvantages:
Multibyte characters are not represented by a fixed number of bytes per character, even in the same codeset, so the size of a character in a multibyte data record can vary from one character to the next.
The parsing rules for retrieving character codes from a multibyte data record are locale dependent.
Because of these disadvantages, wide-character encoding, which allocates a fixed number of bytes per character, is typically used for internal processing by programs; in fact, internal process code is another way of referring to data in wide-character format. The size of a wide character varies from one system implementation to another. On Tru64 UNIX systems, the size for a wide character is set to 4 bytes (32 bits), a setting that optimizes performance for the Alpha processor.
Library routines that print, scan, input, or output text can automatically convert data from multibyte characters to wide characters or from wide characters to multibyte characters, as appropriate for the operation. However, applications almost always have additional statements or requirements for which conversion to and from multibyte characters needs to be explicit.
The following example is from a program module that reads records from a database of employee data. In this case, the programmer wants to process the data in fixed-width units and therefore uses the mbstowcs() function to explicitly convert an employee's first and last names from multibyte-character to wide-character encoding.
/*
 * The employee record is normalized with the following format, which
 * is locale independent: Badge number, First Name, Surname,
 * Cost Center, Date of Join in the `yy/mm/dd' format. Each field is
 * separated by a TAB. The space character is allowed in the First
 * Name and Surname fields.
 */
static const char *dbOutFormat = "%ld\t%S\t%S\t%S\t%02d/%02d/%02d\n";
static const char *dbInFormat = "%ld %[^\t] %[^\t] %S %02d/%02d/%02d\n";
.
.
.
sscanf(record, dbInFormat,
       &emp->badge_num,
       firstname,
       surname,
       emp->cost_center,
       &emp->date_of_join.tm_year,
       &emp->date_of_join.tm_mon,
       &emp->date_of_join.tm_mday);
(void) mbstowcs(emp->first_name, firstname, FIRSTNAME_MAX+1);
(void) mbstowcs(emp->surname, surname, SURNAME_MAX+1);
.
.
.
Refer to Section A.9 for a complete list of functions that work directly with multibyte data.
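The employee example discards the return value of mbstowcs(). Where invalid or overlong input is possible, a program may want to check it. The following is a minimal sketch under that assumption; the helper name to_wide() is hypothetical and is not taken from the xpg4demo source.

```c
#include <stdlib.h>
#include <wchar.h>

/*
 * Convert the multibyte string `src` into `dst`, which has room for
 * `cap` wide characters including the terminator.
 * Returns the number of wide characters stored (terminator excluded),
 * or (size_t)-1 if `src` is invalid in the current locale or does
 * not fit. (Hypothetical helper, for illustration only.)
 */
size_t to_wide(wchar_t *dst, size_t cap, const char *src)
{
    size_t n;

    if (cap == 0)
        return (size_t)-1;

    n = mbstowcs(dst, src, cap);
    if (n == (size_t)-1)
        return n;               /* invalid multibyte sequence */
    if (n == cap)
        return (size_t)-1;      /* buffer filled; no terminator written */
    return n;
}
```

When mbstowcs() returns a count equal to the buffer size, it has not written a terminating null wide character, which is why the sketch treats that case as a failure.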
2.1.5 Rules for Multibyte Characters in Source and Execution Codesets
Both the source and execution character set variants of the same codeset can contain multibyte characters. The encodings do not have to be the same, but both set variants observe certain rules in codesets that meet X/Open requirements. PC code pages and UCS-based codesets may adhere to some or most of these rules, but the codesets native to any UNIX system that conforms to X/Open standards must adhere to all of them.
The characters defined in the Portable Character Set must be present in both sets.
The existence, meaning, and encoding of any additional members are locale specific.
A character may have a state-dependent encoding. A string of characters may contain a shift-state character that affects the system's interpretation of the following bytes until another shift-state character is encountered.
While in the initial shift state, all characters from the basic character set retain their usual interpretation and do not alter the shift state.
The interpretation for subsequent bytes in the sequence is a function of the current shift state.
A byte with all bits set to zero is interpreted as a null character, independent of the shift state.
A byte with all bits zero must not occur in the second or subsequent bytes of a multibyte character.
The source variant of a codeset must observe the following additional rules:
A comment, string literal, character constant, or header name must begin and end in the initial shift state.
A comment, string literal, character constant, or header name must consist of a sequence of valid multibyte characters.
The C language compiler also supports trigraph sequences when you specify the -std1 or -std flag on the cc command line. Trigraph sequences, which are part of the ANSI C specification, allow users to enter the full range of basic characters in programs, even if their keyboards do not support all characters in the source codeset. The following trigraph sequences are currently defined, each of which is replaced by the corresponding single character:
Trigraph Sequence | Single Character |
??= | # |
??( | [ |
??/ | \ |
??' | ^ |
??< | { |
??) | ] |
??! | | |
??> | } |
??- | ~ |
2.1.6 Classifying Characters
Another feature of program operation that depends on the locale is character classification; that is, determining whether a particular character code refers to an uppercase alphabetic, lowercase alphabetic, digit, punctuation, control, or space character.
In the past, many programs classified characters according to whether the character's value fell between certain numerical limits. For example, the following statement tests for all uppercase alphabetic characters:
if (c >= 'A' && c <= 'Z')
This statement is valid for the ASCII codeset, in which all uppercase letters have values in the range 0x41 to 0x5a (A to Z). However, the statement is not valid for the ISO 8859-1 codeset, in which uppercase letters occupy the ranges 0x41 to 0x5a, 0xc0 to 0xd6, and 0xd8 to 0xde. In the EBCDIC codeset, character values are different again and, in this case, even the uppercase English letters have a different encoding.
When you write internationalized programs, classify characters by calling the appropriate internationalization function. For example:
if (iswupper (c))
Internationalization functions classify wide-character code values according to ctype information in the user's locale. Refer to Section A.2 for a complete list and description of character classification functions.
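As a small illustration of locale-based classification, the following sketch counts the uppercase characters in a wide-character string by calling iswupper() rather than testing numeric ranges. The function name count_upper() is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Count the characters in `s` that the current locale classifies
 * as uppercase. (Hypothetical helper, for illustration only.)
 */
size_t count_upper(const wchar_t *s)
{
    size_t n = 0;

    for (; *s != L'\0'; s++) {
        if (iswupper((wint_t)*s))
            n++;
    }
    return n;
}
```

Because the test is delegated to iswupper(), the same code counts accented uppercase letters when the program is bound to a locale that defines them.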
2.1.7 Converting Characters
You can do case conversion of ASCII characters with statements such as the following ones, which convert the character in a_var first to lowercase and then to uppercase:
a_var |= 0x20;
.
.
.
a_var &= 0xdf;
The preceding statements are not safe to use in internationalized programs because they:
Assume ASCII-coded character values
Can convert invalid values
The correct way to handle case conversion is to call the towlower() function for conversion to lowercase and the towupper() function for conversion to uppercase. For example:
a_var = towlower(a_var);
.
.
.
a_var = towupper(a_var);
These functions use information specified in the user's locale and are independent of the codeset where characters are defined. The functions return the argument unchanged if input is invalid. Refer to Section A.3 for more detailed discussion of case conversion functions.
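The same approach extends from a single character to a whole string. The following sketch converts a wide-character string to uppercase in place; the function name upcase() is hypothetical.

```c
#include <wchar.h>
#include <wctype.h>

/*
 * Convert every character in `s` to uppercase, in place, using the
 * case mappings of the current locale. Characters that have no
 * uppercase form are left unchanged by towupper().
 * (Hypothetical helper, for illustration only.)
 */
void upcase(wchar_t *s)
{
    for (; *s != L'\0'; s++)
        *s = (wchar_t)towupper((wint_t)*s);
}
```

Unlike the bit-mask statements, this loop does not alter digits or punctuation, because towupper() returns them unchanged.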
2.1.8 Comparing Strings
UNIX systems have always provided functions for comparing character strings. The following statement, for example, compares the strings s1 and s2, returning an integer greater than, equal to, or less than zero, depending on whether the value of s1 is greater than, equal to, or less than the value of s2 in the machine-collating sequence:
.
.
.
int cmp_val;
char *s1;
char *s2;
.
.
.
cmp_val = strcmp(s1, s2);
.
.
.
Many languages, however, require more complex collation algorithms than a simple numerical sort. For example, multiple passes may be required for the following reasons:
Ordering accented characters within a particular character class for a language (for example, a, á, à, and so on)
Collating certain multiple character sequences as a single character (for example, the Welsh character ch, which collates after c and before d)
Collating certain single characters as a 2-character sequence (for example, the German character sharp s, which collates as ss)
Ignoring certain characters during collation (for example, hyphens in dictionary words)
String comparison in an international environment thus depends on the codeset and language. This dependency means that additional functions are required to compare strings according to collating sequence information in the user's locale. These functions include:
strcoll(), which uses collation information defined in the user's locale rather than performing a simple numeric comparison as does the strcmp() function
wcscoll(), which performs the same operation as strcoll(), except that it operates on wide characters
wcsxfrm(), which transforms a wide-character string by using collating sequence information in the user's locale so that the resulting string can be compared using the wcscmp() function
If two strings are being compared only for equality, you can use strcmp() or wcscmp(), which are faster in most environments than strcoll() or wcscoll().
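A common use of locale-aware comparison is sorting. The following sketch sorts an array of strings with qsort() and a comparison callback built on strcoll(). The names compare_locale() and sort_names() are hypothetical, and the program is assumed to have called setlocale() beforehand so that LC_COLLATE reflects the user's locale.

```c
#include <stdlib.h>
#include <string.h>

/*
 * qsort() callback: order two strings by the collating sequence of
 * the current locale. (Hypothetical helper, for illustration only.)
 */
static int compare_locale(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;

    return strcoll(*sa, *sb);
}

/* Sort `count` strings in place according to the locale's collation. */
void sort_names(const char **names, size_t count)
{
    qsort(names, count, sizeof names[0], compare_locale);
}
```

When the same strings are compared many times, transforming each one once (for example, with wcsxfrm() for wide-character data) and then comparing the transformed strings may be cheaper than calling a collation function on every comparison.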
2.2 Handling Cultural Data
Cultural data refers to items of information that can vary between languages or territories. For example:
In the United Kingdom and the United States, a period represents the radix character and a comma represents the thousands separator in decimal numbers. In Germany, the same two characters are used in decimal numbers with exactly the opposite meaning.
In the United States, the date October 7, 1986 is represented as 10/7/1986, whereas in the United Kingdom, the same date is represented as 7/10/1986. This example indicates that cultural data items can vary when the same language is spoken.
Date delimiters, as well as the order of year, month, and day, can vary among countries. In Germany, for example, the date October 7, 1986 is represented as 7.10.1986 rather than as 7/10/1986.
Currency symbols can vary both in terms of the characters used and where they are placed in a currency value; that is, currency symbols can precede, follow, or be embedded in the value.
You cannot make assumptions about cultural data when writing internationalized programs. Your program must operate according to the local customs of users. The X/Open UNIX standard specifies that this requirement be met through a database of cultural data items that a program can access at run time, plus a set of associated interfaces. The following sections discuss this database and the functions used to extract and process its data items.
2.2.1 The langinfo Database
The language information database, named langinfo, contains items that represent the cultural details of each locale supported on the system. The langinfo database contains the following information for each locale, as required by the X/Open UNIX standard:
Codeset name
Date and time formats
Names of the days of the week
Names of the months of the year
Abbreviations for names of days
Abbreviations for names of months
Radix character (the character that separates whole and fractional quantities)
Thousands separator character
Affirmative and negative responses for yes/no queries
Currency symbol and its position within a currency value
Emperor/Era name and year (for Japanese locales)
2.2.2 Querying the langinfo Database
You can extract cultural data items from the langinfo database by calling the nl_langinfo() function. This function takes an item argument that is one of several constants defined in the /usr/include/langinfo.h header file. The function returns a pointer to the string with the value for item in the current locale. The following example shows a call to nl_langinfo() that extracts the string for formatting date and time information. This value is associated with the constant D_T_FMT.
nl_langinfo(D_T_FMT);
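The following sketch shows a few nl_langinfo() queries in context. The function names radix_string() and print_conventions() are hypothetical; CODESET, D_T_FMT, and RADIXCHAR are item constants defined by the X/Open standard in langinfo.h. The caller is assumed to have called setlocale(LC_ALL, "") first; otherwise, the values reported are those of the default C locale.

```c
#include <langinfo.h>
#include <stdio.h>

/*
 * Return the radix character string for the current locale.
 * The returned pointer refers to storage owned by the library and
 * may be overwritten by a later nl_langinfo() call.
 * (Hypothetical helper, for illustration only.)
 */
const char *radix_string(void)
{
    return nl_langinfo(RADIXCHAR);
}

/* Print a few cultural data items for the current locale. */
void print_conventions(FILE *out)
{
    fprintf(out, "codeset:          %s\n", nl_langinfo(CODESET));
    fprintf(out, "date/time format: %s\n", nl_langinfo(D_T_FMT));
    fprintf(out, "radix character:  %s\n", nl_langinfo(RADIXCHAR));
}
```

Each result is used immediately rather than saved, because a program that needs to keep a value across further nl_langinfo() or setlocale() calls should copy the string.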
2.2.3 Generating and Interpreting Date and Time Strings That Observe Local Customs
Programs often generate date and time strings. Internationalized programs generate strings that observe the local customs of the user. You can meet this requirement by calling the strftime() or wcsftime() function. Both functions indirectly use the langinfo database. The difference is that wcsftime() converts date and time to wide-character format. In the following example, the strftime() function generates a date string as defined by the D_FMT item in the langinfo database:
.
.
.
setlocale(LC_ALL, ""); [1]
.
.
.
clock = time((time_t*)NULL);      [2]
tm = localtime(&clock);           [3]
.
.
.
strftime(buf, size, "%x", tm);    [4]
puts(buf);                        [5]
.
.
.
[1] Binds the program at run time to the locale set for the system or individual user.
[2] Calls the time() function to return the time value, relative to Coordinated Universal Time, to the clock variable.
[3] Calls the localtime() function to convert the value contained in clock to a value that can be stored in a tm structure, whose members represent values for year, month, day, hour, minute, and so forth.
[4] Calls strftime() to generate a date string formatted as defined in the user's locale from the value contained in the tm structure. The buf argument is a pointer to a string variable in which the date string is returned. The size argument contains the maximum size of buf. The "%x" argument specifies conversion specifications, similar to the format strings used with the printf() and scanf() functions. The "%x" argument is replaced in the output string by the date representation appropriate for the locale.
[5] Calls the puts() function to copy the string contained in buf to the standard output stream (stdout) and to append a newline character.
The following example shows how to use strftime() and nl_langinfo() in combination to generate a date and time string. Assume that the same calls to the setlocale(), time(), and localtime() interfaces have been made here as shown in the preceding example. The only difference is that a call to nl_langinfo() has replaced the format string argument in the call to strftime():
.
.
.
strftime(buf, size, nl_langinfo(D_T_FMT), tm);
puts(buf);
.
.
.
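The fragments in this section omit declarations. The following self-contained sketch formats a given calendar date with the "%x" conversion; the function name format_local_date() and its parameters are hypothetical. It assumes that the program has already called setlocale(LC_ALL, ""); without that call, the default C locale applies.

```c
#include <stddef.h>
#include <time.h>

/*
 * Format a calendar date by using the date representation of the
 * current locale ("%x"). `year` is the full year, `month` is 1-12,
 * and `day` is 1-31. Returns the number of bytes stored in `buf`,
 * or 0 if `buf` is too small or the date cannot be represented.
 * (Hypothetical helper, for illustration only.)
 */
size_t format_local_date(char *buf, size_t size,
                         int year, int month, int day)
{
    struct tm tm = {0};

    tm.tm_year = year - 1900;
    tm.tm_mon = month - 1;
    tm.tm_mday = day;
    tm.tm_hour = 12;      /* midday avoids edge cases near DST changes */
    tm.tm_isdst = -1;     /* let mktime() decide whether DST applies */

    /* mktime() normalizes the structure and fills in tm_wday and tm_yday. */
    if (mktime(&tm) == (time_t)-1)
        return 0;

    return strftime(buf, size, "%x", &tm);
}
```

Calling mktime() before strftime() matters for locales whose date format includes the day of the week, because that member would otherwise be left at zero.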
To convert a string to a date/time value, the reverse of the operation performed by strftime(), you can use the strptime() function. The strptime() function supports a number of conversion specifiers that behave in a locale-dependent manner.
2.2.4 Formatting Monetary Values
The strfmon() function formats monetary values according to information in the locale that is bound to the program at run time. For example:
strfmon(buf, size, "%n", value);
This statement formats the double-precision floating-point value contained in the value variable. The "%n" argument is the format specification that is replaced by the format defined in the run-time locale. The results are returned to the buf array, whose maximum length is contained in the size variable.
The money program demonstrates how the strfmon() function works. The source file for this sample program is available in the /usr/i18n/examples/money directory.
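The following sketch wraps the strfmon() call with a check of its return value. It is not the money sample program; the function name format_money() is hypothetical, and the monetary.h header is assumed to be available as specified by the X/Open standard.

```c
#include <monetary.h>
#include <sys/types.h>

/*
 * Format `value` as a monetary amount in the national format of the
 * current locale ("%n"). Returns 0 on success, or -1 if the result
 * does not fit in `buf`. (Hypothetical helper, for illustration only.)
 */
int format_money(char *buf, size_t size, double value)
{
    ssize_t n = strfmon(buf, size, "%n", value);

    return (n < 0) ? -1 : 0;
}
```

The output depends entirely on the LC_MONETARY category. In the default C locale, no currency symbol is defined, so a program must call setlocale() before the result reflects the user's conventions.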
2.2.5 Formatting Numeric Values in Program-Specific Ways
You may want to perform your own conversions of numeric quantities, monetary or otherwise, by using specific formatting details in the user's locale. The localeconv() function, which has no arguments, returns a pointer to an lconv structure that contains all the number formatting details defined in the locale. Declare a pointer of that type in your program to receive the result. For example:
struct lconv *app_conv;
You can use the following features, which are contained in the lconv structure, in program-defined routines:
Radix character
Thousands separator character
Digit grouping size
International currency symbol
Local currency symbol
Radix character for monetary values
Thousands separator for monetary values
Digit grouping size for monetary values
Positive sign
Negative sign
Number of fractional digits to be displayed
Parenthesis symbols for negative monetary values
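As an illustration of a program-defined routine, the following sketch builds a fixed-point string by using the radix character from the lconv structure. The function name format_fraction() is hypothetical; decimal_point is a standard member of struct lconv.

```c
#include <locale.h>
#include <stdio.h>

/*
 * Write `whole` and `hundredths` as a decimal number, separated by
 * the radix character of the current locale. Returns the value
 * returned by snprintf(): the length that the full result requires.
 * (Hypothetical helper, for illustration only.)
 */
int format_fraction(char *buf, size_t size, long whole, int hundredths)
{
    const struct lconv *conv = localeconv();

    return snprintf(buf, size, "%ld%s%02d",
                    whole, conv->decimal_point, hundredths);
}
```

The radix character is inserted as a string rather than as a single char because, in some locales, it may occupy more than one byte.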
2.2.6 Using the langinfo Database for Other Tasks
Functions in addition to the ones discussed so far use the langinfo database to determine settings for specific items of cultural data. For example, the wscanf(), wprintf(), and wcstod() functions determine the appropriate radix character from information in the langinfo database.
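The following sketch shows the effect for wcstod(): the same call accepts whichever radix character the current locale defines. The function name parse_number() is hypothetical.

```c
#include <wchar.h>

/*
 * Parse the whole of `text` as a floating-point number by using the
 * radix character of the current locale. Returns 0 and stores the
 * value in `*out` on success, or -1 if `text` is empty or contains
 * characters that are not part of the number.
 * (Hypothetical helper, for illustration only.)
 */
int parse_number(const wchar_t *text, double *out)
{
    wchar_t *end;
    double value = wcstod(text, &end);

    if (end == text || *end != L'\0')
        return -1;

    *out = value;
    return 0;
}
```

In the default C locale, the string 2.5 is accepted and 2,5 is rejected. In a locale whose radix character is a comma, the situation is reversed, without any change to the code.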
2.3 Handling Text Presentation and Input
The language of the program user affects:
The way program messages are defined and accessed
How the program presents output text
How the program processes input text
These considerations are discussed in the following sections.
2.3.1 Creating and Using Messages
Programs need to communicate with users in their own language. This requirement places some constraints on the way program messages are defined and accessed. More specifically, messages are defined in a file that is independent of the program source code and are not compiled into object files. Because messages are in a separate file, they can be translated into different languages and stored in a form that is linked to the program at run time. Programs can then retrieve message text translations that are appropriate for the user's language.
The X/Open UNIX standard specifies:
A messaging system that contains a definition of message text source files
The gencat command to generate message catalogs from these source files
A set of library functions to retrieve individual messages from one or more catalogs at run time
The following example shows how an internationalized program retrieves a message from a catalog:
#include <stdio.h>                                            [1]
#include <locale.h>                                           [2]
#include <nl_types.h>                                         [3]
#include "prog_msg.h"                                         [4]

int main(void)
{
    nl_catd catd;                                             [5]

    setlocale(LC_ALL, "");                                    [6]
    catd = catopen("prog.cat", NL_CAT_LOCALE);                [7]
    puts(catgets(catd, SETN, HELLO_MSG, "Hello, world!"));    [8]
    catclose(catd);                                           [9]
    return 0;
}
.
.
.
[1] Includes the standard I/O header file, which declares the puts() function.
[2] Includes the /usr/include/locale.h header file, which declares the setlocale() function and associated constants and variables.
[3] Includes the /usr/include/nl_types.h header file, which declares the catopen(), catgets(), and catclose() functions.
[4] Includes the program-specific prog_msg.h header file, which sets constants to identify the message set (SETN) and specific messages (HELLO_MSG being one) that are used by this program module. A message catalog can contain one or more message sets, and individual messages are ordered within each set.
[5] Declares a message catalog descriptor catd to be of type nl_catd. This descriptor is returned by the function that opens the catalog. The descriptor is also passed as an argument to the function that closes the catalog.
[6] Calls the setlocale() function to bind the program's locale categories to settings for the user's locale environment variables. The locale name set for the LC_MESSAGES category is the locale used by the catopen() and catgets() functions in this example. Typically, the system manager or user sets only the LANG or LC_ALL environment variable to a particular locale name, and this operation implicitly determines the LC_MESSAGES setting as well.
[7] Calls the catopen() function to open the prog.cat message catalog for use by this program. The NL_CAT_LOCALE argument specifies that the program will use the locale name set for LC_MESSAGES. The catopen() function uses the value set for the NLSPATH environment variable to determine the location of the message catalog. The call returns the message catalog descriptor to the catd variable.
[8] Calls the puts() function to display the message. The argument to this call is a call to the catgets() function, which retrieves the appropriate text for the message with the HELLO_MSG identifier. This message is contained in the message set identified by the SETN constant. The final argument to catgets() is the default text to be used if the messaging call cannot retrieve the translated text from the catalog. Default text is usually in English.
[9] Calls the catclose() function to close the message catalog whose descriptor is contained in the catd variable.
Refer to Chapter 3 for information about creating and using message catalogs.
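The example above does not check whether the catalog was opened. The following sketch shows one way to make the fallback to default text explicit. It relies on the standard behavior that catopen() returns (nl_catd)-1 on failure. The helper names open_catalog() and message_or_default() are hypothetical, as are the SETN and HELLO_MSG values, which a real program would take from its own message header file.

```c
#include <nl_types.h>

/* Hypothetical identifiers; a real program would take these from
 * its message header file (prog_msg.h in the example above). */
#define SETN      1
#define HELLO_MSG 1

/* Opens a catalog for the locale set for LC_MESSAGES.
 * Returns (nl_catd)-1 if the catalog cannot be found or opened. */
nl_catd open_catalog(const char *name)
{
    return catopen(name, NL_CAT_LOCALE);
}

/* Returns the translated text, or the default text when the catalog
 * is not open or does not contain the requested message. */
const char *message_or_default(nl_catd catd, int set_id, int msg_id,
                               const char *default_text)
{
    if (catd == (nl_catd)-1)
        return default_text;    /* no catalog: use built-in text */
    return catgets(catd, set_id, msg_id, default_text);
}
```

A caller would close the descriptor with catclose() only when open_catalog() did not return (nl_catd)-1, and would finish using the returned text before closing the catalog.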
2.3.2 Formatting Output Text
Successful translation of messages into different languages depends not only on making messages independent of the program source code but also on careful construction of message strings within the program.
Consider the following example:
printf(catgets(catd, set_id, WRONG_OWNER_MSG, "%s is owned by %s\n"), folder_name, user_name);
The preceding statement uses a message catalog but assumes a particular language construction (a noun followed by a verb in passive voice followed by a noun). Passive-verb constructions are not part of all languages; therefore, message translation might mean printing user_name before folder_name. In other words, the translator might need to change the construction of the message so that the user sees the translated equivalent of "John_Smith owns JULY_REVENUE" rather than "JULY_REVENUE is owned by John_Smith."
To overcome the problems imposed by fixed ordering of message elements, the format specifiers for the printf() routine have been extended so that format conversion applies to the nth argument in an argument list rather than to the next unused argument. To apply the format conversion extension, replace the % conversion character with the sequence %digit$, where digit specifies the position of the argument in the argument list.
The following example illustrates how the programmer applies this feature to the format string "%s is owned by %s\n":
printf(catgets(catd, set_id, WRONG_OWNER_MSG, "%1$s is owned by %2$s\n"), folder_name, user_name);
The construction of the string "%1$s is owned by %2$s", which is the default value for the WRONG_OWNER_MSG entry in the program's message file, can then be changed by the translator to the non-English equivalent of:
WRONG_OWNER_MSG "%2$s owns %1$s\n"
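The effect of reordering can be checked without a message catalog. The following sketch passes the same two arguments, in the same order, to two different format strings; only the format string decides which argument is printed first. The helper name format_owner_msg() is hypothetical, and in a real program the format string would come from catgets().

```c
#include <stdio.h>

/* Hypothetical helper: formats an ownership message into out.
 * fmt must use positional specifiers, where %1$s is the folder
 * name and %2$s is the user name.  Returns the value returned by
 * snprintf(), that is, the length of the complete message. */
int format_owner_msg(char *out, size_t size, const char *fmt,
                     const char *folder_name, const char *user_name)
{
    return snprintf(out, size, fmt, folder_name, user_name);
}
```

With the arguments "JULY_REVENUE" and "John_Smith", the format "%1$s is owned by %2$s" yields "JULY_REVENUE is owned by John_Smith", and the format "%2$s owns %1$s" yields "John_Smith owns JULY_REVENUE".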
2.3.3 Scanning Input Text
The string construction issues that are discussed for output text in Section 2.3.2 also apply to input text.
For example, different countries have different conventions for the order in which users specify the elements of a date and for the characters that delimit parts of monetary or other numeric strings. Therefore, the scanf() family of functions also supports extended format conversion specifiers to allow for variation in the way that users enter elements of a string.
Consider the following example:
.
.
.
int day;
int month;
int year;
.
.
.
scanf("%d/%d/%d", &month, &day, &year);
.
.
.
The format string in this statement is governed by the assumption that all users use a United States format (mm/dd/yyyy) to input dates. In an internationalized program, you use extended format specifiers to support requirements that language may impose on the order of string elements. For example:
.
.
.
scanf(catgets(catd, NL_SETD, DATE_STRING, "%1$d/%2$d/%3$d"), &month, &day, &year);
.
.
.
The default "%1$d/%2$d/%3$d" value for the DATE_STRING message is still appropriate only for countries where users use the format mm/dd/yyyy to enter dates. However, for countries in which the order or formatting would be different, the translator can change the entry in the program's message file. For example:
British English (dd/mm/yyyy):
DATE_STRING "%2$d/%1$d/%3$d"
German (dd.mm.yyyy):
DATE_STRING "%2$d.%1$d.%3$d"
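The three DATE_STRING variants can be exercised with sscanf(), which accepts the same positional specifiers as scanf(). The following sketch is an illustration only; the helper name parse_date() is hypothetical, and a real program would obtain the format string from catgets() rather than pass it as a literal.

```c
#include <stdio.h>

/* Hypothetical helper: parses a date from text.  fmt must use
 * positional specifiers, where %1$d is the month, %2$d is the day,
 * and %3$d is the year.  Returns 1 if all three fields were read,
 * otherwise 0. */
int parse_date(const char *text, const char *fmt,
               int *month, int *day, int *year)
{
    return sscanf(text, fmt, month, day, year) == 3;
}
```

Because the argument order (month, day, year) never changes, the British and German format strings store the first number typed by the user in day and the second in month.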
2.4 Binding a Locale to the Run-Time Environment
For an internationalized program to operate correctly, it must bind to localized data that is appropriate for the user at run time. The setlocale() function performs this task. You can call setlocale() to:
Bind to locale settings that are already in effect for the user's process
Bind to locale settings controlled by the program
Query current locale settings without changing them
The call takes two arguments: category and locale_name.
The category argument specifies whether you want to query, change, or use all or a specific section of a locale. Values for category and what they represent are as follows:
LC_ALL, all sections of a locale
LC_CTYPE, the locale section that classifies characters
LC_COLLATE, the locale section that specifies character collation order
LC_MESSAGES, the locale section that specifies yes/no responses and program messages
LC_MONETARY, the locale section that specifies special characters used in monetary values
LC_NUMERIC, the locale section that specifies the characters used for decimal point and thousands separator
LC_TIME, the locale section that specifies names and abbreviations for days of the week and months of the year, and other strings and formatting conventions that govern expressions of date and time
The locale_name argument is one of the following values:
An empty string ("") to bind the program at run time to the locale name set for category by the system manager or user
A locale name to change the locale that may already be set for category
NULL to determine the locale name currently set for category
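The three forms of the locale_name argument can be summarized in a short sketch. The helper names below are hypothetical wrappers, written only to show that setlocale() returns NULL when a request cannot be honored and returns the locale name otherwise.

```c
#include <locale.h>

/* Binds all categories to the user's environment settings.
 * Returns 0 if the environment names a locale that is not available. */
int bind_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Binds all categories to a locale chosen by the program.
 * Returns 0 if the name is rejected; the old locale stays in effect. */
int bind_named_locale(const char *locale_name)
{
    return setlocale(LC_ALL, locale_name) != NULL;
}

/* Queries the name currently set for category without changing it. */
const char *current_locale(int category)
{
    return setlocale(category, NULL);
}
```

Checking the return value matters most for the first two forms, because a misspelled or uninstalled locale name otherwise goes unnoticed and the program silently keeps its previous settings.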
2.4.1 Binding to the Locale Set for the System or User
Typically, the system manager or user sets the LANG or LC_ALL environment variable to the name of a locale. LC_ALL overrides every locale category variable, whereas LANG supplies the default for any category variable that is not set, so setting either one alone gives all categories the same locale name. On occasion (if they do not use LC_ALL), system managers or individual users may set locale category variables to different locale names. Usually, internationalized programs contain the following call, which initializes all locale categories in the program to environment variable settings already in effect for the user:
setlocale(LC_ALL, "");
2.4.2 Changing Locales During Program Execution
Some internationalized programs may need to prompt the user for a locale name or change locales during program execution. The following example shows how to call setlocale() when you want to explicitly initialize or reinitialize all locale categories to the same locale name:
.
.
.
nl_catd catd;                                  [1]
char buf[BUFSIZ];                              [2]
.
.
.
setlocale(LC_ALL, "");                         [3]
catd = catopen(CAT_NAME, NL_CAT_LOCALE);       [4]
.
.
.
printf(catgets(catd, NL_SETD, LOCALE_PROMPT_MSG, "Enter locale name: ")); [5]
gets(buf);                                     [6]
setlocale(LC_ALL, buf);                        [7]
.
.
.
[1] Declares a catalog descriptor catd as type nl_catd.
[2] Declares the buf variable into which the locale name will later be stored. To make sure that the variable is large enough to accommodate locale names on different systems, you should set its maximum size to the BUFSIZ constant, which is defined by the system vendor in /usr/include/stdio.h.
[3] Calls setlocale() to initialize the program's locale settings to those in effect for the user who runs the program.
[4] Calls catopen() to open the message catalog that contains the program's messages; returns the catalog's descriptor to the catd variable. The CAT_NAME constant is defined in the program's own header file.
[5] Prompts the user for a new locale name. The NL_SETD constant specifies the default message set number in a message catalog and is defined in /usr/include/nl_types.h. The LOCALE_PROMPT_MSG identifier specifies the prompt string translation in the default message set.
[6] Calls the gets() function to read the locale name typed by the user into the buf variable. Because gets() cannot limit the number of characters that it stores in buf, new code should read the line with fgets(buf, sizeof(buf), stdin) and remove the trailing newline character instead.
[7] Calls setlocale() with buf as the locale_name argument to reinitialize all portions of the locale.
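The read-and-rebind steps of this example can also be written with bounded input and a check of the setlocale() return value. The following is a minimal sketch under those assumptions, not the xpg4demo implementation; the helper names set_locale_from_line() and prompt_for_locale() are hypothetical.

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: strips a trailing newline from line and makes
 * the result the locale for all categories.  Returns 1 on success,
 * or 0 if setlocale() rejects the name, in which case the previous
 * locale stays in effect.  An empty line selects the locale from
 * the user's environment. */
int set_locale_from_line(char *line)
{
    line[strcspn(line, "\n")] = '\0';
    return setlocale(LC_ALL, line) != NULL;
}

/* Hypothetical helper: reads at most BUFSIZ - 1 characters of one
 * input line, so a long line cannot overflow buf as it could with
 * gets(). */
int prompt_for_locale(FILE *in)
{
    char buf[BUFSIZ];

    if (fgets(buf, sizeof(buf), in) == NULL)
        return 0;               /* end of input or read error */
    return set_locale_from_line(buf);
}
```

A program would typically call prompt_for_locale(stdin) after printing the prompt and repeat the prompt when the function returns 0.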
Sometimes a program needs to vary the locale only for a particular category of data. For example, consider a program that processes different country-specific files that contain monetary values. Before processing data in each file, the program might reinitialize a program variable to a new locale name and then use that variable value to reset only the LC_MONETARY category of the locale.
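A category-specific reset of this kind might look like the following sketch. The helper name use_monetary_locale() is hypothetical; the sketch assumes that the locale name for each file is already available to the program and uses localeconv() only to show which monetary conventions are in effect afterward.

```c
#include <locale.h>

/* Hypothetical helper: rebinds only the monetary section of the
 * locale; all other categories keep their current settings.
 * Returns the formatting conventions now in effect, or NULL if
 * the locale name was rejected. */
const struct lconv *use_monetary_locale(const char *locale_name)
{
    if (setlocale(LC_MONETARY, locale_name) == NULL)
        return NULL;
    return localeconv();
}
```

The structure returned by localeconv() can be overwritten by later calls to setlocale() or localeconv(), so a program that needs the values for longer should copy them.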