C Using Internationalization Features

This appendix describes the internationalization features of the operating system. These features provide users with the ability to process data and to interact with the system in a manner appropriate to their native language, customs, and geographic region (their locale).

After reading this appendix, you will be able to do the following:

Understand the concept of locale
Understand what functions are affected by locale
Determine whether a locale has been set (if necessary)
Set your locale (if necessary)
Change your locale or aspects of your locale (if necessary)

If your site is in the United States and you plan to use the American English language and its conventions, there is no need to set a locale because the system default is American English.

If your site is outside the United States, the system administrator has most likely already specified the locale. If the locale has already been set, you may want only to skim this appendix for background information on internationalization. If the locale has not been set, the information in this appendix is essential to you.

C.1 Understanding Locale

Because Digital UNIX is an internationalized operating system, it can present information in a variety of ways. Users tell the operating system how to process and present information in a way appropriate for their language, country, and cultural customs by specifying a locale. See Section C.4 for information about how to specify a locale.

A locale generally consists of three parts: language, territory, and codeset. All three are important for specifying how information is processed and displayed:

Language specifies the native language (for example, German, French, English).
Territory specifies the geographic area (for example, Germany, France, Great Britain).
Codeset specifies the coded character set that is used for the locale (for example, ISO 8859/1, the ISO Latin-1 codeset).

At this point, some background information about codesets may be helpful.

The ASCII codeset has traditionally been used on UNIX systems to express American English. Each letter of the English alphabet (A to Z, a to z), along with each digit, control character, and symbol, is uniquely identified using only 7 of the 8 bits in a standard byte. However, the introduction of new codesets or expansion of old ones has been necessary to include non-English characters. Because so many programs rely on ASCII characters in one way or another, the most commonly used codesets begin with ASCII and build from there.

By using all 8 bits of a standard byte, a single codeset can uniquely identify characters in several alphabetic languages. The most popular codesets are a series called ISO 8859. The first in the series is called ISO 8859/1, the second is ISO 8859/2, and so on through ISO 8859/10. The ISO 8859/1 codeset, often called Latin-1, supports English and other Western European languages.

To identify all ideographic symbols in Asian languages, such as Chinese and Japanese, character encoding requires more than one byte. Numerous codesets using multibyte character encoding, which is not supported by the ISO 8859 series of codesets, have been developed for Asian languages.

C.2 How Locale Affects Processing and Display of Data

As previously mentioned, the locale specified on your system influences how information is processed and displayed. Specifically, locale affects how the software:

Collates (sorts) data
Formats date and time expressions
Formats monetary and other numeric expressions
Displays messages
Prompts for yes/no responses

The following sections describe the items in this list.

C.2.1 Collation

Collation is the action of arranging elements of a set into a particular order. Collation always follows a set of rules. Some languages require collation rules that are not used in English.

Multilevel
Some languages include groups of characters that all sort to the same primary location. Additional sort rules then order the characters within the same group. For example, the French characters a, á, à, and â all sort to the same primary location. Words that begin with these characters collate to the same location, at which point words are sorted within the group. These words are in correct French order:
a
á
abord
âpre
après
âpreté
azur
One-to-two character mapping
In some languages, certain single characters are treated as if they were two characters. For example, the German sharp s (ß) is sorted as if it were "ss".
Multiple-to-one character mapping
Some languages treat a string of characters as if it were a single element. For example, the Spanish ch and ll sequences are treated as unique characters in the Spanish alphabet. The following words are in correct Spanish order:
canto
construir
curioso
chapa
chocolate
dama
Ignored characters
Some collation rules ignore certain characters. For example, if the hyphen (-) is defined as a character to be ignored, the strings "re-locate" and "relocate" sort to the same position.

Note
This means that you cannot assume that the range [A to Z, a to z] includes every letter of an alphabet. For example, the Danish alphabet includes three characters that sort after z.
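To see why locale-specific collation rules matter, you can sort the French words from the multilevel example with the C locale, which compares strings by byte value only. This is a minimal sketch; it assumes that your sort command honors the locale variables, and it sets LC_ALL for a single command only, which does not change your session.

```sh
# Sort the French sample words using the C locale (byte-value order).
# The LC_ALL assignment applies to this one sort command only.
printf '%s\n' azur âpreté après âpre abord á a | LC_ALL=C sort
```

In the C locale, the words that begin with an accented character are listed after azur instead of being grouped with the other words that begin with a. A French locale, if installed, would be expected to produce the order shown earlier in this section.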

C.2.2 Date and Time Formats

Users around the world express dates and times with different formatting conventions. When specifying day and month names, people in the United States generally express dates with an expression like the following one:

Wednesday, May 22, 1996

The French, on the other hand, express dates this way:

mercredi, 22 mai 1996

The following examples show alternative formats for the date, March 20, 1996. A given format is not the only way to write the date in the listed country:

3/20/96 (United States)

20/3/96 (Great Britain)

20.3.96 (France and Germany)

20-III-96 (Italy)

96/3/20 (Japan)

8/3/20 (Japan, Emperor format)

In Japan's Emperor format, the year (8, in the preceding example) is expressed as the number of years that the current emperor has reigned.
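The difference in field order and punctuation among the numeric date formats can be sketched as a small shell function. The function name format_date and its two-letter country arguments are hypothetical and exist only for illustration; an internationalized program would obtain the format from the locale (the LC_TIME category) rather than from a hard-coded table like this one.

```sh
# format_date YEAR MONTH DAY COUNTRY
# Hypothetical helper: prints a date using one common convention per country.
# Assumes YEAR is written with four digits.
format_date() {
    yy=${1#??}                      # keep the last two digits of the year
    case $4 in
        US)    echo "$2/$3/$yy" ;;  # month/day/year
        GB)    echo "$3/$2/$yy" ;;  # day/month/year
        FR|DE) echo "$3.$2.$yy" ;;  # day.month.year
        JP)    echo "$yy/$2/$3" ;;  # year/month/day
        *)     echo "unknown country code: $4" >&2; return 1 ;;
    esac
}

format_date 1996 3 20 US
format_date 1996 3 20 GB
format_date 1996 3 20 DE
format_date 1996 3 20 JP
```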

As with dates, there are many conventions for expressing the time of day. In the United States, people often use the 12-hour clock with its a.m. and p.m. designations. People in most other countries use the 24-hour clock to express the time.

In addition to the 12-hour/24-hour clock differences, punctuation for written times can vary, for example:

3:20 p.m. (United States)

15h20 (France)

15.20 (Germany)

15:20 (Japan)

C.2.3 Numeric and Monetary Formats

The characters used to format numeric and monetary values vary from place to place. In the United States, the convention is to use a period (.) as the radix character (the character that separates whole and fractional quantities), and a comma (,) as the thousands separator. In many European countries, these conventions are reversed. For example:

1,234.56 (United States)

1.234,56 (France)
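The two conventions differ only in which character serves as the radix character and which serves as the thousands separator. The following sketch makes that explicit; group_digits and format_number are hypothetical helper functions written for this illustration, and the input is assumed to be written with a period as the radix character and to include a fractional part. An internationalized program would take both characters from the locale (the LC_NUMERIC category) instead of passing them as arguments.

```sh
# group_digits DIGITS SEPARATOR
# Hypothetical helper: inserts SEPARATOR between groups of three digits.
group_digits() {
    digits=$1
    sep=$2
    out=
    while [ ${#digits} -gt 3 ]; do
        rest=${digits%???}                  # everything except the last 3 digits
        out=$sep${digits#"$rest"}$out       # prepend separator + last 3 digits
        digits=$rest
    done
    echo "$digits$out"
}

# format_number VALUE RADIX SEPARATOR
# VALUE must look like 1234.56 (period as radix, fractional part present).
format_number() {
    whole=${1%%.*}
    frac=${1#*.}
    echo "$(group_digits "$whole" "$3")$2$frac"
}

format_number 1234.56 . ,     # United States convention
format_number 1234.56 , .     # French convention, as shown above
```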

Here are some sample formats for monetary items:

$1,234.56 (United States, dollars)

kr1.234,56 (Norway, krone)

SFrs.1,234.56 (Switzerland, Swiss francs)

Note that some formats for monetary amounts include more than two places for fractional digits.

C.2.4 Messages

Programs are sometimes written with English messages embedded in the program itself. In an internationalized program, messages are kept in a separate file and replaced in the program with calls to a messaging system. Messages kept in a separate file can be translated and made available to the program. When translated messages are available, users can interact with the system in their native language.

C.2.5 Yes/No Prompts

Many programs ask questions that need a positive or negative response. Those programs typically look for the English string literals y or yes, n or no. An internationalized program lets users enter the characters or words that are appropriate to their language. For example, a French user should be able to enter o or oui.
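As a rough illustration, a script could select the accepted affirmative answers by language instead of testing only for y or yes. The function is_affirmative and its hard-coded patterns are hypothetical; an internationalized program would obtain the affirmative and negative response strings from the locale (the LC_MESSAGES category) rather than listing them in the script.

```sh
# is_affirmative LANGUAGE ANSWER
# Hypothetical helper: succeeds if ANSWER is an affirmative reply in LANGUAGE.
# Only French (fr) and English (en) are covered in this sketch.
is_affirmative() {
    case $1:$2 in
        fr:[oO]|fr:[oO][uU][iI]) return 0 ;;
        en:[yY]|en:[yY][eE][sS]) return 0 ;;
        *)                       return 1 ;;
    esac
}

if is_affirmative fr oui; then
    echo "oui is accepted for French"
fi
if ! is_affirmative fr yes; then
    echo "yes is not accepted for French"
fi
```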

C.3 Determining Whether a Locale Has Been Set

If your system is functioning in accordance with the language and conventions of your country, you can assume that the locale has been set correctly. If you are not sure whether your locale has been set, enter the locale command to display the current settings of the locale environment variables. For example:

% locale

LANG=fr_FR.ISO8859-1
LC_COLLATE="fr_FR.ISO8859-1"
LC_CTYPE="fr_FR.ISO8859-1"
LC_MONETARY="fr_FR.ISO8859-1"
LC_NUMERIC="fr_FR.ISO8859-1"
LC_TIME="fr_FR.ISO8859-1"
LC_MESSAGES="fr_FR.ISO8859-1"
LC_ALL=

The locale environment variables, described in Section C.4.1, define the locale names used for messages, collation, codeset, numeric formats, monetary formats, date and time formats, and yes/no responses:

LANG

LC_COLLATE

LC_CTYPE

LC_NUMERIC

LC_MONETARY

LC_TIME

LC_MESSAGES

LC_ALL

In most cases, only the LANG variable has been set to a locale name, which then applies to other locale variables with the exception of LC_ALL.
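The locale command reports a value for every category, including values that are only implied by LANG (in output like the sample above, quotation marks typically mark such implied values). If you want to see which variables are actually defined in your own environment, a short Bourne shell loop is one way to check. The function name show_locale_vars is hypothetical; it is not a system command.

```sh
# show_locale_vars: hypothetical helper that reports, for each locale
# environment variable, whether it is set in the current shell.
show_locale_vars() {
    for var in LANG LC_COLLATE LC_CTYPE LC_MONETARY LC_NUMERIC LC_TIME LC_MESSAGES LC_ALL
    do
        eval "value=\${$var:-}"      # read the variable whose name is in $var
        if [ -n "$value" ]; then
            echo "$var is set to $value"
        else
            echo "$var is not set"
        fi
    done
}

show_locale_vars
```

If only LANG is reported as set, the other categories take their values from it.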

C.4 Setting a Locale

When you specify a locale, you specify a locale name that indicates language, territory, and codeset. On Digital UNIX systems, locale names adhere to the following format:

lang_terr.codeset

lang: Is a 2-letter, lowercase abbreviation for the language name. The abbreviations are specified in ISO 639 Code for the Representation of Names of Languages, for example: en (English), fr (French), de (German, from "Deutsch"), ja (Japanese).

terr: Is a 2-letter, uppercase abbreviation for the territory name. The abbreviations are specified in ISO 3166 Codes for the Representation of Names of Countries, for example: US (United States), NL (the Netherlands), FR (France), DE (Germany, from "Deutschland"), JP (Japan).

codeset: Is a string that identifies the codeset, for example: ISO8859-1 (ISO 8859/1), SJIS (Shift Japanese Industrial Standard), AJEC (Advanced Japanese EUC).

Full locale names include: en_US.ISO8859-1 (English, incorporating customs for the United States), fr_FR.ISO8859-1 (French, incorporating customs for France), de_DE.ISO8859-1 (German, incorporating customs for Germany).
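The three fields of a locale name can be separated with standard shell parameter expansion, which may be useful in scripts that need to know, for example, only the codeset. The following is a minimal sketch; parse_locale is a hypothetical helper, not a system command, and it handles only names in the lang_terr.codeset format described above (for a name such as C, the territory and codeset fields are simply empty).

```sh
# parse_locale NAME
# Hypothetical helper: splits a name of the form lang_terr.codeset.
parse_locale() {
    name=$1
    codeset=
    case $name in
        *.*) codeset=${name#*.}     # text after the first period
             name=${name%%.*} ;;    # text before it
    esac
    terr=
    case $name in
        *_*) terr=${name#*_}        # text after the first underscore
             name=${name%%_*} ;;    # text before it
    esac
    lang=$name
    echo "lang=$lang terr=$terr codeset=$codeset"
}

parse_locale fr_FR.ISO8859-1
parse_locale de_CH.ISO8859-1
```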

A locale can be set by the system administrator or by an individual user. If your system administrator sets the locale at your site, it is likely that a default locale has been specified for all systems, including yours. You can override this default for your own sessions.

To set a locale, you assign a locale name to one or more environment variables. The easiest way to do this is to assign a locale name to the LANG environment variable because this variable covers all the pieces of a locale (codeset, collating sequence, numeric, monetary, and date and time formats, messages, and so forth).

Table C-1 lists the locales available when you install the subset, Single-byte European Locales. Additional locales may be available if language-variant software for the operating system is installed on your system.

Table C-1: Locale Names

Language	Country	Codeset	Locale Name
-	-	ASCII	C
-	-	ASCII	POSIX
Danish	Denmark	Latin-1	da_DK.ISO8859-1
German	Switzerland	Latin-1	de_CH.ISO8859-1
German	Germany	Latin-1	de_DE.ISO8859-1
Greek	Greece	Latin/Greek	el_GR.ISO8859-7
English	Great Britain	Latin-1	en_GB.ISO8859-1
English	United States	Latin-1	en_US.ISO8859-1
Spanish	Spain	Latin-1	es_ES.ISO8859-1
Finnish	Finland	Latin-1	fi_FI.ISO8859-1
French	Belgium	Latin-1	fr_BE.ISO8859-1
French	Canada	Latin-1	fr_CA.ISO8859-1
French	Switzerland	Latin-1	fr_CH.ISO8859-1
French	France	Latin-1	fr_FR.ISO8859-1
Italian	Italy	Latin-1	it_IT.ISO8859-1
Dutch	Belgium	Latin-1	nl_BE.ISO8859-1
Dutch	The Netherlands	Latin-1	nl_NL.ISO8859-1
Norwegian	Norway	Latin-1	no_NO.ISO8859-1
Portuguese	Portugal	Latin-1	pt_PT.ISO8859-1
Swedish	Sweden	Latin-1	sv_SE.ISO8859-1
Turkish	Turkey	Latin-5	tr_TR.ISO8859-9

The C locale is the default if no locales are set on your system. The POSIX locale is equivalent to the C locale; only letters in the English alphabet are included in the ASCII codeset that is specified for the POSIX and C locales.

C.4.1 Locale Categories

Table C-2 describes environment variables that influence locale functions.

Table C-2: Environment Variables That Influence Locale Functions

Variable	Description
`LC_COLLATE`	Specifies the collating sequence to use when sorting strings and when character ranges occur in patterns.
`LC_CTYPE`	Specifies the character classification (codeset) information.
`LC_MONETARY`	Specifies monetary formats.
`LC_NUMERIC`	Specifies numeric formats.
`LC_MESSAGES`	Specifies the language in which messages will appear if translations are available. In addition, this variable specifies strings for affirmative and negative responses.
`LC_TIME`	Specifies date and time formats.
`LC_ALL`	Overrides all preceding variables and the `LANG` environment variable. In general, this variable is used only in programs and should not be set by system managers and users. See Section C.4.2.3 for more information.

As is true for the LANG variable, all of the variables in Table C-2 can be assigned locale names. Consider the case where your company is located in the United States but the prevalent language spoken by employees is Spanish. The LANG environment variable could be set to the name of a Spanish language locale and the LC_NUMERIC and LC_MONETARY variables set to the name of a United States English locale. The explicit setting of the LC_NUMERIC and LC_MONETARY variables overrides what they were implicitly set to by LANG. The LC_CTYPE, LC_MESSAGES, LC_TIME, and LC_COLLATE variables would still be implicitly set to the Spanish locale. The following are the variable assignments for the C shell to implement this example:

setenv LANG es_ES.ISO8859-1
setenv LC_NUMERIC en_US.ISO8859-1
setenv LC_MONETARY en_US.ISO8859-1

The following are the same variable assignments for the Bourne and Korn shells:

LANG=es_ES.ISO8859-1
export LANG
LC_NUMERIC=en_US.ISO8859-1
export LC_NUMERIC
LC_MONETARY=en_US.ISO8859-1
export LC_MONETARY
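The way these settings combine can be summarized as a lookup order: LC_ALL if it is set, otherwise the category variable, otherwise LANG, otherwise the C locale. The following sketch expresses that order as a shell function. The name effective_locale is hypothetical, and the function only mimics the order described in this appendix using the values passed to it; it does not query the system.

```sh
# effective_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Hypothetical helper: prints the locale name that would apply to a category.
# Pass an empty string for any variable that is not set.
effective_locale() {
    if [ -n "$1" ]; then
        echo "$1"            # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        echo "$2"            # explicit category setting overrides LANG
    elif [ -n "$3" ]; then
        echo "$3"            # LANG covers categories not set explicitly
    else
        echo C               # default when nothing is set
    fi
}

# Values from the Spanish-language example above:
effective_locale "" en_US.ISO8859-1 es_ES.ISO8859-1    # LC_NUMERIC
effective_locale "" ""              es_ES.ISO8859-1    # LC_TIME
```

To apply the function to your current environment for one category, you could call it as effective_locale "${LC_ALL:-}" "${LC_NUMERIC:-}" "${LANG:-}".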

Sometimes different versions of the same locale are available locally to meet the needs of certain languages or software applications. The names of such locales end with the at sign (@) plus a modifier field. For example, the collating sequence used for the telephone book in some languages is different from the collating sequence used for dictionaries. If the standard locale for a language defined the dictionary collating sequence, another version of the locale might exist to support the telephone book collating sequence. In this case, the alternative locale version might have a name like fr_FR.ISO8859-1@phone.

C.4.2 Limitations of Locale Settings

The ability to set locale allows you to tailor your environment, but it does not protect you from making mistakes. The following sections discuss problems that can arise when you define locale variables.

C.4.2.1 Locale Settings Are Not Validated

There is nothing to prevent you from defining implausible combinations of locale names for different aspects of a locale. For example, you could set the LANG environment variable to a French locale and the LC_CTYPE variable to a Greek locale. The results would probably be undesirable; for example, French message translations would likely contain accented characters that are not defined in the ISO 8859/7 codeset of the Greek locale. If you define locale variables in addition to LANG, you are responsible for ensuring a valid combination of locale settings.

C.4.2.2 File Data Is Not Bound to a Locale

The system has no way of knowing what locale was set when a file was created. Therefore, the system cannot prevent you from processing the file's data using a different locale. For example, suppose you copy to your system a file that was created when the LANG variable was set to a German locale. If, on your system, LANG is set to a French locale and you use the grep command to search for a string in the file, the grep command will use French collation and pattern matching rules on the German data. It is therefore your responsibility to know what kind of language data a file contains and to set the locale accordingly.

C.4.2.3 Setting LC_ALL Overrides All Other Locale Variables

The LC_ALL variable overrides all other locale-dependent environment variables, even if you set it before setting category-specific variables, such as LC_COLLATE. The only way to cancel the influence of LC_ALL is to undefine the variable. For example, in the C shell, enter the command unsetenv LC_ALL; in the Bourne and Korn shells, enter the command unset LC_ALL.

The LC_ALL variable is available for users familiar with the System V environment. In that environment, users set locale either by setting LC_ALL or by setting all the locale category variables individually.